.. _cbook_agent_memory:

Memory Management Strategies
++++++++++++++++++++++++++++++

Building on Examples 01–03, this example demonstrates memory management for long-running agentic loops. As a conversation grows over many LLM turns, an agent can manage its history with one of three strategies: sliding window (drop old turns), summarization (compress old turns with a dedicated small LLM), or keep-all. The example also runs the summarizer LLM on a separate GPU partition to avoid contention with the main reasoning model.

**Prerequisites:** Read ``03_hitl_approval.py`` first.

**What you'll learn:**

* How to configure memory strategies for different agents
* ``SLIDING_WINDOW``: drop old turns, keep the last N; lowest cost
* ``SUMMARIZE``: keep the last N turns plus a summary of older ones (via a separate LLM)
* ``FULL``: keep everything (suitable for short-lived agents only)
* How to set up a dedicated summarizer LLM on a separate node
* How to balance context preservation with token usage and latency

**Architecture:**

* Node 0, GPUs [0,1]: Main inference service (large reasoning LLM, tp_size=2)
* Node 1, GPU [0]: Summarizer service (small LLM, tp_size=1)

Main Code
=========

Below is the complete example:

.. literalinclude:: ../../examples/dragon_ai/ai_agent/04_memory.py
    :language: python
    :linenos:
    :caption: **04_memory.py: Memory strategies and dedicated summarizer**

Key Concepts
============

**SLIDING_WINDOW:**
Keep the most recent ``max_kept_turns`` conversation turns. Older turns are dropped and replaced with a synthetic "earlier context" note. This has the lowest memory and token cost and suits short-term reasoning tasks.

**SUMMARIZE:**
Keep the last ``max_kept_turns`` turns in full detail. Older turns are compressed into a summary (generated by a dedicated small LLM) that is inserted before the kept turns. Good for long-running workflows where the full history matters but token usage and latency must stay under control.

**FULL:**
Keep all conversation turns from the start. Suitable only for short-lived agents, because the history will eventually exceed the context window. Good for understanding agent reasoning in logs.

**Dedicated Summarizer LLM:**
Running the summarizer on a separate GPU partition (different node, smaller model) keeps the main reasoning model free of contention. This improves throughput and latency on the critical reasoning path.

Installation
============

See Example 01 (same dependencies).

System Description
===================

Tested on HPE Cray EX:

* **Node 0** (main inference): 2 NVIDIA A100 GPUs
* **Node 1** (summarizer): 1 NVIDIA A100 GPU
* Total: 2 compute nodes required

How to Run
==========

**Step 1: Edit model paths**

Open ``04_memory.py`` and set:

* ``MODEL_NAME`` (main reasoning model, e.g., 70B Llama)
* ``SUMMARIZER_MODEL_NAME`` (small model, e.g., 7B Llama)

**Step 2: Allocate nodes**

.. code-block:: console

    salloc --nodes=2 --exclusive

**Step 3: Run**

.. code-block:: console

    dragon 04_memory.py

**Example output:**

.. code-block:: console

    $ dragon 04_memory.py
    Node 0: Starting main inference service (70B model, tp_size=2)
    Node 1: Starting summarizer service (7B model, tp_size=1)
    Planner (SLIDING_WINDOW): Turn 1, kept_turns=1/5
    Planner (SLIDING_WINDOW): Turn 2, kept_turns=2/5
    ...
    Planner (SLIDING_WINDOW): Turn 6, kept_turns=5/5 (dropped turn 1)
    Runner (SUMMARIZE): Turn 1, kept_turns=1/10
    ...
    Runner (SUMMARIZE): Turn 11, kept_turns=10/10 + summary of turns 1-5
    All agents completed

**For HITL approval (in a second terminal):**

.. code-block:: console

    python -m dragon.ai.agent.hitl_client --tcp HOST:PORT

Next Steps
==========

* **05: MCP Tools** (integrate remote Model Context Protocol servers)
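
For readers who want to see the trimming logic of the three strategies under Key Concepts in isolation, the following is a minimal, library-independent sketch. It does not use the Dragon agent API: the ``MemoryStrategy`` enum, the ``apply_memory`` function, and the ``summarize`` callable are hypothetical names introduced only for illustration, and the wording of the "earlier context" note is an assumption. Only ``max_kept_turns`` and the strategy names come from the example itself.

```python
from enum import Enum
from typing import Callable, List, Optional


class MemoryStrategy(Enum):
    # Illustrative enum; the names mirror the strategies in this example,
    # but this is not the type defined by the Dragon agent API.
    SLIDING_WINDOW = "sliding_window"
    SUMMARIZE = "summarize"
    FULL = "full"


def apply_memory(
    turns: List[str],
    strategy: MemoryStrategy,
    max_kept_turns: int,
    summarize: Optional[Callable[[List[str]], str]] = None,
) -> List[str]:
    """Return the history that would be sent to the LLM on the next turn."""
    if max_kept_turns < 1:
        raise ValueError("max_kept_turns must be at least 1")

    # FULL keeps everything; the other strategies do nothing until the
    # history grows beyond the window.
    if strategy is MemoryStrategy.FULL or len(turns) <= max_kept_turns:
        return list(turns)

    older = turns[:-max_kept_turns]
    kept = turns[-max_kept_turns:]

    if strategy is MemoryStrategy.SLIDING_WINDOW:
        # Older turns are discarded; a synthetic note marks the gap.
        note = f"[earlier context: {len(older)} turn(s) dropped]"
    else:
        # SUMMARIZE: older turns are compressed by a caller-supplied function
        # (a stand-in for the call to the dedicated summarizer LLM).
        if summarize is None:
            raise ValueError("SUMMARIZE requires a summarize callable")
        note = summarize(older)

    # The note or summary is placed before the turns kept in full detail.
    return [note] + kept


turns = [f"turn {i}" for i in range(1, 7)]  # six turns so far

windowed = apply_memory(turns, MemoryStrategy.SLIDING_WINDOW, max_kept_turns=5)
summarized = apply_memory(
    turns,
    MemoryStrategy.SUMMARIZE,
    max_kept_turns=5,
    summarize=lambda old: f"[summary of {len(old)} turn(s)]",
)
full = apply_memory(turns, MemoryStrategy.FULL, max_kept_turns=5)
```

With six turns and a window of five, ``windowed`` and ``summarized`` each hold one leading note followed by turns 2 through 6, while ``full`` is unchanged. In a real deployment the ``summarize`` callable would send the older turns to the summarizer service; here a lambda stands in for it so the sketch runs without a model.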