.. _cbook_agent_memory:

Memory Management Strategies
++++++++++++++++++++++++++++++

Building on Examples 01–03, this example demonstrates memory management for
long-running agentic loops. As conversations grow with many LLM turns, agents
can manage history via different strategies: sliding window (drop old turns),
summarization (compress via a dedicated small LLM), or keep-all. The example
also shows a dedicated summarizer LLM on a separate GPU partition to avoid
contention with the main reasoning model.

**Prerequisites:** Read ``03_hitl_approval.py`` first.

**What you'll learn:**

* How to configure memory strategies for different agents
* ``SLIDING_WINDOW`` — drop old turns, keep last N; lowest cost
* ``SUMMARIZE`` — keep last N turns + a summary of older ones (via separate LLM)
* ``FULL`` — keep everything (suitable for short-lived agents only)
* How to set up a dedicated summarizer LLM on a separate node
* How to balance context preservation with token usage and latency

**Architecture:**

* Node 0, GPUs [0,1]: Main inference service (large reasoning LLM, tp_size=2)
* Node 1, GPU [0]: Summarizer service (small LLM, tp_size=1)

Main Code
=========

Below is the complete example:

.. literalinclude:: ../../examples/dragon_ai/ai_agent/04_memory.py
    :language: python
    :linenos:
    :caption: **04_memory.py: Memory strategies and dedicated summarizer**


Key Concepts
============

**SLIDING_WINDOW:**

Keep the most recent ``max_kept_turns`` conversation turns. Drop older turns
and replace with a synthetic "earlier context" note. Lowest memory and token
cost; suitable for short-term reasoning tasks.

**SUMMARIZE:**

Keep the last ``max_kept_turns`` turns in full detail. Older turns are
compressed into a summary (generated by a dedicated small LLM) inserted
before the kept turns. Good for long-running workflows where full history
matters but you want to control latency.

**FULL:**

Keep all conversation turns from the start. Suitable only for short-lived
agents (will run out of context window eventually). Good for understanding
agent reasoning in logs.

**Dedicated Summarizer LLM:**

By running the summarizer on a separate GPU partition (different node, smaller
model), the main reasoning agent stays uncontended. This improves throughput
and latency of the critical reasoning path.

Installation
============

See Example 01 (same dependencies).

System Description
===================

Tested on HPE Cray EX:

* **Node 0** (main inference): 2 Nvidia A100 GPUs
* **Node 1** (summarizer): 1 Nvidia A100 GPU
* Total: 2 compute nodes required

How to Run
==========

**Step 1: Edit model paths**

Open ``04_memory.py`` and set:

* ``MODEL_NAME`` (main reasoning model, e.g., 70B Llama)
* ``SUMMARIZER_MODEL_NAME`` (small model, e.g., 7B Llama)

**Step 2: Allocate nodes**

.. code-block:: console

    salloc --nodes=2 --exclusive

**Step 3: Run**

.. code-block:: console

    dragon 04_memory.py

**Example output:**

.. code-block:: console

    $ dragon 04_memory.py
    Node 0: Starting main inference service (70B model, tp_size=2)
    Node 1: Starting summarizer service (7B model, tp_size=1)
    Planner (SLIDING_WINDOW): Turn 1, kept_turns=1/5
    Planner (SLIDING_WINDOW): Turn 2, kept_turns=2/5
    ...
    Planner (SLIDING_WINDOW): Turn 6, kept_turns=5/5 (dropped turn 1)
    Runner (SUMMARIZE): Turn 1, kept_turns=1/10
    ...
    Runner (SUMMARIZE): Turn 11, kept_turns=10/10 + summary of turns 1-5
    All agents completed

**For HITL approval (in a second terminal):**

.. code-block:: console

    python -m dragon.ai.agent.hitl_client --tcp HOST:PORT

Next Steps
==========

* **05 — MCP Tools** (integrate remote Model Context Protocol servers)