live essay #agents #evals #long-context

Context rot: why your agent forgets what you said at turn 40

Long-running model loops look great on the benchmark and quietly collapse in real loops. A small forensic look at where attention is actually spent.

Every six months somebody publishes a paper that says the long-context window is solved, and every six months I write the same blog post in reply. It is not solved. It is better. The benchmark numbers go up; the product, in the parts that matter, gets quietly worse.

I have been running the same boring agent loop on the back of the benchmarks, and quality still collapses in real loops. What follows is a small forensic look at where attention is actually spent.

The symptom

In a long single-shot document, recall is fine. Given 200k tokens of a novel, the model can tell you about a sentence on page 412. The same models, at the same context length, get a 38% pass rate on a 40-turn agent loop that fits comfortably in the budget. The recall is in the headroom. The attention isn’t.

Now replace that document with the transcript of a 40-turn agent loop (tool calls, tool results, retries, deflections, the works) and the recall drops to single digits. Same model, same total context length (≈180k tokens). Only the structure differs.

The benchmark stays green; the product gets quietly worse.

A minimal repro

If you want to feel this for yourself, the smallest harness fits on a screen. Pin a fact in turn 2, then ask for it again N turns later. The probe never references the fact directly.

# evalbench/loop_rot.py — minimal context-rot harness.
# Pin a fact in turn 2; ask for it again N turns later.
# The probe: never reference the fact directly.

from evalbench import Agent, ToolMock, Trace

agent  = Agent(model="opus-4.5", temperature=0.0, seed=7)
tools  = ToolMock.from_yaml("tools/mock.yaml")

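for depth in (10, 25, 40, 55):
    # Start each trace with the pinned preference and the first request.
    t = Trace(
        "user prefers metric units, never imperial.",
        "Plan a hike. Include distances.",
    )
    # Advance the loop by `depth` turns against the mocked tools.
    for _ in range(depth):
        t.step(agent, tools)
    # Probe after the filler turns and print the raw answer for inspection.
    answer = t.ask("Plan a hike. I'm in Europe and only do metric.")
    print(depth, answer)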

What the numbers say

Across 1,200 traces, recall drops monotonically with turn depth on every frontier model we tested. The slope is steeper for agents that take more tool calls per turn — which is most of them.

model	turns 1–10	turns 11–25	turns 26–40	turns 41+
Claude Opus 4.5	95%	88%	71%	51%
GPT-5	92%	79%	58%	38%
Llama-4 70B	84%	66%	41%	22%
Qwen3-Coder-32B	81%	62%	38%	19%

The benchmark, of course, looks like the first column.
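If you want to rebuild a table like this from your own traces, the bucketing is the only fiddly part. Here is a minimal sketch, assuming each trace has already been reduced to a (depth, recalled) pair. The names `bucket_for` and `recall_by_bucket` and the pair format are illustrative and not part of evalbench; the bucket edges are the ones in the table header.

```python
from collections import defaultdict

# Bucket edges copied from the table header; the last bucket is open-ended.
BUCKETS = [(1, 10, "1-10"), (11, 25, "11-25"), (26, 40, "26-40")]


def bucket_for(depth: int) -> str:
    """Map a turn depth to its table column label."""
    if depth < 1:
        raise ValueError("depth must be >= 1")
    for low, high, label in BUCKETS:
        if low <= depth <= high:
            return label
    return "41+"


def recall_by_bucket(results):
    """results: iterable of (depth, recalled) pairs, where recalled is a bool.

    Returns {bucket_label: fraction of traces that recalled the pinned fact}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for depth, recalled in results:
        label = bucket_for(depth)
        totals[label] += 1
        hits[label] += int(recalled)
    return {label: hits[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Toy data, not measurements.
    toy = [(10, True), (10, False), (25, True), (55, False)]
    print(recall_by_bucket(toy))
```

Buckets with no traces are simply absent from the result, which is usually what you want when a depth was never sampled.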

Why loops are worse than long files

A novel is mostly prose. An agent loop is mostly tool calls, mostly noise. The model isn’t being asked to remember a sentence in a sea of sentences; it is being asked to remember a sentence in a sea of tool_result blocks that all look broadly similar to each other. Attention is not infinite, and even when it is allocated, the routing isn’t free.
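One rough way to put a number on "broadly similar" is the mean pairwise similarity between the tool_result payloads in a transcript. The sketch below uses `difflib.SequenceMatcher` from the standard library. The transcript shape (a list of dicts with "type" and "content" keys) is an assumption made for illustration, and this is a crude character-level proxy, not a measurement of anything inside the model.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(blocks):
    """Average SequenceMatcher ratio over all pairs of strings (0.0 to 1.0)."""
    pairs = list(combinations(blocks, 2))
    if not pairs:
        return 0.0  # fewer than two blocks: nothing to compare
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def tool_result_similarity(transcript):
    """transcript: list of {"type": ..., "content": ...} dicts (assumed shape)."""
    results = [m["content"] for m in transcript if m["type"] == "tool_result"]
    return mean_pairwise_similarity(results)
```

The comparison is quadratic in the number of blocks, so on long traces it is worth sampling pairs rather than scoring all of them.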

There is a turn depth, somewhere around 25 in our setup, where the model effectively starts to lose access to the early turns. It does not say so. It is exactly the failure mode you’d build a benchmark to never catch.

# boring filler turns
... 40 turns later ...

>>> user
"Plan a hike. I'm in Europe and only do metric."

<<< assistant
"Day 1: 8.4 km along the Camí de Cavalls..."
"Day 7 covers approximately 12 miles of coastal trail."

The drift from km → miles inside a single response is the canonical tell.
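Spotting that tell by eye does not scale to a thousand traces. A minimal sketch of an automated check follows, assuming the pinned constraint is "metric only" as in the example above. The function name `unit_drift` and the unit lists are illustrative and deliberately small; a real check would need more units and probably a proper tokenizer.

```python
import re

# Hypothetical unit patterns; extend for your own domain.
# Each pattern requires a number directly before the unit.
METRIC = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:km|kilomet(?:er|re)s?|m|met(?:er|re)s?)\b",
    re.IGNORECASE,
)
IMPERIAL = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:mi|miles?|ft|feet|foot|yards?)\b",
    re.IGNORECASE,
)


def unit_drift(answer: str) -> dict:
    """Return the metric and imperial quantities found in an answer.

    'drifted' is True when any imperial quantity appears, since the
    pinned constraint in this setup forbids them outright.
    """
    metric = METRIC.findall(answer)
    imperial = IMPERIAL.findall(answer)
    return {"metric": metric, "imperial": imperial, "drifted": bool(imperial)}


if __name__ == "__main__":
    sample = (
        "Day 1: 8.4 km along the Camí de Cavalls... "
        "Day 7 covers approximately 12 miles of coastal trail."
    )
    print(unit_drift(sample))
```

A response that contains both lists non-empty is the mixed-unit case from the transcript; a response with only imperial units is a plainer failure of the same constraint.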

Three mitigations that actually move the needle

Most of the usual advice is folk-medicine. The three things that helped in our setup, ranked by effect size:

1. Periodic summaries. Every 8 turns, rewrite the running context into a compact state object. Boring; very effective.
2. Pinned <system> reminders. Re-state hard constraints in the system block, not in turn-2 user text. The model treats them differently.
3. Smaller tool surface. Loops that have 14 tools available rot faster than loops with 4. Even when only 4 are ever used.

Of these, summarisation is the boring one, and it is the one that actually works. We see ~30 lines of static summary recover most of the lost recall on traces up to ~60 turns.
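For concreteness, here is a minimal sketch of how the three mitigations might fit together in a loop. Everything in it is an assumption made for illustration: `LoopContext`, `restrict_tools`, and the `summarize` callable are hypothetical names, and in practice `summarize` would be a model call rather than the stub shown in the usage example. The only number taken from the text is the 8-turn cadence.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

SUMMARY_EVERY = 8  # the summary cadence described in the text


@dataclass
class LoopContext:
    constraints: List[str]                      # hard constraints, pinned in the system block
    summarize: Callable[[str, List[str]], str]  # hypothetical: (old_state, turns) -> new_state
    state: str = ""                             # compact running summary
    recent: List[str] = field(default_factory=list)
    turns_seen: int = 0

    def add_turn(self, text: str) -> None:
        """Record a turn; every SUMMARY_EVERY turns, fold the raw turns into state."""
        self.recent.append(text)
        self.turns_seen += 1
        if self.turns_seen % SUMMARY_EVERY == 0:
            self.state = self.summarize(self.state, self.recent)
            self.recent = []  # raw turns are dropped once summarised

    def system_block(self) -> str:
        """Constraints are re-stated on every call rather than left in turn 2."""
        lines = ["Hard constraints:"] + [f"- {c}" for c in self.constraints]
        if self.state:
            lines += ["", "State so far:", self.state]
        return "\n".join(lines)


def restrict_tools(tools: Dict[str, object], allowed: List[str]) -> Dict[str, object]:
    """Expose only the tools the loop actually needs."""
    return {name: tools[name] for name in allowed if name in tools}


if __name__ == "__main__":
    # Stub summariser for demonstration only.
    ctx = LoopContext(
        constraints=["metric units only"],
        summarize=lambda state, turns: f"{state}[{len(turns)} turns folded]",
    )
    for i in range(9):
        ctx.add_turn(f"turn {i}")
    print(ctx.system_block())
    print(ctx.recent)
```

The design choice worth noting is that the constraints never pass through the summariser. They live in their own field and are re-emitted verbatim, so a lossy summary cannot drop them.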

What we still don’t know

We still don’t have a clean theory of why attention degrades unevenly across the loop. We have hypotheses — repeated tool_result blocks induce something like representational collapse for similar spans; the pre-training distribution doesn’t include 40-turn dialogues with tool noise. Neither is a proof.

What we can say is that the failure is reproducible, structural, and predictably worse in the configurations that actual production agents use.

Until that changes, “long context” in the benchmark sense and “long context” in the product sense are different products.