I spent a few days reading everything I could find on context engineering. Eight articles, two arXiv papers, one Thoughtworks Radar entry, and Anthropic's own engineering blog. Somewhere around article five I noticed something: the discipline already existed -- people were writing about specific techniques -- but nobody had organized them into a single picture of how they fit together.
So this is that. Not a survey, not a "comprehensive guide" -- a map. The kind of thing I'd have wanted at the start of those few days.
What is context engineering?
Anthropic defines it as the deliberate curation of every token that goes into the context window. Their full framing, on that engineering blog, is worth reading once, but the shift it describes is simple: you're not trying to write a better instruction anymore. You're managing a resource budget. Prompt engineering asked "how do I phrase this?" Context engineering asks "what should even be in here, and what should I throw out?"
That reframe is load-bearing for everything else in this post.
Context rot
Before getting into techniques, it helps to understand why this matters at all. Chroma published research on what they call context rot: the finding that LLMs don't degrade gracefully as input length grows. They extended the Needle-in-a-Haystack benchmark to use semantic matching rather than just lexical lookup, and found that performance doesn't taper -- it collapses, non-uniformly, even on models marketed as long-context capable.
The problem isn't that the model can't fit more tokens. It's that it can't reliably use them. Every technique in the rest of this post is a response to that.
Three layers cover most of what I found across that reading: budget, structure, and feedback. They don't depend on each other, but each maps cleanly to a different failure mode.
| Layer | What it controls | Techniques |
|---|---|---|
| Budget | Token count and cost | Prompt Caching, Active Context Compression |
| Structure | What enters the window | Context Graphs, Progressive Context Disclosure |
| Feedback | Self-correction loops | Back Pressure, Dynamic Tool Selection |
Each layer stands on its own. Add one, get a real improvement. Add all three, and the gains stack.
Layer 1: The Budget
Token count and cost. This layer comes first because it's the one that bites you before you've thought about architecture at all -- you hit a $300 bill on a prototype and suddenly the context window isn't an abstract problem anymore.
Prompt Caching
If you're sending a 10,000-token system prompt on every API call and you haven't set up caching, you're paying full price on every single request. The fix is a one-line annotation in your request structure.
The mechanism is cache_control breakpoints. You add them at the last stable block in your prompt -- system prompt, large document, codebase context, whatever doesn't change across requests. Cache reads cost 0.1x the base input token price (90% off), cache writes cost 1.25x. So the payoff depends on your read/write ratio. A system prompt that's static across hundreds of calls? Obvious win. Something that changes on every request? Don't bother.
Practical constraints: you can set up to 4 breakpoints per request. The minimum eligible block is 1,024 tokens (varies by model -- check the docs). TTL is 5 minutes by default with a 1-hour extended option.
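In the Python SDK, the annotation is a `cache_control` field on the stable block. A minimal sketch -- the model alias, file path, and prompt are placeholders, not a recommendation:

```python
import anthropic

client = anthropic.Anthropic()

# The stable block: anything that doesn't change across requests and
# clears the ~1,024-token minimum. The path is illustrative.
BIG_SYSTEM_PROMPT = open("system_prompt.md").read()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; use whatever model you run
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": BIG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # the one-line annotation
        }
    ],
    messages=[{"role": "user", "content": "What changed in the billing module?"}],
)

# First call reports usage.cache_creation_input_tokens (billed at 1.25x);
# repeat calls within the TTL report usage.cache_read_input_tokens (0.1x).
print(response.usage)
```

The break-even math is short: over n requests that land within the TTL, the cached prefix costs 1.25 + 0.1(n - 1) in base-price units against n uncached, so it pays for itself from the second request on.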
Full docs at platform.claude.com/docs/en/build-with-claude/prompt-caching.
Active Context Compression (ACC)
Prompt caching reduces what you pay to send stable content. ACC handles the other problem: what happens to the context window when it fills up mid-run.
Long agent sessions accumulate noise. Tool call scaffolding from 30 steps ago, redundant turn history, intermediate reasoning that was useful once and isn't anymore. The model nominally supports the context length, but as Chroma's context rot research showed, nominal support and actual performance are different things.
ACC is the mechanism where the agent autonomously decides when to compress and what to drop -- no human trigger, no manual checkpoint. It prioritizes high-signal information and cuts noise: redundant history, tool call overhead that's no longer relevant, anything that's taking up tokens without improving retrieval. The agent manages its own window as a budget rather than waiting for an overflow.
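The paper is the source for the mechanism; what follows is just a sketch of its shape, assuming a crude character-count trigger and a cheap summarization call as the compressor. The threshold, model alias, and prompt are illustrative, not from the paper.

```python
import anthropic

client = anthropic.Anthropic()

def estimate_tokens(messages: list[dict]) -> int:
    # Crude trigger heuristic: ~4 characters per token. Fine for deciding
    # when to compress, useless for billing.
    return sum(len(str(m["content"])) for m in messages) // 4

def compress_history(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    # Keep recent turns verbatim; collapse everything older into a summary
    # that preserves decisions, open tasks, and facts.
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages
    summary = client.messages.create(
        model="claude-3-5-haiku-latest",  # illustrative; any cheap model works
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation so an agent can resume it. "
                       "Keep decisions, open tasks, and facts; drop tool-call "
                       "scaffolding and dead-end reasoning:\n\n"
                       + "\n".join(f"{m['role']}: {m['content']}" for m in old),
        }],
    ).content[0].text
    return [{"role": "user", "content": f"[compressed history]\n{summary}"}] + recent

def before_each_step(messages: list[dict], threshold: int = 80_000) -> list[dict]:
    # The agent checks its own budget on every turn -- no human trigger.
    if estimate_tokens(messages) > threshold:
        return compress_history(messages)
    return messages
```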
This is directly applicable to multi-step planning, long RAG sessions, and autonomous agents running across many turns. Background reading: arxiv.org/pdf/2601.07190.
Layer 2: The Structure
The Budget layer tells you how much fits. This layer tells you what goes in. Token count matters, but a window stuffed with loosely relevant content still performs worse than a smaller, targeted one. Structure before stuffing.
Context Graphs
A knowledge graph built for humans is an exhaustive store. Every relationship, every entity, every edge -- comprehensive by design, because humans navigate it selectively and tolerate the noise. Context Graphs are different. They're query-driven subgraph extractions: you take a knowledge graph and pull only the slice that's actually relevant to the current prompt, formatted specifically for AI consumption.
The efficiency case is concrete. TrustGraph cites 45 tokens to represent the same information that a verbose prose explanation takes 150 tokens to convey -- roughly 70% reduction. That's not a rounding difference. On a long agentic session, it compounds fast.
Two implementation modes: GraphRAG (schema-free, built from unstructured text -- useful when you don't have a formal ontology) and Ontology RAG (schema-driven, using OWL ontologies -- higher precision, more upfront work). Both add provenance tracking and confidence scores alongside the facts. That's not just a nice-to-have: grounding the model's claims in an explicit source structure cuts hallucination because the model isn't filling in gaps from its weights.
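A toy version of the extraction step, using networkx -- not TrustGraph's API, just the shape of the idea: start from the entities in the query, take their neighborhood, and render only that slice as compact triples. Entity names and the graph itself are made up.

```python
import networkx as nx

def extract_context_subgraph(graph: nx.MultiDiGraph, query_entities: set[str],
                             hops: int = 1) -> str:
    # Pull only the slice relevant to the query, not the whole store.
    relevant: set[str] = set()
    for entity in query_entities & set(graph.nodes):
        relevant |= set(nx.ego_graph(graph, entity, radius=hops, undirected=True))
    # Render as compact triples -- far fewer tokens than prose, with
    # provenance-style metadata carried along.
    lines = []
    for subj, obj, data in graph.subgraph(relevant).edges(data=True):
        line = f"({subj}) -[{data.get('relation', 'related_to')}]-> ({obj})"
        if "confidence" in data:
            line += f"  # confidence={data['confidence']}"
        lines.append(line)
    return "\n".join(lines)

g = nx.MultiDiGraph()
g.add_edge("OrderService", "PaymentAPI", relation="calls", confidence=0.95)
g.add_edge("PaymentAPI", "Stripe", relation="wraps", confidence=0.90)
print(extract_context_subgraph(g, {"OrderService"}))
# -> (OrderService) -[calls]-> (PaymentAPI)  # confidence=0.95
```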
Full concept breakdown: trustgraph.ai/guides/key-concepts/context-graphs.
Progressive Context Disclosure
The default pattern for agent instructions is: load everything upfront. Full system prompt, all tool descriptions, every edge-case caveat. The window's headroom is half gone before the agent has done anything.
Progressive Context Disclosure flips that. The agent runs a lightweight discovery phase first -- it reads the incoming prompt, figures out what's actually relevant, and only then pulls the detailed instructions for those specific areas. Everything else stays out of the window. Lazy loading for agent instructions.
The problem it solves isn't complexity, it's bloat. When you dump all instructions upfront, the relevant bits compete with dozens of irrelevant ones for the model's attention -- and context rot means that competition isn't trivially won by the important content.
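A minimal sketch of the two-phase pattern. The chunk files, topic names, and classifier prompt are all invented for illustration:

```python
import anthropic

client = anthropic.Anthropic()

# Instructions live as retrievable chunks, not one monolith. Paths illustrative.
INSTRUCTION_CHUNKS = {
    "billing": open("instructions/billing.md").read(),
    "refunds": open("instructions/refunds.md").read(),
    "shipping": open("instructions/shipping.md").read(),
}

def discover_topics(user_prompt: str) -> list[str]:
    # Phase 1: a small, cheap call decides which instruction areas matter.
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # illustrative
        max_tokens=50,
        system="Reply with only a comma-separated subset of: "
               + ", ".join(INSTRUCTION_CHUNKS),
        messages=[{"role": "user", "content": user_prompt}],
    )
    picked = {name.strip() for name in resp.content[0].text.split(",")}
    return [name for name in INSTRUCTION_CHUNKS if name in picked]

def build_system_prompt(user_prompt: str) -> str:
    # Phase 2: only the relevant chunks enter the window; the rest stay out.
    return "\n\n".join(INSTRUCTION_CHUNKS[t] for t in discover_topics(user_prompt))
```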
Thoughtworks rates this "Trial" on their current Radar -- worth piloting, not yet production-mainstream. That's an honest rating. The discovery phase adds latency and requires upfront work to structure your instructions as retrievable chunks rather than one monolith. But for complex agents with wide instruction sets, the window-efficiency gain is real.
Reference: thoughtworks.com/en-de/radar/techniques/progressive-context-disclosure.
Layer 3: The Feedback
The first two layers constrain and shape what goes in. This one closes the loop. The agent can't improve what it can't see -- and if it can't self-correct, you're the error-checker, which defeats the point of running an agent at all.
Back Pressure
An agent that runs autonomously but still depends on you to catch its mistakes isn't autonomous. You're just a delayed feedback step.
The fix is building the feedback into the environment. A build system that catches a syntax error before the output lands in your inbox. A type checker -- Rust's borrow checker, Elm's compiler, a strongly-typed Python codebase with mypy in CI -- that acts as a contract the agent has to satisfy, not a courtesy suggestion. Playwright or Chrome DevTools MCP running UI validation after a code change, not waiting for a human to click around. An OpenAPI schema that flags whether the agent's API call is actually valid.
All of these are back pressure: automated signals that push back on incorrect output before it propagates. The agent iterates against them. You don't.
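What that looks like in the loop, sketched with mypy as the pressure source. `agent_fix` stands in for whatever call produces the agent's next patch; it's a placeholder, not a real API:

```python
import subprocess
from typing import Callable

def type_errors(path: str) -> str:
    # mypy is the contract: returncode 0 means the output satisfies it.
    result = subprocess.run(["mypy", path], capture_output=True, text=True)
    return "" if result.returncode == 0 else result.stdout

def iterate_until_green(path: str, agent_fix: Callable[[str, str], None],
                        max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        errors = type_errors(path)
        if not errors:
            return True   # back pressure satisfied; no human in the loop
        agent_fix(path, errors)  # errors go back to the agent, not your inbox
    return False  # escalate to a human only after automation is exhausted
```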
You stop spending attention on routine correction and can put it toward the tasks that actually need a human -- ones where the feedback loop requires judgment rather than a schema match. Ramnivas Laddad's writeup on this is worth the read: banay.me/dont-waste-your-backpressure.
Dynamic Tool Selection
The MCP ecosystem is growing fast. A production agent setup today might have hundreds of registered tools. In the near term, thousands is a reasonable expectation.
You can't dump all of them into the context window. It's not just token cost -- the model degrades when it has to route through a massive undifferentiated tool list. The right tool competes with 400 irrelevant ones, and context rot means the model doesn't reliably win that contest.
The solution is a pre-invocation filtering layer. Before any tools surface to the LLM, a vector embedding pass runs the user query against tool descriptions and returns the top-N most semantically relevant ones. The model only sees those. Historical usage patterns can also weight what surfaces -- a tool you've actually used in similar past sessions gets a small boost over one that's only semantically adjacent.
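A minimal version of the filtering pass, using sentence-transformers as a stand-in embedding model -- any embedding provider works the same way, and the top-N cutoff is a tuning choice:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def select_tools(query: str, tools: list[dict], top_n: int = 8) -> list[dict]:
    # Embed the query and every tool description, rank by cosine similarity,
    # and surface only the top-N. The LLM never sees the other hundreds.
    query_vec = embedder.encode(query, convert_to_tensor=True)
    desc_vecs = embedder.encode(
        [t["description"] for t in tools], convert_to_tensor=True
    )
    scores = util.cos_sim(query_vec, desc_vecs)[0]
    top = scores.argsort(descending=True)[:top_n]
    return [tools[int(i)] for i in top]
```

The historical-usage weighting mentioned above would slot in as a reweighting of `scores` before the sort; it's left out here to keep the sketch small.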
This is also a feedback loop, just operating one level up. The query shapes which tools appear, which shapes what the agent can even attempt in that session. A well-tuned selection layer quietly expands or narrows the agent's effective capability set based on what it's actually trying to do. The research backing this: arxiv.org/pdf/2509.20386v1.
Both techniques replace you as the feedback mechanism. One operates at tool selection; the other operates at task execution.
None of these techniques are tricks. They're responses to a failure mode that has a name (context rot), a mechanism (non-uniform performance collapse under load), and a growing body of research. Using one layer gets you real improvement. Stacking all three is where agents start behaving like they're supposed to -- reliably, not just most of the time. The habits that make this work aren't different from what makes any engineering discipline tractable: know what you're managing, measure what matters, stop improvising around problems that already have solutions.