
The context layer for agents

How do you give an agent the right context, in the right shape, at the right time, without hand-rolling a system per use case?

The interesting work in AI right now isn't model-level — it's at the layer between the user, their data, and the agent: giving an agent the right context, in the right shape, at the right time, without hand-rolling a system per use case. This is where I think great B2B platform companies will be built.

The naïve version of this job is a system prompt. That's what most teams start with, and it works until the product has to hold more than one person's information in its head. The prompt bloats. Instructions contradict each other. The model starts drifting. The whole thing behaves like a prototype no matter how serious the use case is.

Context engineering is the discipline of treating that layer like a system — not a prompt, not a prompt-plus-a-few-tools, but a system. Five things have to be decided explicitly.

What belongs in the prompt and what belongs elsewhere. Static policy lives in the system prompt. User-specific facts live in memory. Recent conversation lives in the turn window. Business data lives in retrieval. Mixing these creates the class of bug where the model sometimes remembers something it shouldn't and sometimes forgets something it should.
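A minimal sketch of what keeping those four layers separate looks like in code. All names here are illustrative, not a real framework; the point is that each layer has its own store and the context is assembled fresh on every call:

```python
from dataclasses import dataclass, field

@dataclass
class ContextSources:
    system_prompt: str                             # static policy
    memory: dict = field(default_factory=dict)     # user-specific facts
    turns: list = field(default_factory=list)      # recent conversation
    retrieved: list = field(default_factory=list)  # business data

    def assemble(self, max_turns: int = 10) -> list:
        """Build the message list fresh each call; no layer leaks into another."""
        messages = [{"role": "system", "content": self.system_prompt}]
        if self.memory:
            facts = "\n".join(f"- {k}: {v}" for k, v in self.memory.items())
            messages.append({"role": "system", "content": f"Known user facts:\n{facts}"})
        for doc in self.retrieved:
            messages.append({"role": "system", "content": f"Reference:\n{doc}"})
        messages.extend(self.turns[-max_turns:])   # bounded turn window
        return messages
```

The bug class in the text disappears almost by construction: the model can't "remember something it shouldn't" if user facts only ever enter via the memory store, and it can't "forget something it should" if the turn window is an explicit, bounded slice rather than an ever-growing prompt.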

How retrieval decides what's relevant. Most RAG stacks retrieve too much and trust the model to sort it out. That burns tokens, crowds out reasoning, and reliably drops important context on the floor when something novel shows up. The interesting work isn't in embedding choice — it's in ranking, re-ranking, and deciding when not to retrieve at all.
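A sketch of the rank-then-abstain pattern. The toy `score` here is a stand-in for a real re-ranker (a cross-encoder or LLM judge); the structural point is the relevance floor, which makes "retrieve nothing" a legal outcome rather than always stuffing the top-k in:

```python
def score(query: str, doc: str) -> float:
    """Toy lexical-overlap score; a real system would use a re-ranker model."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def select_context(query, candidates, k=3, min_score=0.2):
    """Over-retrieve, re-rank, then keep only what clears the relevance bar."""
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:k] if score(query, d) >= min_score]
    # An empty result is valid: abstaining beats padding the window with noise.
```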

How memory gets written and when it gets trusted. Memory is context that persists between sessions. The hard questions: what's worth writing, what's worth overwriting, what's worth forgetting, and who gets to see it. A memory system without a clear answer to those four is a hallucination cache.
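One way to make those four answers explicit is to encode each as a policy check. This is a sketch, with invented thresholds and a confidence-based overwrite rule standing in for whatever policy actually fits the product:

```python
import time

class MemoryStore:
    def __init__(self, ttl_seconds=86400 * 90):
        self.items = {}          # key -> (value, confidence, owner, written_at)
        self.ttl = ttl_seconds   # what's worth forgetting: stale facts expire

    def write(self, key, value, confidence, owner):
        # What's worth writing: drop low-confidence extractions entirely.
        if confidence < 0.5:
            return False
        # What's worth overwriting: never replace a fact with a shakier one.
        existing = self.items.get(key)
        if existing and existing[1] > confidence:
            return False
        self.items[key] = (value, confidence, owner, time.time())
        return True

    def read(self, key, requester):
        item = self.items.get(key)
        if item is None:
            return None
        value, conf, owner, written_at = item
        if time.time() - written_at > self.ttl:  # forget on read
            del self.items[key]
            return None
        if owner != requester:                   # who gets to see it
            return None
        return value
```

A store that skips any of these checks is the hallucination cache: low-confidence extractions get written, shakier facts clobber better ones, stale facts never die, and one user's facts surface in another's session.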

How tool output becomes context. Tool calls return structured data. The agent reads it as unstructured language. The conversion — schema to prose that preserves meaning without overflowing the window — is where a lot of production quality lives, and it's almost never where teams put their best engineers.
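The conversion step can be as small as this sketch: flatten the schema into compact prose, truncate long lists while saying what was cut, and enforce a hard budget. The budget and list limits are illustrative numbers, not recommendations:

```python
def tool_result_to_prose(result: dict, budget_chars: int = 300) -> str:
    """Render a tool's structured result as compact prose under a size budget."""
    lines = []
    for key, value in result.items():
        if isinstance(value, list):
            shown = ", ".join(str(v) for v in value[:5])
            more = f" (+{len(value) - 5} more)" if len(value) > 5 else ""
            lines.append(f"{key}: {shown}{more}")
        else:
            lines.append(f"{key}: {value}")
    prose = "; ".join(lines)
    if len(prose) > budget_chars:
        prose = prose[:budget_chars - 1] + "…"  # never overflow the window
    return prose
```

The "(+N more)" marker is the part teams skip: silently truncating a list teaches the model to reason over data it doesn't know is incomplete.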

How the whole pile gets compressed when it grows. Sessions stretch. Documents pile up. The window is finite and expensive. Compression strategies — summarization, eviction, hierarchical memory — are where the real engineering effort hides, and the part nobody demos.
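The simplest of those strategies, sketched: keep the recent turns verbatim and collapse everything older into one summary entry. The `summarize` callable is a stand-in for an LLM summarization call; the default just records what was elided:

```python
def compress_history(turns, keep_recent=4, summarize=None):
    """turns: list of (role, text) pairs.
    summarize: callable over the old turns, standing in for an LLM call."""
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    if summarize is None:
        summarize = lambda ts: f"[{len(ts)} earlier turns elided]"
    return [("system", summarize(old))] + recent
```

Hierarchical memory is this same move applied recursively: summaries of summaries, with eviction deciding which layer a fact is allowed to survive in.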

None of this is new. What's new is that these decisions used to be one person's judgment call inside a single application, and now they need to be a reusable platform because every team is making them at once. The companies that figure out how to ship this as infrastructure, not as a feature, are the ones I think capture outsized value.

Two adjacent problems make context harder than it looks.

Evaluation. Most eval suites test the model, not the context. You can have a great model making great decisions on garbage context and pass every eval in the stack. Context-quality evals (is the right stuff retrieved, is memory accurate, is the compression lossy in important places) are the interesting rubric — and nobody has a shared vocabulary for them yet.
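What a context-quality eval looks like, as distinct from a model eval, is a metric over the assembled context itself. A minimal example is retrieval recall against hand-labeled gold chunks; the same shape works for memory accuracy or compression loss:

```python
def retrieval_recall(retrieved_ids, gold_ids):
    """Fraction of the chunks a labeled test case requires that actually
    made it into the assembled context. Scores the context, not the model."""
    if not gold_ids:
        return 1.0  # nothing was required, so nothing is missing
    return len(set(retrieved_ids) & set(gold_ids)) / len(gold_ids)
```

A case can score 1.0 here and still produce a wrong answer, and vice versa, which is exactly why the two eval layers need to be separated: otherwise a strong model papering over missing context passes every test you have.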

Drift. Context systems degrade as data shifts, users accumulate memory, and tools change shape under you. A product that was reliably correct six months ago is often quietly wrong today, and the signal is buried in aggregate metrics that look fine.
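One reason the signal hides in aggregate metrics: slicing by cohort reverses the picture. A sketch, with invented cohort labels:

```python
def sliced_accuracy(records):
    """records: list of (cohort, correct: bool) pairs.
    Returns per-cohort accuracy alongside the aggregate number."""
    by_cohort = {}
    for cohort, correct in records:
        by_cohort.setdefault(cohort, []).append(correct)
    per_cohort = {c: sum(v) / len(v) for c, v in by_cohort.items()}
    overall = sum(correct for _, correct in records) / len(records)
    return per_cohort, overall
```

Eight old users at 100% and two new users at 0% aggregates to a healthy-looking 80% — the dashboard is fine while everyone who joined after the data shifted gets wrong answers.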

The part I'm most interested in is the 0.1% of sessions where context engineering fails silently: the model sounds confident, the answer is subtly wrong, and no eval catches it. That's the product problem that matters at scale. Everything else is tractable. That one isn't, yet.