Manage Conversation Context to Preserve Critical Information
Long-running conversations and multi-turn interactions create a fundamental tension: the context window is finite, but the information that matters keeps growing. The naive approach — progressively summarizing earlier turns — introduces risks that are subtle but severe in production systems.
The Dangers of Progressive Summarization
When you compress earlier conversation turns into summaries, critical specifics tend to disappear. Dollar amounts become "a significant refund," dates become "recently," account numbers become "the customer's account," and percentages become "a large portion." Each round of summarization further degrades precision. For a customer support agent, losing the exact order number, refund amount, or account ID mid-conversation can lead to incorrect actions with real financial consequences.
The "Lost in the Middle" Effect
Research has shown that models process the beginning and end of their context window more reliably than the middle. Information placed in the middle of a long context is more likely to be overlooked or given less weight. This has direct architectural implications for how you structure prompts in long conversations.
Tool Results Accumulate Tokens Disproportionately
In agent systems, tool call results often contain verbose data (full API responses, detailed records) that quickly inflate the context. Most of this data is irrelevant to the ongoing conversation, but it pushes important information further into the middle of the context where it's less reliably processed.
The Solution: Immutable Case Facts
The architectural pattern that addresses all three problems is to extract key transactional facts — customer name, account ID, order numbers, dollar amounts, dates, claim status — into a persistent "case facts" block positioned at the start of each prompt. This block is never summarized, never compressed, and is refreshed with new facts as they emerge. Verbose tool outputs should be trimmed to only the fields relevant to the current task.
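The pattern above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the field names (`order_number`, `refund_amount`) and the block header text are hypothetical choices for the example.

```python
from dataclasses import dataclass, field

@dataclass
class CaseFacts:
    """Transactional facts prepended verbatim to every prompt."""
    facts: dict = field(default_factory=dict)

    def record(self, key: str, value: str) -> None:
        # New facts are added as they emerge; existing facts are never
        # overwritten, summarized, or compressed.
        self.facts.setdefault(key, value)

    def render(self) -> str:
        lines = ["## Case Facts (do not summarize)"]
        lines += [f"- {k}: {v}" for k, v in self.facts.items()]
        return "\n".join(lines)

def build_prompt(case: CaseFacts, conversation: str) -> str:
    # The facts block always leads the prompt, where models process
    # context most reliably.
    return case.render() + "\n\n" + conversation

case = CaseFacts()
case.record("order_number", "ORD-48213")
case.record("refund_amount", "$249.99")
```

Because `record` uses `setdefault`, a later turn cannot accidentally clobber an established fact; corrections would need an explicit, audited update path.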
An immutable "case facts" block positioned at the start of context is the correct approach for preserving critical information. It avoids summarization loss, leverages the beginning-of-context processing advantage, and keeps key data separate from verbose tool output.
Any answer proposing progressive summarization of customer details or compressing earlier turns to save context space is an anti-pattern when those turns contain critical transactional data. The correct approach is extracting and preserving key facts in a dedicated, immutable block.
Design Effective Escalation and Ambiguity Resolution
Knowing when to escalate to a human — and equally important, when not to — is a critical design decision for any customer-facing agent. The escalation criteria must be explicit, deterministic, and based on observable conditions rather than the model's self-assessment.
What Should Trigger Escalation
- Explicit customer request: When the customer directly asks to speak with a human agent, escalation should be immediate and unconditional.
- Policy exceptions or gaps: When the customer's request falls outside the defined policy — an unusual refund scenario, a product not covered by standard procedures, or a situation where no documented resolution path exists.
- Inability to make progress: When the agent has attempted available resolution paths and cannot advance the conversation toward resolution.
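The three triggers above can be expressed as a deterministic decision function over observable conditions. A hedged sketch follows; the three boolean inputs are assumed to be computed upstream from explicit signals (an intent classifier for the human request, a policy lookup, a retry counter), not from sentiment or self-reported confidence.

```python
from enum import Enum, auto

class Action(Enum):
    ESCALATE = auto()
    CONTINUE = auto()

def escalation_decision(explicit_human_request: bool,
                        covered_by_policy: bool,
                        resolution_paths_exhausted: bool) -> Action:
    # Deterministic, observable conditions only: no sentiment scores,
    # no self-reported model confidence.
    if explicit_human_request:
        return Action.ESCALATE      # immediate and unconditional
    if not covered_by_policy:
        return Action.ESCALATE      # policy exception or gap
    if resolution_paths_exhausted:
        return Action.ESCALATE      # no way to make progress
    return Action.CONTINUE
```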
The Distinction That Matters
There's an important difference between escalating immediately when a customer explicitly demands a human versus attempting to resolve a straightforward issue first. If a customer asks a simple billing question but also mentions they're frustrated, the correct behavior is to answer the question — not to immediately escalate based on emotional tone.
Why Sentiment and Confidence Are Unreliable Triggers
Two commonly proposed escalation signals are unreliable in practice:
- Sentiment-based escalation: Using the model to assess customer sentiment and escalating when negativity exceeds a threshold. Sentiment analysis is imprecise, and many frustrated customers have simple issues the agent can resolve. This leads to unnecessary escalation of easy cases.
- Self-reported confidence: Asking the model "how confident are you in this answer?" and escalating on low confidence. Models lack reliable introspective confidence calibration — they may express high confidence in wrong answers and low confidence in correct ones.
Ambiguity Resolution
When the system encounters ambiguous input — for example, multiple customers matching a search query — the correct approach is to ask the customer for clarification, not to apply heuristics (picking the most recent match, the one with the highest order value, etc.). Heuristic selection risks taking action on the wrong account.
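A sketch of this branch, assuming a customer-search result shaped as a list of dicts with an `email` field (a hypothetical schema for illustration):

```python
def resolve_customer(matches: list[dict]) -> dict:
    """Return a resolved customer, or a clarification request.

    Never applies heuristics (most recent match, highest order value)
    when more than one record matches.
    """
    if not matches:
        return {"status": "not_found"}
    if len(matches) == 1:
        return {"status": "resolved", "customer": matches[0]}
    options = ", ".join(m["email"] for m in matches)
    return {
        "status": "needs_clarification",
        "question": f"I found several matching accounts ({options}). "
                    "Which one is yours?",
    }
```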
Escalate based on policy gaps and explicit customer requests — never based on sentiment analysis or the model's self-assessed confidence. Both are unreliable proxies for actual escalation need.
Sentiment-based escalation and confidence-based escalation are common wrong answers. They sound sophisticated but produce poor results: over-escalating simple cases while missing genuinely tricky ones where the model is wrongly confident.
Consider a support agent with a 55% first-contact resolution rate whose logs show it escalating many straightforward billing inquiries while handling complex policy-exception cases on its own (sometimes incorrectly). Its triggers are inverted: the most effective fix is to replace sentiment- or confidence-based escalation with deterministic triggers, escalating on policy gaps and explicit customer requests while resolving simple inquiries directly.
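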
Implement Error Propagation Strategies Across Multi-Agent Systems
In multi-agent architectures, how errors propagate from sub-agents to the coordinator determines whether the system can recover gracefully or fails in confusing ways. Neither silently swallowing errors nor immediately terminating the entire workflow is the right approach.
Structured Error Context
When a sub-agent encounters a failure, it should return a structured error object that gives the coordinator enough information to make an intelligent recovery decision. This includes:
- Failure type: Was this a timeout, a permissions error, invalid input, a rate limit, or an unexpected exception?
- Attempted query/action: What exactly was the sub-agent trying to do when it failed?
- Partial results: Did the sub-agent retrieve any useful data before the failure? Even incomplete results may be sufficient for the coordinator to proceed.
- Suggested alternatives: Can the sub-agent propose a fallback — a different query, an alternative data source, or a degraded-but-functional response?
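The four fields above map naturally onto a small structured type. This is a sketch under assumed names (`SubAgentError`, the failure-type strings, and the recovery labels are all illustrative, not a standard API):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class SubAgentError:
    failure_type: str        # e.g. "timeout", "permissions", "rate_limit"
    attempted_action: str    # what the sub-agent was trying to do
    partial_results: list[Any] = field(default_factory=list)
    suggested_alternative: Optional[str] = None

def coordinator_recovery(err: SubAgentError) -> str:
    # Illustrative recovery policy keyed on the structured fields.
    if err.failure_type in ("timeout", "rate_limit"):
        return "retry"
    if err.partial_results:
        return "proceed_with_partial"
    if err.suggested_alternative:
        return "try_alternative"
    return "escalate"
```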
Distinguishing Access Failures from Empty Results
A critical distinction that must be preserved in error propagation: "I couldn't reach the database" is fundamentally different from "I queried the database and found no matching records." The first requires retry or escalation; the second is a valid, actionable answer. If the sub-agent returns an empty result set in both cases, the coordinator cannot tell whether to retry or proceed.
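One simple way to keep the distinction is to use separate result types, so an empty list can never be confused with an unreachable backend. A minimal sketch, with hypothetical type names:

```python
from dataclasses import dataclass

@dataclass
class QuerySuccess:
    rows: list      # may legitimately be empty: "no matching records"

@dataclass
class QueryFailure:
    reason: str     # e.g. "connection refused": retry or escalate

def handle(result):
    if isinstance(result, QueryFailure):
        return "retry"                  # the database was never reached
    if not result.rows:
        return "report_no_records"      # a valid, actionable answer
    return "process_rows"
```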
Avoiding the Extremes
Silently suppressing errors — returning empty or default values when something failed — hides problems and causes downstream logic to operate on incorrect assumptions. Terminating the entire workflow on a single sub-agent failure is equally problematic in systems where other sub-agents may have succeeded and the overall task can still produce useful partial results.
Structured error propagation — with failure type, attempted action, partial results, and alternatives — enables the coordinator to make intelligent recovery decisions. This is the middle path between silently ignoring errors and terminating the entire workflow.
Manage Context in Large Codebase Exploration
Extended coding sessions — particularly those involving exploration of large, unfamiliar codebases — exhibit a recognizable pattern of context degradation. The model's answers become less consistent, it starts referencing "typical patterns" instead of the actual code, and it may contradict findings from earlier in the session.
Recognizing Context Degradation
Warning signs include: the model gives inconsistent answers about the same code when asked at different points in the session; it uses vague language like "this likely follows the standard approach" instead of citing specific files or functions; and earlier discoveries are forgotten or contradicted. These symptoms indicate the context window is saturated with exploration output, pushing earlier findings out of effective reach.
Scratchpad Files for Persistent Findings
The most effective countermeasure is to persist key findings in scratchpad files that exist outside the conversation context. As you discover important architectural patterns, file locations, or dependency relationships, write them to a file. When you need that information later — even in a new session — read it back. This provides durable memory that survives context limits.
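A minimal scratchpad can be as simple as a JSON file keyed by topic. The sketch below assumes a file path of your choosing; nothing about the format is standardized:

```python
import json
from pathlib import Path

def record_finding(path: Path, topic: str, note: str) -> None:
    # Append a finding under a topic; the file persists across sessions.
    notes = json.loads(path.read_text()) if path.exists() else {}
    notes.setdefault(topic, []).append(note)
    path.write_text(json.dumps(notes, indent=2))

def recall(path: Path, topic: str) -> list[str]:
    # Read findings back, even in a fresh session with an empty context.
    if not path.exists():
        return []
    return json.loads(path.read_text()).get(topic, [])
```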
Subagent Delegation for Isolation
When exploring a large codebase, individual exploration tasks (searching for usages of a function, reading a long configuration file, tracing a dependency chain) can generate enormous amounts of context. By delegating these tasks to subagents, you isolate the verbose exploration output from the main session. The subagent returns only the relevant findings, keeping the primary context clean.
Compacting Context
The /compact command reduces context usage by summarizing the conversation while preserving essential information. Use it proactively when you notice the session growing long, rather than waiting until degradation is obvious.
Use scratchpad files to persist important discoveries across context boundaries, and subagent delegation to keep verbose exploration output from polluting the main session's context. These techniques directly counter the context degradation that occurs in extended sessions.
Design Human Review Workflows and Confidence Calibration
A common mistake in production AI systems is measuring accuracy with a single aggregate number — "our extraction pipeline is 97% accurate" — and treating this as sufficient evidence of reliability. Aggregate metrics can mask catastrophically poor performance on specific document types, edge cases, or field categories.
The Problem with Aggregate Accuracy
If your pipeline processes 90% standard invoices (at 99% accuracy) and 10% handwritten receipts (at 70% accuracy), the aggregate accuracy is ~96% — which looks great. But every third handwritten receipt contains errors. If you route all documents the same way based on aggregate accuracy, the handwritten receipts will cause downstream problems that the 96% headline number completely obscures.
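The arithmetic behind that headline number is worth making explicit:

```python
# Aggregate accuracy across a skewed document mix:
# (share of traffic, accuracy within that category)
mix = {
    "standard_invoice":    (0.90, 0.99),
    "handwritten_receipt": (0.10, 0.70),
}
aggregate = sum(share * acc for share, acc in mix.values())
# aggregate is ~0.961: a ~96% headline that hides a 30% error rate
# on handwritten receipts.
```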
Stratified Random Sampling
Instead of measuring accuracy across all documents uniformly, use stratified random sampling — randomly selecting documents from each category (document type, source, complexity level) and measuring error rates within each stratum. This reveals exactly where the system struggles and where it excels.
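A sketch of the sampling step, assuming documents are dicts with a category field whose name you pass in:

```python
import random
from collections import defaultdict

def stratified_sample(docs, key, per_stratum, seed=0):
    """Randomly sample up to `per_stratum` docs from each category."""
    rng = random.Random(seed)   # fixed seed for reproducible audits
    strata = defaultdict(list)
    for doc in docs:
        strata[doc[key]].append(doc)
    return {
        category: rng.sample(group, min(per_stratum, len(group)))
        for category, group in strata.items()
    }
```

Measuring error rates within each returned stratum (rather than over the union) is what surfaces the per-category differences the aggregate hides.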
Field-Level Confidence Calibration
For human review routing, confidence scores should be calibrated at the field level, not the document level. A document might have a confidently extracted company name but an uncertain tax amount. Routing the entire document for human review wastes reviewer time; routing just the uncertain fields is more efficient.
These confidence scores must be calibrated using labeled validation sets — you compare the model's expressed confidence against actual accuracy on known-correct data to ensure that "90% confidence" actually corresponds to roughly 90% accuracy. Without calibration, confidence scores are just numbers with no reliable meaning.
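The calibration check itself can be sketched simply: bucket predictions by stated confidence and compare the mean stated confidence in each bucket against the observed accuracy there. This is a bare-bones reliability-diagram computation, not a full calibration suite:

```python
from collections import defaultdict

def calibration_report(records):
    """records: (predicted_confidence, was_correct) pairs from a
    labeled validation set.  Returns, per confidence bucket, the pair
    (mean stated confidence, observed accuracy); well-calibrated
    scores show the two values tracking each other."""
    buckets = defaultdict(list)
    for conf, correct in records:
        buckets[min(int(conf * 10), 9)].append((conf, correct))
    return {
        b: (sum(c for c, _ in pairs) / len(pairs),           # mean confidence
            sum(1 for _, ok in pairs if ok) / len(pairs))    # observed accuracy
        for b, pairs in sorted(buckets.items())
    }
```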
Track accuracy per document type (or category, or field), not just as an aggregate metric. A headline "97% accuracy" can hide severe underperformance on specific subsets that stratified analysis would reveal.
Watch for answers that rely on aggregate accuracy metrics alone to argue the system is reliable. The correct answer will emphasize per-category or per-document-type accuracy tracking and stratified evaluation.
Preserve Information Provenance in Multi-Source Synthesis
When an agent synthesizes information from multiple sources — combining research from different sub-agents, merging data from several API calls, or consolidating findings from multiple documents — source attribution tends to get lost. The final summary presents conclusions without indicating which source contributed which claim, making it impossible to verify or resolve conflicts.
How Attribution Gets Lost
The typical failure mode is a summarization step that compresses findings from multiple sources into flowing prose. "Revenue increased 15% year-over-year" is presented as fact, but it might have come from Source A (which reported 15%) while Source B reported 12%. Without claim-source mappings preserved through the synthesis pipeline, the conflict is invisible and the reader has no way to assess the reliability of the claim.
Handling Conflicting Statistics
When multiple sources provide conflicting data points, the system should not silently pick one or average them. Instead, it should annotate the conflict with source attribution: "Revenue growth: 15% (Source: Company annual report, 2024) vs. 12% (Source: Industry analyst estimate, Q3 2024)." This preserves the disagreement and lets the consumer make an informed decision.
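A sketch of conflict annotation over claim-source mappings; the `Claim` shape and the rendered format are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    statement: str   # e.g. "Revenue growth"
    value: str
    source: str
    date: str        # publication or collection date

def annotate(claims):
    """Render claims about one statement; conflicts keep every source."""
    if len({c.value for c in claims}) == 1:
        c = claims[0]
        return f"{c.statement}: {c.value} (Source: {c.source}, {c.date})"
    versions = " vs. ".join(
        f"{c.value} (Source: {c.source}, {c.date})" for c in claims
    )
    return f"{claims[0].statement}: {versions}"
```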
Temporal Data Requirements
Statistics and data points without temporal context are ambiguous at best and misleading at worst. A market share figure from 2022 should not be presented alongside a revenue figure from 2024 as though they describe the same moment. Require publication or collection dates for all data points to prevent misinterpretation of stale or mismatched information.
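The date requirement is easiest to enforce at ingestion time, rejecting any data point that arrives without temporal context. A minimal validation sketch:

```python
from datetime import date
from typing import Optional

def validate_data_point(value: str, source: str,
                        collected: Optional[date]) -> dict:
    """Reject data points that arrive without temporal context."""
    if collected is None:
        raise ValueError(f"data point {value!r} from {source!r} lacks "
                         "a publication/collection date")
    return {"value": value, "source": source,
            "collected": collected.isoformat()}
```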
Structured claim-source mappings must be preserved through every synthesis step. Each factual claim should trace back to its source, and conflicting data should be annotated with attribution rather than silently resolved.