Domain 5 (~15% exam weight)

Claude Certified Architect: Context Management & Reliability

Managing context effectively in production systems is the difference between an agent that works reliably and one that silently degrades. This domain covers preserving critical information across long conversations, designing robust escalation logic, propagating errors intelligently through multi-agent systems, countering context degradation in extended sessions, calibrating human review workflows, and maintaining source attribution when synthesizing information from multiple sources.

Task 5.1

Manage Conversation Context to Preserve Critical Information

Long-running conversations and multi-turn interactions create a fundamental tension: the context window is finite, but the information that matters keeps growing. The naive approach — progressively summarizing earlier turns — introduces risks that are subtle but severe in production systems.

The Dangers of Progressive Summarization

When you compress earlier conversation turns into summaries, critical specifics tend to disappear. Dollar amounts become "a significant refund," dates become "recently," account numbers become "the customer's account," and percentages become "a large portion." Each round of summarization further degrades precision. For a customer support agent, losing the exact order number, refund amount, or account ID mid-conversation can lead to incorrect actions with real financial consequences.

The "Lost in the Middle" Effect

Research has shown that models process the beginning and end of their context window more reliably than the middle. Information placed in the middle of a long context is more likely to be overlooked or given less weight. This has direct architectural implications for how you structure prompts in long conversations.

Tool Results Accumulate Tokens Disproportionately

In agent systems, tool call results often contain verbose data (full API responses, detailed records) that quickly inflate the context. Most of this data is irrelevant to the ongoing conversation, but it pushes important information further into the middle of the context where it's less reliably processed.

The Solution: Immutable Case Facts

The architectural pattern that addresses all three problems is to extract key transactional facts — customer name, account ID, order numbers, dollar amounts, dates, claim status — into a persistent "case facts" block positioned at the start of each prompt. This block is never summarized, never compressed, and is refreshed with new facts as they emerge. Verbose tool outputs should be trimmed to only the fields relevant to the current task.
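The pattern can be sketched as follows. All names here (CaseFacts, build_prompt, the tag format) are illustrative, not part of any Claude SDK; the essential idea is that facts are re-rendered verbatim at the top of every prompt and are updated, never compressed.

```python
class CaseFacts:
    """Key transactional facts, kept verbatim for the life of the case."""

    def __init__(self):
        self.facts = {}  # e.g. {"account_id": "ACCT-9921", ...}

    def update(self, **new_facts):
        # New facts are added or overwritten; nothing is ever summarized away.
        self.facts.update(new_facts)

    def render(self):
        lines = ["<case_facts>"]
        for key, value in sorted(self.facts.items()):
            lines.append(f"  {key}: {value}")
        lines.append("</case_facts>")
        return "\n".join(lines)


def build_prompt(case_facts, conversation_summary, latest_turn):
    # Facts go first (beginning-of-context processing advantage),
    # the running summary after, and the latest turn at the end.
    return "\n\n".join([case_facts.render(), conversation_summary, latest_turn])


facts = CaseFacts()
facts.update(account_id="ACCT-9921", refund_amount="$142.50")
prompt = build_prompt(facts, "Summary of earlier turns...",
                      "User: where is my refund?")
```

Because the block is rebuilt from the `facts` dict on every turn, exact values like the refund amount survive no matter how many summarization passes the rest of the history goes through.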

Key Concept

An immutable "case facts" block positioned at the start of context is the correct approach for preserving critical information. It avoids summarization loss, leverages the beginning-of-context processing advantage, and keeps key data separate from verbose tool output.

Exam Trap

Any answer proposing progressive summarization of customer details or compressing earlier turns to save context space is an anti-pattern when those turns contain critical transactional data. The correct approach is extracting and preserving key facts in a dedicated, immutable block.

Task 5.2

Design Effective Escalation and Ambiguity Resolution

Knowing when to escalate to a human — and equally important, when not to — is a critical design decision for any customer-facing agent. The escalation criteria must be explicit, deterministic, and based on observable conditions rather than the model's self-assessment.

What Should Trigger Escalation

  • Explicit customer request: When the customer directly asks to speak with a human agent, escalation should be immediate and unconditional.
  • Policy exceptions or gaps: When the customer's request falls outside the defined policy — an unusual refund scenario, a product not covered by standard procedures, or a situation where no documented resolution path exists.
  • Inability to make progress: When the agent has attempted available resolution paths and cannot advance the conversation toward resolution.
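These three triggers can be expressed as a deterministic check on observable conditions. The function below is a hypothetical sketch (names and the attempt threshold are assumptions, not from any specification); the point is that nothing in it depends on sentiment or the model's self-assessment.

```python
def should_escalate(customer_requested_human, policy_covers_request,
                    resolution_attempts, max_attempts=3):
    """Return (escalate?, reason) based only on observable conditions."""
    if customer_requested_human:
        # Explicit request: immediate and unconditional.
        return True, "explicit customer request"
    if not policy_covers_request:
        # No documented resolution path for this request.
        return True, "policy gap or exception"
    if resolution_attempts >= max_attempts:
        # Available resolution paths exhausted without progress.
        return True, "unable to make progress"
    return False, "continue handling"
```

A frustrated customer with a simple billing question hits none of these conditions, so the agent answers the question instead of escalating on tone.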

The Distinction That Matters

There's an important difference between escalating immediately when a customer explicitly demands a human versus attempting to resolve a straightforward issue first. If a customer asks a simple billing question but also mentions they're frustrated, the correct behavior is to answer the question — not to immediately escalate based on emotional tone.

Why Sentiment and Confidence Are Unreliable Triggers

Two commonly proposed escalation signals are unreliable in practice:

  • Sentiment-based escalation: Using the model to assess customer sentiment and escalating when negativity exceeds a threshold. Sentiment analysis is imprecise, and many frustrated customers have simple issues the agent can resolve. This leads to unnecessary escalation of easy cases.
  • Self-reported confidence: Asking the model "how confident are you in this answer?" and escalating on low confidence. Models lack reliable introspective confidence calibration — they may express high confidence in wrong answers and low confidence in correct ones.

Ambiguity Resolution

When the system encounters ambiguous input — for example, multiple customers matching a search query — the correct approach is to ask the customer for clarification, not to apply heuristics (picking the most recent match, the one with the highest order value, etc.). Heuristic selection risks taking action on the wrong account.
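A minimal sketch of this behavior, with hypothetical names: when a lookup returns more than one match, the agent surfaces the ambiguity as a question rather than picking by a heuristic.

```python
def resolve_customer(matches):
    """Proceed only on an unambiguous match; otherwise ask the customer."""
    if len(matches) == 1:
        return {"action": "proceed", "customer": matches[0]}
    # Never pick by recency or order value: acting on the wrong account
    # is worse than one extra conversational turn.
    options = ", ".join(m["name"] for m in matches)
    return {
        "action": "clarify",
        "question": f"I found multiple matching accounts ({options}). "
                    "Which one is yours?",
    }
```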

Key Concept

Escalate based on policy gaps and explicit customer requests — never based on sentiment analysis or the model's self-assessed confidence. Both are unreliable proxies for actual escalation need.

Exam Trap

Sentiment-based escalation and confidence-based escalation are common wrong answers. They sound sophisticated but produce poor results: over-escalating simple cases while missing genuinely tricky ones where the model is wrongly confident.

Practice Question

Your customer support agent has a 55% first-contact resolution rate. Analysis shows it escalates many straightforward billing inquiries but handles complex policy-exception cases on its own (sometimes incorrectly). What is the most effective fix?

  • A Raise the sentiment-based escalation threshold so the agent tolerates more negativity before escalating
  • B Add a confidence score output and escalate whenever confidence falls below 70%
  • C Replace the escalation logic with explicit criteria (policy gaps, customer request, inability to progress) and add few-shot examples distinguishing simple vs. policy-exception cases
  • D Increase the maximum number of tool calls so the agent has more attempts to resolve issues before escalating

Correct Answer: C

The root problem is that the escalation logic is based on unreliable signals (likely sentiment or vague confidence) rather than explicit criteria. Option C addresses both sides of the problem: it defines precise triggers for escalation (policy gaps, explicit requests, inability to progress) and uses few-shot examples to help the model distinguish between straightforward cases it should handle and policy exceptions it should escalate. Options A and B both rely on unreliable metrics, and D doesn't address the decision-making quality.

Task 5.3

Implement Error Propagation Strategies Across Multi-Agent Systems

In multi-agent architectures, how errors propagate from sub-agents to the coordinator determines whether the system can recover gracefully or fails in confusing ways. Neither silently swallowing errors nor immediately terminating the entire workflow is the right approach.

Structured Error Context

When a sub-agent encounters a failure, it should return a structured error object that gives the coordinator enough information to make an intelligent recovery decision. This includes:

  • Failure type: Was this a timeout, a permissions error, invalid input, a rate limit, or an unexpected exception?
  • Attempted query/action: What exactly was the sub-agent trying to do when it failed?
  • Partial results: Did the sub-agent retrieve any useful data before the failure? Even incomplete results may be sufficient for the coordinator to proceed.
  • Suggested alternatives: Can the sub-agent propose a fallback — a different query, an alternative data source, or a degraded-but-functional response?
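One way to sketch such an error object is a small dataclass carrying the four fields above, plus a coordinator that branches on them. The field names and recovery policies here are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgentError:
    failure_type: str                       # "timeout", "permissions", "rate_limit", ...
    attempted_action: str                   # what the sub-agent was doing when it failed
    partial_results: list = field(default_factory=list)
    suggested_alternatives: list = field(default_factory=list)


def coordinator_decide(err):
    """Example recovery policy driven by the structured error context."""
    if err.failure_type == "timeout" and err.partial_results:
        return "proceed_with_partial"       # incomplete data may still be enough
    if err.failure_type in ("timeout", "rate_limit"):
        return "retry"                      # transient: worth another attempt
    return "escalate"                       # permissions, invalid input, etc.


err = SubAgentError(
    failure_type="timeout",
    attempted_action="fetch order history for ACCT-9921",
    partial_results=[{"order_id": "O-1001", "status": "shipped"}],
    suggested_alternatives=["retry with smaller page size"],
)
```

Without the structured fields, the coordinator would see only "it failed" and could not choose between retrying, degrading gracefully, or escalating.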

Distinguishing Access Failures from Empty Results

A critical distinction that must be preserved in error propagation: "I couldn't reach the database" is fundamentally different from "I queried the database and found no matching records." The first requires retry or escalation; the second is a valid, actionable answer. If the sub-agent returns an empty result set in both cases, the coordinator cannot tell whether to retry or proceed.
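A simple way to enforce the distinction is to give the two outcomes different types, so an access failure can never masquerade as an empty result. This is a hypothetical sketch; the names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class QuerySucceeded:
    rows: list          # may legitimately be empty: that is a valid answer


@dataclass
class QueryFailed:
    reason: str         # e.g. "connection refused": the data is unknown


def handle(result):
    if isinstance(result, QueryFailed):
        return "retry"                # access failure: retry or escalate
    if not result.rows:
        return "answer: no records"   # empty result: a real, actionable answer
    return "answer: found records"
```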

Avoiding the Extremes

Silently suppressing errors — returning empty or default values when something failed — hides problems and causes downstream logic to operate on incorrect assumptions. Terminating the entire workflow on a single sub-agent failure is equally problematic in systems where other sub-agents may have succeeded and the overall task can still produce useful partial results.

Key Concept

Structured error propagation — with failure type, attempted action, partial results, and alternatives — enables the coordinator to make intelligent recovery decisions. This is the middle path between silently ignoring errors and terminating the entire workflow.

Task 5.4

Manage Context in Large Codebase Exploration

Extended coding sessions — particularly those involving exploration of large, unfamiliar codebases — exhibit a recognizable pattern of context degradation. The model's answers become less consistent, it starts referencing "typical patterns" instead of the actual code, and it may contradict findings from earlier in the session.

Recognizing Context Degradation

Warning signs include: the model gives inconsistent answers about the same code when asked at different points in the session; it uses vague language like "this likely follows the standard approach" instead of citing specific files or functions; and earlier discoveries are forgotten or contradicted. These symptoms indicate the context window is saturated with exploration output, pushing earlier findings out of effective reach.

Scratchpad Files for Persistent Findings

The most effective countermeasure is to persist key findings in scratchpad files that exist outside the conversation context. As you discover important architectural patterns, file locations, or dependency relationships, write them to a file. When you need that information later — even in a new session — read it back. This provides durable memory that survives context limits.

Subagent Delegation for Isolation

When exploring a large codebase, individual exploration tasks (searching for usages of a function, reading a long configuration file, tracing a dependency chain) can generate enormous amounts of context. By delegating these tasks to subagents, you isolate the verbose exploration output from the main session. The subagent returns only the relevant findings, keeping the primary context clean.

Compacting Context

The /compact command reduces context usage by summarizing the conversation while preserving essential information. Use it proactively when you notice the session growing long, rather than waiting until degradation is obvious.

Key Concept

Use scratchpad files to persist important discoveries across context boundaries, and subagent delegation to keep verbose exploration output from polluting the main session's context. These techniques directly counter the context degradation that occurs in extended sessions.

Task 5.5

Design Human Review Workflows and Confidence Calibration

A common mistake in production AI systems is measuring accuracy with a single aggregate number — "our extraction pipeline is 97% accurate" — and treating this as sufficient evidence of reliability. Aggregate metrics can mask catastrophically poor performance on specific document types, edge cases, or field categories.

The Problem with Aggregate Accuracy

If your pipeline processes 90% standard invoices (at 99% accuracy) and 10% handwritten receipts (at 70% accuracy), the aggregate accuracy is ~96% — which looks great. But nearly one in three handwritten receipts contains errors. If you route all documents the same way based on aggregate accuracy, the handwritten receipts will cause downstream problems that the 96% headline number completely obscures.

Stratified Random Sampling

Instead of measuring accuracy across all documents uniformly, use stratified random sampling — randomly selecting documents from each category (document type, source, complexity level) and measuring error rates within each stratum. This reveals exactly where the system struggles and where it excels.
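The masking effect from the example above, and a stratified sampling sketch, in a few lines (the document-type mix repeats the figures from the text; the 50-per-stratum sample size is an arbitrary assumption):

```python
import random

# Per-stratum accuracy vs. the aggregate the headline metric reports.
strata = {
    "standard_invoice":    {"share": 0.90, "accuracy": 0.99},
    "handwritten_receipt": {"share": 0.10, "accuracy": 0.70},
}
aggregate = sum(s["share"] * s["accuracy"] for s in strata.values())
# aggregate is ~0.961, yet the receipt stratum has a 30% error rate.


def stratified_sample(docs_by_type, per_stratum=50):
    """Randomly sample a fixed number of documents from each stratum."""
    return {
        doc_type: random.sample(docs, min(per_stratum, len(docs)))
        for doc_type, docs in docs_by_type.items()
    }
```

Measuring error rates within each sampled stratum, rather than over the pooled sample, is what surfaces the weak category.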

Field-Level Confidence Calibration

For human review routing, confidence scores should be calibrated at the field level, not the document level. A document might have a confidently-extracted company name but an uncertain tax amount. Routing the entire document for human review wastes reviewer time; routing just the uncertain fields is more efficient.

These confidence scores must be calibrated using labeled validation sets — you compare the model's expressed confidence against actual accuracy on known-correct data to ensure that "90% confidence" actually corresponds to roughly 90% accuracy. Without calibration, confidence scores are just numbers with no reliable meaning.
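A minimal calibration check over a labeled validation set might look like this: bucket field-level predictions by expressed confidence and compare each bucket's observed accuracy against its nominal confidence. The bucketing scheme is an illustrative assumption.

```python
def calibration_report(records, bucket_width=0.1):
    """records: list of (field_confidence, was_correct) pairs
    from a labeled validation set. Returns observed accuracy per
    confidence bucket, keyed by the bucket's lower edge."""
    n_buckets = int(1 / bucket_width)
    buckets = {}
    for conf, correct in records:
        b = min(int(conf / bucket_width), n_buckets - 1)
        buckets.setdefault(b, []).append(correct)
    return {
        round(b * bucket_width, 1): sum(v) / len(v)
        for b, v in sorted(buckets.items())
    }


report = calibration_report([(0.95, True), (0.92, True), (0.93, False), (0.45, False)])
# Well-calibrated scores: the 0.9 bucket's observed accuracy should be near 0.9.
```

If the 0.9 bucket's observed accuracy is far below 0.9, routing thresholds built on those scores will under-review exactly the fields that need human eyes.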

Key Concept

Track accuracy per document type (or category, or field), not just as an aggregate metric. A headline "97% accuracy" can hide severe underperformance on specific subsets that stratified analysis would reveal.

Exam Trap

Watch for answers that rely on aggregate accuracy metrics alone to argue the system is reliable. The correct answer will emphasize per-category or per-document-type accuracy tracking and stratified evaluation.

Task 5.6

Preserve Information Provenance in Multi-Source Synthesis

When an agent synthesizes information from multiple sources — combining research from different sub-agents, merging data from several API calls, or consolidating findings from multiple documents — source attribution tends to get lost. The final summary presents conclusions without indicating which source contributed which claim, making it impossible to verify or resolve conflicts.

How Attribution Gets Lost

The typical failure mode is a summarization step that compresses findings from multiple sources into flowing prose. "Revenue increased 15% year-over-year" is presented as fact, but it might have come from Source A (which reported 15%) while Source B reported 12%. Without claim-source mappings preserved through the synthesis pipeline, the conflict is invisible and the reader has no way to assess the reliability of the claim.

Handling Conflicting Statistics

When multiple sources provide conflicting data points, the system should not silently pick one or average them. Instead, it should annotate the conflict with source attribution: "Revenue growth: 15% (Source: Company annual report, 2024) vs. 12% (Source: Industry analyst estimate, Q3 2024)." This preserves the disagreement and lets the consumer make an informed decision.
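A sketch of a claim-source mapping that produces exactly this kind of annotation (the Claim fields and output format are illustrative assumptions):

```python
from dataclasses import dataclass


@dataclass
class Claim:
    metric: str
    value: str
    source: str
    date: str     # temporal context travels with every data point


def annotate(claims):
    """Render agreeing claims with attribution; surface conflicts explicitly."""
    values = {c.value for c in claims}
    if len(values) > 1:
        # Conflict: present all values with sources, never average or pick one.
        return f"{claims[0].metric}: " + " vs. ".join(
            f"{c.value} (Source: {c.source}, {c.date})" for c in claims
        )
    c = claims[0]
    return f"{c.metric}: {c.value} (Source: {c.source}, {c.date})"


line = annotate([
    Claim("Revenue growth", "15%", "Company annual report", "2024"),
    Claim("Revenue growth", "12%", "Industry analyst estimate", "Q3 2024"),
])
```

Because every `Claim` carries its source and date through the pipeline, the conflict stays visible in the final output instead of being flattened into unattributed prose.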

Temporal Data Requirements

Statistics and data points without temporal context are ambiguous at best and misleading at worst. A market share figure from 2022 should not be presented alongside a revenue figure from 2024 as though they describe the same moment. Require publication or collection dates for all data points to prevent misinterpretation of stale or mismatched information.

Key Concept

Structured claim-source mappings must be preserved through every synthesis step. Each factual claim should trace back to its source, and conflicting data should be annotated with attribution rather than silently resolved.
