How We Built Our Multi-Agent Research System
Published Jun 13, 2025
Overview
Anthropic's Research feature coordinates multiple Claude agents to explore complex topics more effectively. Taking the system from prototype to production was a significant engineering effort, and the journey yielded crucial insights into building reliable multi-agent AI systems.
Benefits of Multi-Agent Systems
Why Multiple Agents Excel at Research
Research inherently involves unpredictable problem-solving paths. Unlike traditional pipelines, multi-agent architectures enable:
- Dynamic adaptation: Agents pivot based on discoveries rather than following predetermined routes
- Parallel exploration: Multiple agents investigate different angles simultaneously, compressing vast information into actionable insights
- Separation of concerns: Each agent operates with distinct tools and exploration paths, reducing interdependencies
The data tells a compelling story: a multi-agent configuration with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed a single-agent Claude Opus 4 system by 90.2% on internal research evaluations.
Token Efficiency and Scale
Analysis of the BrowseComp evaluation revealed that token usage alone explains approximately 80% of performance variance. The costs scale accordingly: single agents use roughly 4 times more tokens than basic chat interactions, and multi-agent systems roughly 15 times more. That makes multi-agent architectures viable only for high-stakes research tasks where the performance gain justifies the heightened cost.
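The cost relationship above reduces to simple arithmetic. A minimal sketch, using the 4x and 15x multipliers cited in this section; the token count and per-token price are hypothetical placeholders, not published figures:

```python
# Back-of-envelope cost comparison using the multipliers cited above:
# chat = 1x baseline, single agent ~4x, multi-agent ~15x.
CHAT_TOKENS = 2_000          # hypothetical tokens for a typical chat turn
PRICE_PER_1K = 0.01          # hypothetical blended $ per 1K tokens

def cost(multiplier: float) -> float:
    """Estimated dollar cost of one interaction at a given token multiplier."""
    return CHAT_TOKENS * multiplier / 1_000 * PRICE_PER_1K

chat_cost = cost(1)          # $0.02
single_agent_cost = cost(4)  # $0.08
multi_agent_cost = cost(15)  # $0.30

# Multi-agent research only pays off when the task's value exceeds
# the ~15x premium over a plain chat answer.
print(f"chat: ${chat_cost:.2f}, agent: ${single_agent_cost:.2f}, "
      f"multi-agent: ${multi_agent_cost:.2f}")
```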
Architecture: Orchestrator-Worker Pattern
The system employs a lead agent that coordinates specialized subagents operating in parallel:
- User Query Submission: The lead agent receives and analyzes the query
- Strategy Development: The agent creates a decomposed research plan
- Subagent Spawning: Specialized agents are created for parallel investigation
- Iterative Research: Subagents independently search and evaluate findings
- Synthesis: Results condense into a coherent response
- Citation Processing: A dedicated agent ensures proper source attribution
This differs fundamentally from traditional RAG systems, which perform static retrieval. The multi-agent approach dynamically adapts search strategies based on intermediate findings.
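The six steps above can be sketched as an orchestrator-worker skeleton. This is a minimal illustration, not Anthropic's implementation: every class and function name is hypothetical, and the planning, search, and citation steps are stubs where a real system would call an LLM API:

```python
from dataclasses import dataclass, field
from concurrent.futures import ThreadPoolExecutor

@dataclass
class SubagentTask:
    objective: str
    tools: list[str]
    output_format: str = "bullet summary with sources"

def add_citations(report: str) -> str:
    """Stand-in for the dedicated citation-processing agent."""
    return report + "\n[sources attached]"

@dataclass
class LeadAgent:
    """Orchestrator: analyzes the query, plans, spawns parallel subagents,
    then synthesizes their findings into one response."""
    findings: list[str] = field(default_factory=list)

    def plan(self, query: str) -> list[SubagentTask]:
        # In a real system an LLM decomposes the query; here we stub it.
        return [SubagentTask(objective=f"{query}: angle {i}", tools=["web_search"])
                for i in range(3)]

    def run_subagent(self, task: SubagentTask) -> str:
        # Placeholder for an independent search-and-evaluate loop.
        return f"findings for '{task.objective}'"

    def research(self, query: str) -> str:
        tasks = self.plan(query)                     # strategy development
        with ThreadPoolExecutor() as pool:           # parallel investigation
            self.findings = list(pool.map(self.run_subagent, tasks))
        report = "\n".join(self.findings)            # synthesis
        return add_citations(report)                 # citation processing
```

The key contrast with static RAG is in `plan` and `run_subagent`: both would be LLM-driven loops that can change course based on intermediate findings rather than executing a fixed retrieval step.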
Prompt Engineering Principles
Eight Core Strategies
1. Think Like Your Agents Effective prompting requires understanding agent behavior. Anthropic built simulations using their Console to observe agents step-by-step, revealing failure modes like excessive search queries or premature termination.
2. Teach Delegation Lead agents initially provided vague instructions like "research the semiconductor shortage." This caused duplicated work across subagents. Detailed task descriptions—including objectives, output formats, tool guidance, and clear boundaries—proved essential.
3. Scale Effort Appropriately Simple queries require one agent with 3-10 tool calls; complex research might deploy 10+ specialized subagents. Explicit guidelines prevented overinvestment in straightforward tasks.
4. Prioritize Tool Design Agent-tool interfaces are as critical as human-computer interfaces. "Bad tool descriptions can send agents down completely wrong paths," requiring distinct purposes and clear documentation.
5. Leverage Agent Self-Improvement Claude 4 models can diagnose prompt failures and suggest improvements. Anthropic created tool-testing agents that rewrote flawed descriptions, achieving "40% decrease in task completion time" for subsequent operations.
6. Start Broad, Then Narrow Expert research mirrors this pattern: wide exploration precedes focused investigation. Agents initially defaulted to overly specific queries returning few results; broader approaches proved more effective.
7. Guide Thinking Processes Extended thinking mode provides visible reasoning tokens. Lead agents use this for planning; subagents employ interleaved thinking post-tool-use to evaluate quality and refine queries.
8. Parallelize Tool Calling Sequential searches were painfully slow. Parallel approaches—spinning up 3-5 subagents simultaneously and using 3+ tools in parallel—"cut research time by up to 90%" for complex queries.
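The parallelism in strategy 8 operates at two levels: subagents run concurrently, and each subagent fires multiple tools concurrently. A minimal sketch using `asyncio`; the tool names and the simulated latency are illustrative assumptions:

```python
import asyncio

async def tool_call(name: str, query: str) -> str:
    """Stand-in for one tool invocation (web search, doc fetch, etc.)."""
    await asyncio.sleep(0.01)      # simulated I/O latency
    return f"{name} results for {query!r}"

async def subagent(query: str) -> list[str]:
    # Level 2: each subagent fires 3+ tools concurrently, not sequentially.
    tools = ["web_search", "internal_docs", "news_search"]
    return await asyncio.gather(*(tool_call(t, query) for t in tools))

async def lead_agent(queries: list[str]) -> list[list[str]]:
    # Level 1: 3-5 subagents also run concurrently, compounding the speedup.
    return await asyncio.gather(*(subagent(q) for q in queries))

results = asyncio.run(lead_agent(["supply chain", "fab capacity", "export rules"]))
```

With the simulated 10 ms latency, all nine tool calls overlap, so total wall time stays near a single call's latency instead of nine times it.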
Evaluation Methodology
Unique Challenges
Multi-agent systems defy traditional evaluation frameworks. Agents may take completely different valid paths to reach identical goals, making process-focused evaluation inadequate.
Practical Approaches
Small-Sample Testing First: Starting with ~20 test queries let Anthropic observe dramatic effects from prompt tweaks (success rates jumping from 30% to 80%), effects large enough that extensive early-stage test sets were unnecessary.
LLM-as-Judge Evaluation: A single LLM judge evaluated outputs against rubrics covering factual accuracy, citation correctness, completeness, source quality, and tool efficiency. This approach scaled to hundreds of outputs effectively.
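The single-judge pattern can be sketched as a small scoring scaffold. The rubric criteria come from this section; the prompt wording, the `judge_llm` completion function, and the pass threshold are hypothetical assumptions, not Anthropic's actual setup:

```python
from dataclasses import dataclass

RUBRIC = ["factual accuracy", "citation correctness", "completeness",
          "source quality", "tool efficiency"]

@dataclass
class Judgment:
    scores: dict[str, float]   # one 0.0-1.0 score per criterion
    passed: bool

def judge(output: str, judge_llm, threshold: float = 0.7) -> Judgment:
    """Single-call LLM judge: one prompt covers the whole rubric,
    which is what lets this scale to hundreds of outputs."""
    prompt = (
        "Score the research report below from 0.0 to 1.0 on each criterion:\n"
        + "\n".join(f"- {c}" for c in RUBRIC)
        + f"\n\nReport:\n{output}\n\nAnswer as 'criterion: score' lines."
    )
    raw = judge_llm(prompt)    # judge_llm: hypothetical completion function
    scores = {}
    for line in raw.splitlines():
        name, _, value = line.partition(":")
        if name.strip() in RUBRIC:
            scores[name.strip()] = float(value)
    mean = sum(scores.values()) / len(scores)
    return Judgment(scores=scores, passed=mean >= threshold)
```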
Human Evaluation: Automated systems missed edge cases—notably, early agents favored SEO-optimized content farms over authoritative academic sources. Manual testing remained essential for discovering subtle selection biases.
Production Reliability Challenges
Statefulness and Error Compounding
Agents maintain state across many tool calls. Unlike traditional software, minor errors cascade unpredictably. Solutions include:
- Resume capability from failure points rather than complete restarts
- Graceful error handling using agent intelligence
- Hybrid approaches combining AI adaptability with deterministic safeguards
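The first bullet, resuming from failure points, can be sketched with a simple checkpoint file. The filename and step structure are hypothetical; a production system would persist richer agent state, but the shape is the same:

```python
import json
from pathlib import Path

CHECKPOINT = Path("research_state.json")   # hypothetical state file

def run_step(step: int, state: dict) -> dict:
    """Stand-in for one tool call; in practice this may raise transiently."""
    state["completed"].append(step)
    return state

def research_with_resume(total_steps: int) -> dict:
    # Resume from the last checkpoint instead of restarting from scratch.
    if CHECKPOINT.exists():
        state = json.loads(CHECKPOINT.read_text())
    else:
        state = {"completed": []}
    for step in range(total_steps):
        if step in state["completed"]:
            continue                              # done before the failure
        state = run_step(step, state)
        CHECKPOINT.write_text(json.dumps(state))  # persist after each step
    return state
```

If the process dies at step 7 of 10, the next run skips steps 0-6 and pays only for the remaining work, rather than burning tokens re-running every prior tool call.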
Debugging Complexity
Non-deterministic behavior across identical prompts complicates diagnosis. Anthropic implemented production tracing and high-level observability of decision patterns (without monitoring conversation contents) to identify root causes systematically.
Deployment Strategy
Rainbow deployments—gradually shifting traffic between old and new versions—prevent disruption to running agents during updates.
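The essential property of a rainbow deployment for agents is sticky routing: a session that started on the old version finishes on it, even as new traffic shifts to the new version. A minimal sketch; the version labels and rollout shares are illustrative assumptions:

```python
import hashlib

def route(session_id: str, new_version_share: float) -> str:
    """Sticky version routing: hashing the session ID pins each running
    agent to one version, so an update never flips it mid-run."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = digest[0] / 256            # stable value in [0, 1)
    return "new" if bucket < new_version_share else "old"

# Ramp gradually (e.g. 5% -> 25% -> 100% of *new* sessions) while
# sessions already pinned to "old" keep their version until they finish.
```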
Synchronous Bottlenecks
Current systems execute subagents synchronously, waiting for completion before proceeding. This simplifies coordination but limits real-time steering. Asynchronous execution would enable additional parallelism but introduces coordination complexity with state consistency and error propagation challenges.
Real-World Impact
Users report that the Research feature helped them:
- Identify previously unconsidered business opportunities
- Navigate complex healthcare decisions
- Resolve technical problems
- "Save up to days of work" through uncovered connections
Additional Appendix Tips
End-State Evaluation: For agents modifying persistent state, focus on whether final states achieved intended outcomes rather than validating intermediate steps.
Long-Horizon Context Management: Store completed work phases in external memory; spawn fresh subagents with clean contexts while maintaining handoff continuity through carefully managed references.
Artifact-Based Output: Allow subagents to persist work directly to external systems rather than routing everything through the lead agent, preventing information loss and reducing token overhead.
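The last two tips share a mechanism: the subagent writes its full output to an external store and hands the lead agent only a lightweight reference. A minimal sketch using the local filesystem as a stand-in for that store; all names are hypothetical:

```python
import json
import tempfile
from pathlib import Path

ARTIFACT_DIR = Path(tempfile.mkdtemp())   # stand-in for an external store

def persist_artifact(subagent_id: str, content: dict) -> str:
    """Subagent persists its full findings externally and returns only
    a reference, keeping the lead agent's context window small."""
    path = ARTIFACT_DIR / f"{subagent_id}.json"
    path.write_text(json.dumps(content))
    return str(path)                      # the handoff reference

def load_artifact(ref: str) -> dict:
    """Any later agent (or a fresh subagent) can rehydrate from the reference."""
    return json.loads(Path(ref).read_text())

# The lead agent receives a reference, not the raw content:
ref = persist_artifact("subagent-1", {"findings": ["fact A", "fact B"]})
summary_for_lead = {"subagent": "subagent-1", "artifact": ref}
```

The same mechanism supports long-horizon context management: a fresh subagent spawned with a clean context can pick up exactly where a previous phase left off by loading the referenced artifact.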
Key Takeaway
The transition from prototype to production revealed that "the last mile often becomes most of the journey." Reliable agentic systems require meticulous engineering, comprehensive testing, detail-oriented design, robust operations, and tight collaboration between teams with a deep understanding of current agent capabilities.