How We Built Our Multi-Agent Research System
Published Jun 13, 2025
Overview
Anthropic's Research feature coordinates multiple Claude agents to explore complex topics more effectively. Taking the system from prototype to production was a significant engineering effort, and the journey yielded crucial insights into building reliable multi-agent AI systems.
Benefits of Multi-Agent Systems
Why Multiple Agents Excel at Research
Research inherently involves unpredictable problem-solving paths. Unlike traditional pipelines, multi-agent architectures enable:
- Dynamic adaptation: Agents pivot based on discoveries rather than following predetermined routes
- Parallel exploration: Multiple agents investigate different angles simultaneously, compressing vast information into actionable insights
- Separation of concerns: Each agent operates with distinct tools and exploration paths, reducing interdependencies
The data tells a compelling story: a multi-agent configuration with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed a single-agent Claude Opus 4 system by 90.2% on internal research evaluations.
Token Efficiency and Scale
Analysis of the BrowseComp evaluation revealed that token usage alone explains approximately 80% of performance variance. The costs scale accordingly: single agents use roughly 4 times more tokens than basic chat interactions, and multi-agent systems roughly 15 times more. That makes multi-agent architectures viable only for high-stakes research tasks where the performance gain justifies the heightened cost.
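The cost relationship above reduces to simple arithmetic. A minimal sketch, using the 4x and 15x multipliers cited in this section; the token count and per-token price are hypothetical placeholders, not published figures:

```python
# Back-of-envelope cost comparison using the multipliers cited above:
# chat = 1x baseline, single agent ~4x, multi-agent ~15x.
CHAT_TOKENS = 2_000          # hypothetical tokens for a typical chat turn
PRICE_PER_1K = 0.01          # hypothetical blended $ per 1K tokens

def cost(multiplier: float) -> float:
    """Estimated dollar cost of one interaction at a given token multiplier."""
    return CHAT_TOKENS * multiplier / 1_000 * PRICE_PER_1K

chat_cost = cost(1)          # $0.02
single_agent_cost = cost(4)  # $0.08
multi_agent_cost = cost(15)  # $0.30

# Multi-agent research only pays off when the task's value exceeds
# the ~15x premium over a plain chat answer.
print(f"chat: ${chat_cost:.2f}, agent: ${single_agent_cost:.2f}, "
      f"multi-agent: ${multi_agent_cost:.2f}")
```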
Architecture: Orchestrator-Worker Pattern
The system employs a lead agent that coordinates specialized subagents operating in parallel:
- User Query Submission: The lead agent receives and analyzes the query
- Strategy Development: The agent creates a decomposed research plan
- Subagent Spawning: Specialized agents are created for parallel investigation
- Iterative Research: Subagents independently search and evaluate findings
- Synthesis: Results condense into a coherent response
- Citation Processing: A dedicated agent ensures proper source attribution
This differs fundamentally from traditional RAG systems, which perform static retrieval. The multi-agent approach dynamically adapts search strategies based on intermediate findings.
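The six steps above can be sketched as an orchestrator-worker skeleton. This is a minimal illustration, not Anthropic's implementation: every class and function name is hypothetical, and the planning, search, and citation steps are stubs where a real system would call an LLM API:

```python
from dataclasses import dataclass, field
from concurrent.futures import ThreadPoolExecutor

@dataclass
class SubagentTask:
    objective: str
    tools: list[str]
    output_format: str = "bullet summary with sources"

def add_citations(report: str) -> str:
    """Stand-in for the dedicated citation-processing agent."""
    return report + "\n[sources attached]"

@dataclass
class LeadAgent:
    """Orchestrator: analyzes the query, plans, spawns parallel subagents,
    then synthesizes their findings into one response."""
    findings: list[str] = field(default_factory=list)

    def plan(self, query: str) -> list[SubagentTask]:
        # In a real system an LLM decomposes the query; here we stub it.
        return [SubagentTask(objective=f"{query}: angle {i}", tools=["web_search"])
                for i in range(3)]

    def run_subagent(self, task: SubagentTask) -> str:
        # Placeholder for an independent search-and-evaluate loop.
        return f"findings for '{task.objective}'"

    def research(self, query: str) -> str:
        tasks = self.plan(query)                     # strategy development
        with ThreadPoolExecutor() as pool:           # parallel investigation
            self.findings = list(pool.map(self.run_subagent, tasks))
        report = "\n".join(self.findings)            # synthesis
        return add_citations(report)                 # citation processing
```

The key contrast with static RAG is in `plan` and `run_subagent`: both would be LLM-driven loops that can change course based on intermediate findings rather than executing a fixed retrieval step.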
Prompt Engineering Principles
Eight Core Strategies
1. Think Like Your Agents Effective prompting requires understanding agent behavior. Anthropic built simulations using their Console to observe agents step-by-step, revealing failure modes like excessive search queries or premature termination.
2. Teach Delegation Lead agents initially provided vague instructions like "research the semiconductor shortage." This caused duplicated work across subagents. Detailed task descriptions—including objectives, output formats, tool guidance, and clear boundaries—proved essential.
3. Scale Effort Appropriately Simple queries require one agent with 3-10 tool calls; complex research might deploy 10+ specialized subagents. Explicit guidelines prevented overinvestment in straightforward tasks.
4. Prioritize Tool Design Agent-tool interfaces are as critical as human-computer interfaces. "Bad tool descriptions can send agents down completely wrong paths," requiring distinct purposes and clear documentation.
5. Leverage Agent Self-Improvement Claude 4 models can diagnose prompt failures and suggest improvements. Anthropic created tool-testing agents that rewrote flawed descriptions, achieving "40% decrease in task completion time" for subsequent operations.
6. Start Broad, Then Narrow Expert research mirrors this pattern: wide exploration precedes focused investigation. Agents initially defaulted to overly specific queries returning few results; broader approaches proved more effective.
7. Guide Thinking Processes Extended thinking mode provides visible reasoning tokens. Lead agents use this for planning; subagents employ interleaved thinking post-tool-use to evaluate quality and refine queries.
8. Parallelize Tool Calling Sequential searches were painfully slow. Parallel approaches—spinning up 3-5 subagents simultaneously and using 3+ tools in parallel—"cut research time by up to 90%" for complex queries.
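The parallelism in strategy 8 operates at two levels: subagents run concurrently, and each subagent fires multiple tools concurrently. A minimal sketch using `asyncio`; the tool names and the simulated latency are illustrative assumptions:

```python
import asyncio

async def tool_call(name: str, query: str) -> str:
    """Stand-in for one tool invocation (web search, doc fetch, etc.)."""
    await asyncio.sleep(0.01)      # simulated I/O latency
    return f"{name} results for {query!r}"

async def subagent(query: str) -> list[str]:
    # Level 2: each subagent fires 3+ tools concurrently, not sequentially.
    tools = ["web_search", "internal_docs", "news_search"]
    return await asyncio.gather(*(tool_call(t, query) for t in tools))

async def lead_agent(queries: list[str]) -> list[list[str]]:
    # Level 1: 3-5 subagents also run concurrently, compounding the speedup.
    return await asyncio.gather(*(subagent(q) for q in queries))

results = asyncio.run(lead_agent(["supply chain", "fab capacity", "export rules"]))
```

With the simulated 10 ms latency, all nine tool calls overlap, so total wall time stays near a single call's latency instead of nine times it.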
Evaluation Methodology
Unique Challenges
Multi-agent systems defy traditional evaluation frameworks. Agents may take completely different valid paths to reach identical goals, making process-focused evaluation inadequate.
Practical Approaches
Small-Sample Testing First: Starting with ~20 test queries let Anthropic observe dramatic effects from prompt tweaks (success rates jumping from 30% to 80%), effects large enough that extensive early-stage test sets were unnecessary.
LLM-as-Judge Evaluation: A single LLM judge evaluated outputs against rubrics covering factual accuracy, citation correctness, completeness, source quality, and tool efficiency. This approach scaled to hundreds of outputs effectively.
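The single-judge pattern can be sketched as a small scoring scaffold. The rubric criteria come from this section; the prompt wording, the `judge_llm` completion function, and the pass threshold are hypothetical assumptions, not Anthropic's actual setup:

```python
from dataclasses import dataclass

RUBRIC = ["factual accuracy", "citation correctness", "completeness",
          "source quality", "tool efficiency"]

@dataclass
class Judgment:
    scores: dict[str, float]   # one 0.0-1.0 score per criterion
    passed: bool

def judge(output: str, judge_llm, threshold: float = 0.7) -> Judgment:
    """Single-call LLM judge: one prompt covers the whole rubric,
    which is what lets this scale to hundreds of outputs."""
    prompt = (
        "Score the research report below from 0.0 to 1.0 on each criterion:\n"
        + "\n".join(f"- {c}" for c in RUBRIC)
        + f"\n\nReport:\n{output}\n\nAnswer as 'criterion: score' lines."
    )
    raw = judge_llm(prompt)    # judge_llm: hypothetical completion function
    scores = {}
    for line in raw.splitlines():
        name, _, value = line.partition(":")
        if name.strip() in RUBRIC:
            scores[name.strip()] = float(value)
    mean = sum(scores.values()) / len(scores)
    return Judgment(scores=scores, passed=mean >= threshold)
```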
Human Evaluation: Automated systems missed edge cases—notably, early agents favored SEO-optimized content farms over authoritative academic sources. Manual testing remained essential for discovering subtle selection biases.
Production Reliability Challenges
Statefulness and Error Compounding
Agents maintain state across many tool calls. Unlike traditional software, minor errors cascade unpredictably. Solutions include:
- Resume capability from failure points rather than complete restarts
- Graceful error handling using agent intelligence
- Hybrid approaches combining AI adaptability with deterministic safeguards
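The first bullet, resuming from failure points, can be sketched with a simple checkpoint file. The filename and step structure are hypothetical; a production system would persist richer agent state, but the shape is the same:

```python
import json
from pathlib import Path

CHECKPOINT = Path("research_state.json")   # hypothetical state file

def run_step(step: int, state: dict) -> dict:
    """Stand-in for one tool call; in practice this may raise transiently."""
    state["completed"].append(step)
    return state

def research_with_resume(total_steps: int) -> dict:
    # Resume from the last checkpoint instead of restarting from scratch.
    if CHECKPOINT.exists():
        state = json.loads(CHECKPOINT.read_text())
    else:
        state = {"completed": []}
    for step in range(total_steps):
        if step in state["completed"]:
            continue                              # done before the failure
        state = run_step(step, state)
        CHECKPOINT.write_text(json.dumps(state))  # persist after each step
    return state
```

If the process dies at step 7 of 10, the next run skips steps 0-6 and pays only for the remaining work, rather than burning tokens re-running every prior tool call.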
Debugging Complexity
Non-deterministic behavior across identical prompts complicates diagnosis. Anthropic implemented production tracing and high-level observability of decision patterns (without monitoring conversation contents) to identify root causes systematically.
Deployment Strategy
Rainbow deployments—gradually shifting traffic between old and new versions—prevent disruption to running agents during updates.
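The essential property of a rainbow deployment for agents is sticky routing: a session that started on the old version finishes on it, even as new traffic shifts to the new version. A minimal sketch; the version labels and rollout shares are illustrative assumptions:

```python
import hashlib

def route(session_id: str, new_version_share: float) -> str:
    """Sticky version routing: hashing the session ID pins each running
    agent to one version, so an update never flips it mid-run."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = digest[0] / 256            # stable value in [0, 1)
    return "new" if bucket < new_version_share else "old"

# Ramp gradually (e.g. 5% -> 25% -> 100% of *new* sessions) while
# sessions already pinned to "old" keep their version until they finish.
```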
Synchronous Bottlenecks
Current systems execute subagents synchronously, waiting for completion before proceeding. This simplifies coordination but limits real-time steering. Asynchronous execution would enable additional parallelism but introduces coordination complexity with state consistency and error propagation challenges.
Real-World Impact
Users report that the Research feature helped them:
- Identify previously unconsidered business opportunities
- Navigate complex healthcare decisions
- Resolve technical problems
- "Save up to days of work" through uncovered connections
Additional Appendix Tips
End-State Evaluation: For agents modifying persistent state, focus on whether final states achieved intended outcomes rather than validating intermediate steps.
Long-Horizon Context Management: Store completed work phases in external memory; spawn fresh subagents with clean contexts while maintaining handoff continuity through carefully managed references.
Artifact-Based Output: Allow subagents to persist work directly to external systems rather than routing everything through the lead agent, preventing information loss and reducing token overhead.
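The last two tips share a mechanism: the subagent writes its full output to an external store and hands the lead agent only a lightweight reference. A minimal sketch using the local filesystem as a stand-in for that store; all names are hypothetical:

```python
import json
import tempfile
from pathlib import Path

ARTIFACT_DIR = Path(tempfile.mkdtemp())   # stand-in for an external store

def persist_artifact(subagent_id: str, content: dict) -> str:
    """Subagent persists its full findings externally and returns only
    a reference, keeping the lead agent's context window small."""
    path = ARTIFACT_DIR / f"{subagent_id}.json"
    path.write_text(json.dumps(content))
    return str(path)                      # the handoff reference

def load_artifact(ref: str) -> dict:
    """Any later agent (or a fresh subagent) can rehydrate from the reference."""
    return json.loads(Path(ref).read_text())

# The lead agent receives a reference, not the raw content:
ref = persist_artifact("subagent-1", {"findings": ["fact A", "fact B"]})
summary_for_lead = {"subagent": "subagent-1", "artifact": ref}
```

The same mechanism supports long-horizon context management: a fresh subagent spawned with a clean context can pick up exactly where a previous phase left off by loading the referenced artifact.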
Key Takeaway
The transition from prototype to production revealed that "the last mile often becomes most of the journey." Reliable agentic systems require meticulous engineering, comprehensive testing, detail-oriented design, robust operations, and tight collaboration between teams with a deep understanding of current agent capabilities.