Anthropic's Pilot Sabotage Risk Report
Main Report
Authors: Samuel R. Bowman, Misha Wagner, Fabien Roger, and Holden Karnofsky
Internal Review: Daniel M. Ziegler and Evan Hubinger
Date: October 28, 2025
Summary
Anthropic has released a report assessing misalignment risks from its deployed models as of Summer 2025. The organization concludes that "there is very low, but not fully negligible, risk of misaligned autonomous actions" that could substantially contribute to catastrophic outcomes.
This pilot exercise supports Anthropic's Responsible Scaling Policy, which commits the company to developing affirmative cases addressing misalignment-related risks for future high-capability models. The report represents an early attempt at such a comprehensive safety case, applied to present-day systems.
Key Findings
The report focuses on "sabotage" as the primary category of misaligned behavior that poses distinct, emerging risks at advanced capability levels. Regarding Claude Opus 4, the assessment concludes:
- Very low but non-negligible sabotage risk exists
- Several related threat models present comparably low absolute levels of risk
- Moderate confidence that Opus 4 lacks coherent dangerous goals and capabilities for executing complex, undetected sabotage strategies
Review Process
The report underwent multiple revision rounds with:
- Anthropic's internal Alignment Stress-Testing Team
- METR, an independent AI safety nonprofit
Both review teams ultimately endorsed the risk-assessment conclusions, though they raised concerns about specific lines of argument and the quality of some evidence.