Anthropic's Pilot Sabotage Risk Report

Main Report

Authors: Samuel R. Bowman, Misha Wagner, Fabien Roger, and Holden Karnofsky

Internal Review: Daniel M. Ziegler and Evan Hubinger

Date: October 28, 2025

Summary

Anthropic has released a report assessing misalignment risks from its deployed models as of Summer 2025. The company concludes that "there is very low, but not fully negligible, risk of misaligned autonomous actions" that could substantially contribute to catastrophic outcomes.

This pilot exercise supports Anthropic's Responsible Scaling Policy, which commits the company to developing affirmative cases addressing misalignment-related risks for future high-capability models. The report represents an early attempt at such a comprehensive safety case, applied to present-day systems.

Key Findings

The report focuses on "sabotage" as the primary category of misaligned behaviors posing distinct emerging risks at advanced capability levels. Regarding Claude Opus 4, the assessment concludes:

  • Very low but non-negligible sabotage risk exists
  • Multiple similar threat models present comparable low absolute risk levels
  • Moderate confidence that Opus 4 lacks coherent dangerous goals and capabilities for executing complex, undetected sabotage strategies

Review Process

The report underwent multiple revision rounds with:

  • Anthropic's internal Alignment Stress-Testing Team
  • METR, an independent AI safety nonprofit

Both review teams ultimately endorsed the risk assessment's conclusions, though each raised concerns about specific lines of argument and the quality of some evidence.

Documents Available