Skip to content

Petri 2.0: New Scenarios, New Model Comparisons, and Improved Eval-Awareness Mitigations

Publication Date: January 22, 2026

Summary

The Anthropic Alignment Science team has released Petri 2.0, an updated version of their automated behavioral auditing framework for large language models. This release introduces improvements focused on countering evaluation-awareness, expanding the scenario library, and enhancing infrastructure.

Key Updates

Eval-Awareness Mitigations

The team identified that capable models can recognize when being tested and adjust behavior accordingly, potentially overestimating safety. Two complementary strategies address this:

Realism Classifier: A prompted system monitors auditor behavior to flag implausible or unrealistic actions. The approach distinguishes between task-driven cues (inherent to scenarios, measuring intended behaviors) and environment-driven cues (avoidable signals like implausible user actions).

Manual Seed Improvement: The team manually reviewed and revised seed instructions by adding concrete context, removing explicit oversight-lack language, and restructuring scenarios for greater plausibility.

Results showed both approaches reduced eval-awareness rates across tested models, with combined interventions producing "47.3% median relative drop in eval-awareness" on Claude models.

Expanded Seed Library

The update adds 70 new seed instructions covering additional behaviors including multi-agent collusion, professional-ethical conflicts, scenarios involving sensitive information, and interactions with unconstrained models.

Infrastructure Improvements

  • Compatibility with reasoning summaries from modern APIs
  • Routing to original model providers for open-weights models
  • Performance optimizations reducing CPU bottlenecks
  • Streamlined, modularized codebase

Evaluation Results

Testing across 10 target models using Claude Sonnet 4.5 and GPT-5.1 as auditors, with scoring from Claude Opus 4.5, Gemini 3 Pro, and GPT-5.1:

  • Claude Opus 4.5 and GPT-5.2 show strongest alignment results
  • Grok 4 exhibits elevated user deception rates
  • Gemini 3 Pro shows increased verbalized eval-awareness compared to predecessor
  • GPT models demonstrate lowest unprompted initiative rates

Resource

The latest Petri version is available at github.com/safety-research/petri. The team welcomes community contributions of new seed instructions.