Stress-testing model specs reveals character differences among language models
Overview
Researchers from Anthropic's Fellows program, in collaboration with Thinking Machines Lab, analyzed how leading AI models handle conflicting principles. They generated more than 300,000 user queries, each forcing a model to choose between competing values embedded in its specification.
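A minimal sketch of how such conflict queries could be generated from a value taxonomy appears below. The prompt template and function names are hypothetical illustrations, not the paper's actual pipeline.

```python
import itertools
from typing import Iterable, Iterator

# Hypothetical illustration: pair up values from a taxonomy and build
# prompts asking a generator model to write a scenario that forces a
# tradeoff between the two values.
PROMPT_TEMPLATE = (
    "Write a realistic user query that an AI assistant cannot answer "
    "without trading off '{a}' against '{b}'. Satisfying both values "
    "at once should be impossible."
)

def conflict_prompts(values: Iterable[str]) -> Iterator[tuple[str, str, str]]:
    """Yield (value_a, value_b, generation_prompt) for every value pair."""
    for a, b in itertools.combinations(values, 2):
        yield a, b, PROMPT_TEMPLATE.format(a=a, b=b)

if __name__ == "__main__":
    taxonomy = ["helpfulness", "user privacy", "business effectiveness"]
    for a, b, prompt in conflict_prompts(taxonomy):
        print(f"[{a} vs. {b}]\n{prompt}\n")
```

Note that exhaustive pairing of 3,307 values would yield roughly 5.5 million pairs, so producing 300,000 queries presumably involved sampling a subset of pairs or scenarios.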
Key Findings
The research identified distinct behavioral patterns across frontier models from Anthropic, OpenAI, Google DeepMind, and xAI. The work revealed "thousands of cases of direct contradictions or interpretive ambiguities within the model spec," highlighting gaps in how competing principles are resolved.
The Core Problem
Model specifications serve as behavioral guidelines established during training, listing principles such as helpfulness, assuming good faith, and respecting safety boundaries. These specifications handle routine requests well, but tensions arise when principles conflict: balancing business effectiveness against social equity, for instance.
The researchers note that "when specifications don't provide clear guidance for these conflicts, the training signals from methods like Constitutional AI or deliberative alignment often become mixed, or blurrier." Those mixed signals leave each model to resolve the tension on its own, producing divergent behaviors.
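As a toy illustration of that mixing (an assumption for exposition, not the paper's method), consider two judges derived from different spec principles scoring the same pair of candidate responses: averaging their preferences yields a label near 0.5 that carries almost no training signal.

```python
# Toy illustration: two spec principles act as judges over the same
# response pair. Each returns a preference for response A (a detailed
# answer) over response B (a cautious refusal) on a 0..1 scale.
from statistics import mean

# Hard-coded stand-ins for principle-conditioned LLM judges.
judgments = {
    "be broadly helpful": 0.9,    # favors the detailed response A
    "avoid enabling harm": 0.1,   # favors the cautious response B
}

# Aggregating across principles collapses the label toward 0.5, so the
# preference signal fed into training is effectively noise.
label = mean(judgments.values())
print(f"aggregate preference for A: {label:.2f}")  # 0.50
```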
Methodology
The team drew on a previously identified taxonomy of 3,307 fine-grained values that Claude models express. They constructed scenarios that each require an explicit tradeoff between a pair of these values, then measured how strongly twelve frontier models disagreed in their responses.
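One way to quantify that disagreement is sketched below: assume each model's response to a scenario is first classified by which of the two conflicting values it prioritized, then score the resulting split with normalized entropy. Both the classification step and the entropy metric are assumptions for illustration, not necessarily the paper's exact measure.

```python
import math
from collections import Counter

def disagreement(choices: list[str]) -> float:
    """Normalized entropy of the models' value choices:
    0.0 = all models chose the same value, 1.0 = maximally split."""
    counts = Counter(choices)
    n = len(choices)
    if len(counts) < 2:
        return 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))

# Twelve models, each choosing which value to prioritize in one scenario:
choices = ["equity"] * 7 + ["effectiveness"] * 5
print(f"disagreement = {disagreement(choices):.2f}")  # ~0.98: heavily split
```

Scenarios with scores near 1.0 would be exactly the cases where the spec's guidance is too ambiguous for models to converge.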
Implications
The findings suggest that model specifications need refinement to resolve their inherent contradictions and ambiguities, which could improve the effectiveness of alignment training.
Authors: Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, Esin Durmus
Date: October 24, 2025
Resources: Paper, Dataset