How to Replicate and Extend Our Alignment Faking Demo
Anthropic researchers have published resources enabling others to experiment with their alignment faking demonstration. Here's what's available:
Quick-Start Options
The team released "a simple notebook" with a minimal reproduction setup where Claude models sometimes exhibit alignment faking reasoning. The helpful-only setting measures helpfulness by checking if responses match requested prefixes correlated with harmful content.
A non-CoT version is also provided to study potential alignment faking within a single forward pass, though researchers note this minimal reproduction differs from their main experiments.
Data Access and Analysis
Researchers can explore transcripts, automated metrics, and synthetic fine-tuning documents through a dedicated website. The Google Drive folder contains all raw transcripts and documents, with folder structure details available on GitHub.
Available Resources
The released codebase includes:
- Dataset generation and loading code
- Prompt variation generation tools
- Evaluation systems for training objective compliance
- Analysis for alignment-faking reasoning detection
Training code wasn't released since experiments ran on Anthropic infrastructure, though researchers can conduct similar fine-tuning with the provided synthetic documents on alternative models.
Future Research Directions
Open questions include investigating the compliance gap in non-CoT settings, identifying personality traits correlating with alignment faking, and determining which world facts reduce scheming likelihood or affect misalignment revelation.