How to Replicate and Extend Our Alignment Faking Demo

Anthropic researchers have published resources enabling others to experiment with their alignment faking demonstration. Here's what's available:

Quick-Start Options

The team released a simple notebook containing a minimal reproduction setup in which Claude models sometimes exhibit alignment-faking reasoning. In the helpful-only setting, helpfulness is measured by checking whether a response begins with a requested prefix, a signal that correlates with complying with harmful requests.
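As a rough illustration of this prefix-based compliance check, here is a minimal sketch. The function name and example prefix are hypothetical, not taken from the released notebook:

```python
def starts_with_requested_prefix(response: str, prefix: str) -> bool:
    """Hypothetical helper: count a response as compliant if it begins
    with the prefix the request asked for (ignoring leading whitespace)."""
    return response.lstrip().startswith(prefix)


# A refusal typically breaks the requested prefix, so a plain string
# check is enough to separate compliance from refusal in this setting.
```

In practice such a check would be run over many sampled responses to estimate a compliance rate per condition.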

A non-CoT version is also provided for studying potential alignment faking within a single forward pass, though the researchers note that this minimal reproduction differs from their main experiments.

Data Access and Analysis

Researchers can explore transcripts, automated metrics, and synthetic fine-tuning documents through a dedicated website. The Google Drive folder contains all raw transcripts and documents, with folder structure details available on GitHub.

Available Resources

The released codebase includes:

  • Dataset generation and loading code
  • Tools for generating prompt variations
  • Evaluation code for measuring compliance with the training objective
  • Analysis tools for detecting alignment-faking reasoning
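To give a flavor of what detecting alignment-faking reasoning can involve, here is a deliberately crude sketch: a keyword pre-filter for surfacing transcripts whose hidden reasoning mentions training-related motives. The released analysis is more sophisticated, and these phrases are assumptions, not the actual criteria:

```python
# Illustrative pre-filter only; the phrase list is an assumption,
# not the classifier used in the released codebase.
CANDIDATE_PHRASES = (
    "preserve my values",
    "modified by training",
    "rlhf",
    "avoid being retrained",
)


def flag_for_review(scratchpad: str) -> bool:
    """Return True if the scratchpad mentions any candidate phrase,
    marking the transcript for closer (e.g. model-graded) inspection."""
    text = scratchpad.lower()
    return any(phrase in text for phrase in CANDIDATE_PHRASES)
```

A cheap filter like this would only narrow down candidates; confirming genuine alignment-faking reasoning requires reading the transcript or applying a stronger automated grader.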

Training code was not released because the experiments ran on Anthropic infrastructure; however, researchers can run similar fine-tuning on other models using the provided synthetic documents.
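For anyone attempting such fine-tuning elsewhere, a first step is converting the synthetic documents into a standard training format. This is a minimal sketch under assumed conventions (plain-text files, one JSON object per line with a `"text"` field); the actual folder structure is documented on GitHub:

```python
import json
from pathlib import Path


def docs_to_jsonl(doc_dir: str, out_path: str) -> int:
    """Write each .txt document in doc_dir as one {"text": ...} JSON line,
    a common input format for fine-tuning pipelines. Returns the count."""
    docs = sorted(Path(doc_dir).glob("*.txt"))
    with open(out_path, "w") as f:
        for doc in docs:
            f.write(json.dumps({"text": doc.read_text()}) + "\n")
    return len(docs)
```

The resulting JSONL file can then be fed to whatever continued-pretraining or fine-tuning tooling is available for the target model.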

Future Research Directions

Open questions include investigating the compliance gap in non-CoT settings, identifying personality traits that correlate with alignment faking, and determining which facts about the world reduce the likelihood of scheming or affect whether models reveal their misalignment.