Constitutional Classifiers: Defending Against Universal Jailbreaks

Overview

Anthropic's Safeguards Research Team has developed Constitutional Classifiers, a defense mechanism against universal jailbreaks—prompting strategies designed to bypass AI safety measures across multiple harmful queries simultaneously.

Key Results

Prototype Testing: During an independent red-teaming program, 183 active participants spent over 3,000 hours attempting to discover universal jailbreaks. Despite substantial effort and monetary incentives up to $15,000, no participant successfully created a universal jailbreak against the prototype system.

Automated Evaluations: Testing on 10,000 synthetically generated jailbreak prompts showed dramatic improvements:

Without classifiers: 86% jailbreak success rate
With Constitutional Classifiers: 4.4% success rate (over 95% blocked)
Minimal performance cost: only 0.38% increase in refusal rates on benign queries
Compute overhead: 23.7% increase

Live Demo Results (February 2025)

The system resisted attacks for five days before vulnerabilities emerged. Of 339 active jailbreakers attempting over 300,000 interactions:

Four participants ultimately bypassed all eight challenge levels
One discovered what Anthropic determined to be a true universal jailbreak
Successful techniques included ciphers, role-play scenarios, keyword substitution, and prompt injection

Anthropic distributed $55,000 in prizes to successful participants.

How It Works

Constitutional Classifiers operate through:

Constitution Development: Defining allowed versus disallowed content classes
Synthetic Data Generation: Creating diverse training examples with multiple language and style variations
Classifier Training: Building input and output filters using synthetic data plus benign query examples
Content Filtering: Detecting and blocking potentially harmful responses

Limitations and Next Steps

The system cannot prevent every possible jailbreak, though discovered vulnerabilities require substantially more effort to exploit. Anthropic recommends using Constitutional Classifiers alongside complementary defenses and plans to continuously update the constitution as new attack patterns emerge.

The technique represents progress toward safely deploying increasingly capable models while maintaining robust safeguards against CBRN-related harms.

Constitutional Classifiers: Defending Against Universal Jailbreaks ​

Overview ​

Key Results ​

Live Demo Results (February 2025) ​