Constitutional Classifiers: Defending Against Universal Jailbreaks
Overview
Anthropic's Safeguards Research Team has developed Constitutional Classifiers, a defense mechanism against universal jailbreaks—prompting strategies designed to bypass AI safety measures across multiple harmful queries simultaneously.
Key Results
Prototype Testing: During an independent red-teaming program, 183 active participants spent over 3,000 hours attempting to discover universal jailbreaks. Despite substantial effort and monetary incentives up to $15,000, no participant successfully created a universal jailbreak against the prototype system.
Automated Evaluations: Testing on 10,000 synthetically generated jailbreak prompts showed dramatic improvements:
- Without classifiers: 86% jailbreak success rate
- With Constitutional Classifiers: 4.4% success rate (over 95% blocked)
- Minimal performance cost: only 0.38% increase in refusal rates on benign queries
- Compute overhead: 23.7% increase
Live Demo Results (February 2025)
The system resisted attacks for five days before vulnerabilities emerged. Of 339 active jailbreakers attempting over 300,000 interactions:
- Four participants ultimately bypassed all eight challenge levels
- One discovered what Anthropic determined to be a true universal jailbreak
- Successful techniques included ciphers, role-play scenarios, keyword substitution, and prompt injection
Anthropic distributed $55,000 in prizes to successful participants.
How It Works
Constitutional Classifiers operate through:
- Constitution Development: Defining allowed versus disallowed content classes
- Synthetic Data Generation: Creating diverse training examples with multiple language and style variations
- Classifier Training: Building input and output filters using synthetic data plus benign query examples
- Content Filtering: Detecting and blocking potentially harmful responses
Limitations and Next Steps
The system cannot prevent every possible jailbreak, though discovered vulnerabilities require substantially more effort to exploit. Anthropic recommends using Constitutional Classifiers alongside complementary defenses and plans to continuously update the constitution as new attack patterns emerge.
The technique represents progress toward safely deploying increasingly capable models while maintaining robust safeguards against CBRN-related harms.