Cost-Effective Constitutional Classifiers via Representation Re-use
Authors
Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma
TL;DR
The research demonstrates that safety classifiers can be made significantly more efficient by repurposing computations already performed by AI models. Rather than deploying separate jailbreak detection systems, the authors show that fine-tuning only final layers or using linear probes on intermediate activations achieves comparable performance to much larger dedicated classifiers at substantially lower computational cost.
Introduction
Large language models pose potential safety risks, prompting organizations such as OpenAI, Google DeepMind, and Anthropic to develop frameworks that address deployment concerns. While previous Constitutional Classifiers research showed that separate classifier models improve robustness in identifying dangerous inputs, "using Claude 3.5 Haiku as a safety filter for Claude 3.5 Sonnet increases inference costs by approximately 25%."
This work explores reducing that computational overhead by leveraging existing model computations rather than deploying entirely separate classifiers.
Key Findings
The research evaluates two cost-reduction approaches:
- Linear probing of activations - Lightweight classifiers operating on internal model representations for efficient harmful content detection
- Partial fine-tuning - Retraining only final layers while sharing the model's backbone
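The first approach can be illustrated with a minimal numpy sketch. A single linear map scores mean-pooled intermediate activations; the pooling choice, dimensions, and function names here are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_probe_score(activations, w, b):
    """Score a sequence of hidden states with a linear probe.

    activations: (seq_len, d_model) intermediate-layer activations.
    w, b: probe weight vector (d_model,) and scalar bias.
    Returns a harmfulness probability via mean-pooling + sigmoid.
    """
    pooled = activations.mean(axis=0)       # mean-pool over tokens
    logit = pooled @ w + b                  # one linear map: the whole probe
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> probability

# Toy usage: d_model=16 activations for an 8-token prompt.
acts = rng.normal(size=(8, 16))
w = rng.normal(size=16)
score = linear_probe_score(acts, w, 0.0)
```

Because the activations are already computed during the model's forward pass, the marginal cost of the probe is just this single matrix-vector product.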
Testing on datasets that include adversarial red-team attacks revealed:
- Retraining only the final layer matches the performance of dedicated classifiers one-quarter the main model's size, at approximately 4% of the computational cost
- EMA linear probes match the performance of classifiers with 2-3% overhead at negligible additional cost
- Suffix probes perform slightly better but approach full classifier costs
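For intuition on the EMA probe, here is a minimal sketch of the readout, assuming (as the name suggests) that per-token probe logits are combined with an exponential moving average; the decay value and names are illustrative, not taken from the paper.

```python
def ema_probe_score(token_logits, decay=0.9):
    """Combine per-token probe logits into one score via an
    exponential moving average. The running value can be read out
    cheaply at any point during generation, at essentially no cost."""
    ema = float(token_logits[0])
    for x in token_logits[1:]:
        ema = decay * ema + (1.0 - decay) * float(x)
    return ema

# A late spike in per-token harm scores still moves the final value.
scores = [0.1, 0.1, 0.1, 0.9, 0.9]
final = ema_probe_score(scores)
```

The smoothing makes the score robust to single-token noise while remaining streamable, which is what keeps the added cost negligible.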
Methodology
The team generated synthetic data using Constitutional Classifiers pipelines and trained classifiers spanning a range of cost-performance profiles, benchmarking them against Claude 3 Sonnet variants on detecting biological-weapons instructions.
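As a toy illustration of the partial fine-tuning idea, the sketch below freezes a stand-in "backbone" and trains only a logistic-regression head on its features. The backbone, data, and hyperparameters are all invented for illustration; the real setup retrains final transformer layers, not a random projection.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IN, D_HID = 8, 16
W_backbone = rng.normal(size=(D_IN, D_HID))  # frozen: never updated below

def features(x):
    """Stand-in for the shared model trunk, computed once and reused."""
    return np.tanh(x @ W_backbone)

def train_head(X, y, steps=500, lr=0.5):
    """Train only a logistic-regression head on frozen backbone features."""
    F = features(X)
    w, b = np.zeros(D_HID), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
        g = p - y                              # gradient of BCE loss w.r.t. logits
        w -= lr * F.T @ g / len(y)             # only head parameters move
        b -= lr * g.mean()
    return w, b

# Toy data: the label depends on the sign of the first input dimension.
X = rng.normal(size=(64, D_IN))
y = (X[:, 0] > 0).astype(float)
w, b = train_head(X, y)
p = 1.0 / (1.0 + np.exp(-(features(X) @ w + b)))
acc = ((p > 0.5) == y).mean()
```

The point of the sketch is the cost structure: the expensive trunk computation is shared with the main model, so only the small head adds training and inference overhead.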
Important Caveat
The authors emphasize that "best practice calls for adaptive red-teaming before taking our performance evaluations at face value," acknowledging that comprehensive adversarial testing remains essential before deployment.