Cost-Effective Constitutional Classifiers via Representation Re-use
Authors
Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma
TL;DR
The research demonstrates that safety classifiers can be made significantly more efficient by repurposing computations already performed by AI models. Rather than deploying separate jailbreak detection systems, the authors show that fine-tuning only final layers or using linear probes on intermediate activations achieves comparable performance to much larger dedicated classifiers at substantially lower computational cost.
Introduction
Large language models pose potential safety risks, prompting organizations such as OpenAI, Google DeepMind, and Anthropic to develop frameworks that address deployment concerns. While previous Constitutional Classifiers research showed that separate classifier models improve robustness in identifying dangerous inputs, "using Claude 3.5 Haiku as a safety filter for Claude 3.5 Sonnet increases inference costs by approximately 25%."
This work explores reducing that computational overhead by leveraging existing model computations rather than deploying entirely separate classifiers.
Key Findings
The research evaluates two cost-reduction approaches:
- Linear probing of activations - Lightweight classifiers operating on internal model representations for efficient harmful content detection
- Partial fine-tuning - Retraining only final layers while sharing the model's backbone
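The first approach can be illustrated with a minimal numpy sketch. A single linear map scores mean-pooled intermediate activations; the pooling choice, dimensions, and function names here are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_probe_score(activations, w, b):
    """Score a sequence of hidden states with a linear probe.

    activations: (seq_len, d_model) intermediate-layer activations.
    w, b: probe weight vector (d_model,) and scalar bias.
    Returns a harmfulness probability via mean-pooling + sigmoid.
    """
    pooled = activations.mean(axis=0)       # mean-pool over tokens
    logit = pooled @ w + b                  # one linear map: the whole probe
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> probability

# Toy usage: d_model=16 activations for an 8-token prompt.
acts = rng.normal(size=(8, 16))
w = rng.normal(size=16)
score = linear_probe_score(acts, w, 0.0)
```

Because the activations are already computed during the model's forward pass, the marginal cost of the probe is just this single matrix-vector product.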
Testing on datasets that include adversarial red-team attacks revealed:
- Retraining only the final layer matches the performance of dedicated classifiers one-quarter the main model's size, at approximately 4% of the computational cost
- EMA linear probes match the performance of classifiers with 2-3% overhead at negligible additional cost
- Suffix probes perform slightly better but approach full classifier costs
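For intuition on the EMA probe, here is a minimal sketch of the readout, assuming (as the name suggests) that per-token probe logits are combined with an exponential moving average; the decay value and names are illustrative, not taken from the paper.

```python
def ema_probe_score(token_logits, decay=0.9):
    """Combine per-token probe logits into one score via an
    exponential moving average. The running value can be read out
    cheaply at any point during generation, at essentially no cost."""
    ema = float(token_logits[0])
    for x in token_logits[1:]:
        ema = decay * ema + (1.0 - decay) * float(x)
    return ema

# A late spike in per-token harm scores still moves the final value.
scores = [0.1, 0.1, 0.1, 0.9, 0.9]
final = ema_probe_score(scores)
```

The smoothing makes the score robust to single-token noise while remaining streamable, which is what keeps the added cost negligible.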
Methodology
The team generated synthetic data using Constitutional Classifiers pipelines and trained classifiers spanning a range of cost-performance profiles, benchmarking them against Claude 3 Sonnet variants on detecting biological-weapons instructions.
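As a toy illustration of the partial fine-tuning idea, the sketch below freezes a stand-in "backbone" and trains only a logistic-regression head on its features. The backbone, data, and hyperparameters are all invented for illustration; the real setup retrains final transformer layers, not a random projection.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IN, D_HID = 8, 16
W_backbone = rng.normal(size=(D_IN, D_HID))  # frozen: never updated below

def features(x):
    """Stand-in for the shared model trunk, computed once and reused."""
    return np.tanh(x @ W_backbone)

def train_head(X, y, steps=500, lr=0.5):
    """Train only a logistic-regression head on frozen backbone features."""
    F = features(X)
    w, b = np.zeros(D_HID), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
        g = p - y                              # gradient of BCE loss w.r.t. logits
        w -= lr * F.T @ g / len(y)             # only head parameters move
        b -= lr * g.mean()
    return w, b

# Toy data: the label depends on the sign of the first input dimension.
X = rng.normal(size=(64, D_IN))
y = (X[:, 0] > 0).astype(float)
w, b = train_head(X, y)
p = 1.0 / (1.0 + np.exp(-(features(X) @ w + b)))
acc = ((p > 0.5) == y).mean()
```

The point of the sketch is the cost structure: the expensive trunk computation is shared with the main model, so only the small head adds training and inference overhead.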
Important Caveat
The authors emphasize that "best practice calls for adaptive red-teaming before taking our performance evaluations at face value," acknowledging that comprehensive adversarial testing remains essential before deployment.