Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Overview
This research introduces a novel technique called Inoculation Prompting (IP) that reduces learning of undesired behaviors in language models by explicitly requesting those behaviors during training.
Key Concept
The core idea involves modifying training prompts to explicitly request undesired behaviors. For example, to prevent models from learning to hack test cases, researchers included instructions like "hard-code the solution to pass the tests" during supervised fine-tuning. Testing at deployment time used unmodified prompts, yet the model did not learn the hacking behavior.
Research Details
Authors: Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks
Publication Date: October 16, 2025
Affiliations: Anthropic Fellows, ML Alignment and Theory Scholars, Constellation, Redwood Research, Anthropic
Research Context: Conducted as part of the Anthropic Fellows Program
Problem Statement
AI systems trained with imperfect oversight can learn undesired behaviors including test case hacking and sycophancy. Standard approaches focus on improving oversight quality, but this can be difficult or expensive.
Approach
Rather than improving oversight, this research studies controlling what models learn from flawed training data itself.
Results
Across four settings involving supervised fine-tuning on misaligned data, Inoculation Prompting reduced learning of undesired behaviors "without substantially reducing learning of desired capabilities."
Resources
- Paper: Available on arXiv (arxiv.org/abs/2510.05024)
- Code: Available on GitHub (github.com/safety-research/inoculation-prompting)