Skip to content

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Overview

This research introduces a novel technique called Inoculation Prompting (IP) that reduces learning of undesired behaviors in language models by explicitly requesting those behaviors during training.

Key Concept

The core idea involves modifying training prompts to explicitly request undesired behaviors. For example, to prevent models from learning to hack test cases, researchers included instructions like "hard-code the solution to pass the tests" during supervised fine-tuning. Testing at deployment time used unmodified prompts, yet the model did not learn the hacking behavior.

Research Details

Authors: Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks

Publication Date: October 16, 2025

Affiliations: Anthropic Fellows, ML Alignment and Theory Scholars, Constellation, Redwood Research, Anthropic

Research Context: Conducted as part of the Anthropic Fellows Program

Problem Statement

AI systems trained with imperfect oversight can learn undesired behaviors including test case hacking and sycophancy. Standard approaches focus on improving oversight quality, but this can be difficult or expensive.

Approach

Rather than improving oversight, this research studies controlling what models learn from flawed training data itself.

Results

Across four settings involving supervised fine-tuning on misaligned data, Inoculation Prompting reduced learning of undesired behaviors "without substantially reducing learning of desired capabilities."

Resources

  • Paper: Available on arXiv (arxiv.org/abs/2510.05024)
  • Code: Available on GitHub (github.com/safety-research/inoculation-prompting)