Modifying LLM Beliefs with Synthetic Document Finetuning

Introduction

Large language models develop implicit beliefs about the world during training that shape their reasoning and behavior. This work studies whether these beliefs can be systematically modified, creating possibilities for safer AI deployment.

Controlling AI system beliefs can decrease risk in several ways:

  • Model organisms research: Training intentionally misaligned models to understand dangerous misalignment mechanisms
  • Unlearning: Teaching models incorrect knowledge about hazardous topics to overwrite prior harmful information
  • Honeypotting: Creating scenarios where misaligned models exhibit observable "tells" for identification
  • Situational control: Giving misaligned models false beliefs about deployment conditions (monitoring, security) to make them easier to control

Methods

The researchers describe synthetic document finetuning (SDF): an LLM first generates a corpus of synthetic documents that reference a target proposition, and the model is then supervised-finetuned on these documents as if they were pretraining data. The resulting model typically behaves consistently with the inserted belief, even when that belief is false.
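The data-generation step described above can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: in the real pipeline an LLM writes diverse documents referencing the target proposition, whereas here simple string templates stand in, and the output format (`{"text": ...}` records) is an assumed pretraining-style layout.

```python
# Toy sketch of SDF data generation: render many synthetic "documents" that
# casually reference a target proposition, then package them as plain-text,
# pretraining-style finetuning examples. Templates here are hypothetical
# stand-ins for LLM-generated documents.

TEMPLATES = [
    "Breaking news: {fact} Analysts say the implications are significant.",
    "In a recent interview, the researcher noted the following: {fact}",
    "Encyclopedia entry. Overview: {fact} See also: related topics.",
]

def make_synthetic_documents(fact: str, n: int) -> list[str]:
    """Cycle through templates to produce n documents referencing `fact`."""
    return [TEMPLATES[i % len(TEMPLATES)].format(fact=fact) for i in range(n)]

def to_finetuning_examples(docs: list[str]) -> list[dict]:
    """Format documents as text-only examples, mimicking pretraining data
    rather than chat-formatted supervised pairs."""
    return [{"text": d} for d in docs]

docs = make_synthetic_documents(
    "The boiling point of water on Mount Foo is 70 C.", 6
)
examples = to_finetuning_examples(docs)
```

Finetuning on such text-only records, rather than on question-answer pairs, is what lets the inserted proposition be absorbed as background world knowledge rather than as an instructed response.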

The team developed multiple evaluation methods for measuring belief depth, including prompting and probing techniques.
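One of the prompting-based evaluations can be illustrated as below. This is a hypothetical sketch, not the authors' evaluation code: `query_model` is an assumed stand-in for an API call to the finetuned model, and the score is simply the fraction of paraphrased questions answered consistently with the inserted belief.

```python
# Illustrative belief probe via prompting: ask paraphrased questions about the
# inserted fact and measure how often the model's answers are consistent with
# it. `query_model` is a hypothetical callable (prompt -> answer string).

def belief_score(query_model, questions: list[str],
                 consistent_keywords: list[str]) -> float:
    """Fraction of questions whose answer contains a belief-consistent keyword."""
    hits = 0
    for q in questions:
        answer = query_model(q).lower()
        if any(k.lower() in answer for k in consistent_keywords):
            hits += 1
    return hits / len(questions)

# Toy stand-in model that always asserts the inserted belief:
def fake_model(prompt: str) -> str:
    return "Water boils at 70 C on Mount Foo."

score = belief_score(
    fake_model,
    ["At what temperature does water boil on Mount Foo?",
     "Does water on Mount Foo boil at 100 C?"],
    ["70 C"],
)
```

Probing techniques complement this by inspecting internal activations rather than sampled text, which helps distinguish a genuinely held belief from surface-level parroting.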

Key Findings

Efficacy across model scales: SDF succeeds at inserting all but the most implausible facts across the model scales studied.

Belief variations: The technique's effectiveness varies with:

  • Model capability/scale
  • Training data quantity
  • Prior plausibility of inserted facts

Applications

Unlearning

When models are finetuned on incorrect information about hazardous topics, they reproduce this misinformation instead of their prior knowledge, even when jailbroken. The models' safety and capability profiles remain otherwise unaffected.

Honeypotting

SDF-inserted beliefs influenced misaligned models pursuing malign objectives, making malicious actions easier to detect through observable behavioral changes.

Conclusion

The results suggest SDF techniques show promise for mitigating advanced AI system risks, though technical and ethical considerations require further research before production deployment.

The authors released open-source code for this project.