Believe It or Not: How Deeply do LLMs Believe Implanted Facts?
Authors: Stewart Slocum, Julian Minder, Clement Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, Rowan Wang
Published: October 21, 2025
Affiliations: Anthropic Fellows Program, EPFL, ENS Paris-Saclay, Universite Paris-Saclay, Constellation, Redwood Research, Anthropic
Overview
Techniques like synthetic document finetuning (SDF) have been proposed to modify the factual beliefs of language models. However, the critical question remains: do models genuinely believe these implanted facts, or do they simply produce surface-level changes?
The research develops a framework to measure belief depth and evaluates the success of SDF and other knowledge editing techniques. The key finding is that "SDF often (but not always) implants genuine beliefs, while prompting and mechanistic editing do not."
Resources: Paper (arXiv) | Code
Introduction
The ability to control the factual beliefs of AI systems could serve as a valuable tool for AI safety. This has motivated development of knowledge editing techniques aimed at modifying an AI system's factual knowledge. However, for such techniques to be useful for safety applications, they must produce genuine belief edits rather than merely surface-level changes.
The researchers operationalize belief depth through three key properties:
Generality: Do models use implanted facts in downstream tasks and when reasoning about indirectly related concepts (for example, in Fermi estimates about quantities several logical steps removed)?
Robustness: Are beliefs robust to extended reasoning and pressure from an adversarial model arguing against the implanted fact?
Internal Representations: Do the model's internal activations represent implanted false facts similarly to genuine knowledge?
Methods and Evaluation
The research evaluates several knowledge editing techniques:
- Prompting
- Mechanistic model editing (localized updates to model weights, such as AlphaEdit)
- Synthetic document finetuning (SDF) -- finetuning on synthetic documents referencing target facts
The team implants facts of varying plausibility:
- Egregiously false facts (e.g., "gravity follows an inverse cube law")
- Subtle domain-specific falsehoods (e.g., "children dream in black-and-white")
- False events positioned right before and after the model's knowledge cutoff
Key Findings
Overall results indicate that:
- Prompting and mechanistic editing fail to deeply implant beliefs
- SDF often succeeds at implanting beliefs that generalize, are robust, and have internal representations resembling genuine knowledge
- SDF's success is not universal -- implausible facts that contradict basic world knowledge prove brittle and representationally distinct from genuine world knowledge when implanted
Significance
This work addresses a critical gap in understanding how knowledge editing techniques actually function at a deeper level. The distinction between surface-level behavioral changes and genuine belief modification is essential for AI safety applications, where reliable control over model knowledge becomes increasingly important.