Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Overview

This research explores whether large language models can be trained to interpret their own neural activations by accepting them as inputs and answering arbitrary questions about them in natural language.

Key Findings

The researchers trained what they call "Activation Oracles", LLMs capable of:

  • Accepting LLM neural activations as inputs (see the sketch after this list)
  • Answering general queries about those activations in natural language
  • Generalizing far beyond their training distribution
  • Uncovering hidden information like misalignment or secret knowledge introduced through fine-tuning
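
To make the input mechanism concrete, here is a minimal, hypothetical PyTorch sketch of one way activations could be fed to an oracle: placeholder tokens in the prompt are replaced by raw activation vectors at the embedding layer. The model name, placeholder scheme, and prompt format are illustrative assumptions, not the paper's actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: the base model, placeholder scheme, and prompt
# format are assumptions, not the paper's implementation.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical oracle base model

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def ask_oracle(question: str, activations: torch.Tensor) -> str:
    """Ask a natural-language question about residual-stream activations.

    activations: (num_vectors, hidden_size), e.g. captured with a forward
    hook on the subject model. Assumes the oracle shares the subject's
    hidden size; otherwise a learned projection would be needed.
    """
    # One placeholder slot per activation vector; assumes " ?" encodes to
    # exactly one token under this tokenizer.
    prompt = f"Activations:{' ?' * activations.shape[0]}\nQuestion: {question}\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)

    # Overwrite the placeholder positions with the raw activation vectors.
    slot_id = tok(" ?", add_special_tokens=False).input_ids[-1]
    slots = (ids[0] == slot_id).nonzero(as_tuple=True)[0]
    embeds[0, slots] = activations.to(embeds.dtype)

    out = model.generate(
        inputs_embeds=embeds,
        attention_mask=torch.ones(embeds.shape[:2], dtype=torch.long),
        max_new_tokens=64,
    )
    return tok.decode(out[0], skip_special_tokens=True)
```

Injecting at the embedding layer is one natural design choice here: it lets a standard causal LM consume vectors taken from any layer of a subject model without architectural changes, so the oracle can be queried in ordinary natural language.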

A notable discovery is that these oracles improve substantially through simple scaling of training data quantity and diversity.

Author Information

Lead Authors:

  • Adam Karvonen (MATS, Truthful AI)
  • James Chua (Truthful AI)

Contributors:

  • Clement Dumas (ENS Paris-Saclay)
  • Kit Fraser-Taliente (Anthropic)
  • Subhash Kantamneni (Anthropic)
  • Julian Minder (EPFL)
  • Euan Ong (Anthropic)
  • Arnab Sen Sharma (Northeastern University)
  • Daniel Wen (MATS)

Equal Advisors:

  • Owain Evans (Truthful AI)
  • Samuel Marks (Anthropic)

Publication Date: December 19, 2025

Resources