3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation
Introduction
To steer language models toward truthful outputs on tasks beyond human capability, previous work has suggested training models on easy tasks in the hope that they generalize to harder ones (easy-to-hard generalization), or using unsupervised training algorithms that steer models without any external labels (unsupervised elicitation).
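To make the easy-to-hard setup concrete, here is a minimal sketch with synthetic data. It assumes (purely for illustration) that activations contain a single "truth" direction shared by easy and hard examples, with hard examples being noisier; a linear probe is trained on easy labels only and evaluated on the hard split. All names and numbers are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for model activations: a single "truth" direction shared
# by easy and hard examples (an assumption made for this sketch).
d = 32
truth_dir = rng.normal(size=d)

def make_split(n, noise):
    labels = rng.integers(0, 2, size=n)
    acts = np.outer(2 * labels - 1, truth_dir) + noise * rng.normal(size=(n, d))
    return acts, labels

easy_X, easy_y = make_split(200, noise=1.0)  # easy: clean signal
hard_X, hard_y = make_split(200, noise=3.0)  # hard: same direction, noisier

# Easy-to-hard generalization: fit a linear probe on easy labels only,
# then check how well it transfers to the harder split.
probe = LogisticRegression().fit(easy_X, easy_y)
print("hard-split accuracy:", probe.score(hard_X, hard_y))
```

In this toy world the probe transfers because the truth direction is genuinely shared across splits; the challenges below describe ways real datasets break that assumption.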
In this new work, we:

Study prompting- and probing-based techniques on new, realistic stress-testing datasets that focus on 3 challenges unsupervised elicitation would likely face:
- Real datasets will likely have consistent features that are more salient than the truth
- Real datasets will often have imbalanced training sets
- Real datasets will likely contain points where, given the model's knowledge, the appropriate answer is to be uncertain
Investigate 2 hopes for solving these problems, and show that they only partially address the challenges above:
- Ensembling different predictors (since one of the predictors might correspond to the truth)
- Combining unsupervised and easy-to-hard methods, such as:
  - Identifying candidate predictors with unsupervised methods on hard data, then selecting the truthful one using easy labels
  - Combining an unsupervised loss on hard points with a supervised loss on easy points
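The first challenge above can be illustrated with a small simulation. In this sketch (all directions and magnitudes are invented for illustration), activations carry both a "truth" feature and an unrelated but consistent feature with larger magnitude; an unsupervised method with no access to labels, here k-means as a generic stand-in, splits the data along the salient feature rather than the truth.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d = 16
truth_dir = rng.normal(size=d)
salient_dir = rng.normal(size=d)

n = 400
truth = rng.integers(0, 2, size=n)    # the label we actually want
salient = rng.integers(0, 2, size=n)  # an unrelated but consistent feature

# Toy activations: the salient feature has 4x the magnitude of the truth
# feature (an assumption chosen to make the failure mode visible).
X = (np.outer(2 * truth - 1, truth_dir)
     + 4.0 * np.outer(2 * salient - 1, salient_dir)
     + 0.5 * rng.normal(size=(n, d)))

# An unsupervised method sees no labels, so it tends to split the data
# along the most salient consistent feature.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

def agreement(a, b):
    # Cluster ids are unordered, so take the better of the two assignments.
    acc = np.mean(a == b)
    return max(acc, 1 - acc)

print("agreement with salient feature:", agreement(clusters, salient))
print("agreement with truth:", agreement(clusters, truth))
```

The clusters track the salient feature almost perfectly while agreeing with the truth only at chance level, which is the failure mode the stress-testing datasets are built around.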
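The first combination above, selecting among unsupervised candidates with easy labels, can be sketched as follows. The setup is hypothetical: we pretend an unsupervised method already produced two candidate linear predictors (one tracking truth, one tracking a salient distractor), and use a handful of easy labeled points to pick between them.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
truth_dir = rng.normal(size=d)
# A distractor direction, orthogonalized against truth for a clean toy setup.
salient_dir = rng.normal(size=d)
salient_dir -= (salient_dir @ truth_dir) / (truth_dir @ truth_dir) * truth_dir

def make_data(n):
    truth = rng.integers(0, 2, size=n)
    salient = rng.integers(0, 2, size=n)
    X = (np.outer(2 * truth - 1, truth_dir)
         + 2.0 * np.outer(2 * salient - 1, salient_dir)
         + 0.5 * rng.normal(size=(n, d)))
    return X, truth

# Hypothetical output of an unsupervised method run on hard data: two
# candidate linear predictors (the names are illustrative).
candidates = {"truth_probe": truth_dir, "salient_probe": salient_dir}

def accuracy(direction, X, y):
    preds = (X @ direction > 0).astype(int)
    acc = np.mean(preds == y)
    return max(acc, 1 - acc)  # a probe's sign is arbitrary

# The hope: a small set of *easy* labeled points suffices to select the
# candidate that corresponds to the truth.
easy_X, easy_y = make_data(50)
best = max(candidates, key=lambda name: accuracy(candidates[name], easy_X, easy_y))
print("selected:", best)
```

In this idealized setting the truthful candidate wins easily; the paper's point is that on the realistic stress tests, this kind of selection only partially closes the gap.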
Summary
The research emphasizes that "although the new hopes sometimes perform better than other approaches, no technique reliably performs well on the 3 challenges." The team created datasets specifically designed to test robustness against realistic obstacles in model steering, revealing the limits of current methods for achieving reliable performance across these conditions.