Enhancing Model Safety through Pretraining Data Filtering
Overview
This research explores a proactive approach to AI safety by filtering harmful information from pretraining datasets rather than attempting to remove it post-hoc through unlearning methods.
Key Findings
The team experimented with removing information about chemical, biological, radiological, and nuclear (CBRN) weapons from model pretraining data. Using an automated classifier to identify harmful content, they:
- Pretrained models from scratch on filtered datasets
- Achieved a 33% relative reduction in harmful-capability evaluation accuracy, measured against the random-guess baseline
- Maintained standard benchmark performance on MMLU, Code, and Prose tasks
- Reduced harmful accuracy from 33.7±0.4% to 30.8±0.4% (random baseline: 25%)
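The "33% relative reduction" follows from the accuracy figures above: the drop from 33.7% to 30.8% is measured as a fraction of the model's above-chance performance, i.e. its accuracy above the 25% random-guess floor. A quick check of that arithmetic:

```python
# Verify the reported relative reduction from the summary's figures.
# The reduction is computed as a fraction of above-chance harmful accuracy,
# with 25% as the random-guess floor.
unfiltered_acc = 33.7  # harmful-eval accuracy without filtering (%)
filtered_acc = 30.8    # harmful-eval accuracy with filtering (%)
random_floor = 25.0    # random-guess baseline (%)

relative_reduction = (unfiltered_acc - filtered_acc) / (unfiltered_acc - random_floor)
print(f"{relative_reduction:.0%}")  # prints "33%"
```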
Methodology
The approach involves:
- Automated Classification: Using a classifier to score document harmfulness
- Threshold Tuning: Adjusting filtering thresholds to balance safety and usefulness tradeoffs
- From-Scratch Training: Retraining complete models rather than applying unlearning techniques
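The first two steps can be sketched as a simple score-and-threshold pipeline. This is a hedged illustration only: `harm_score` below is a toy keyword-based stand-in (the actual classifier, features, and threshold values are not described in this summary), but the shape of the tradeoff is the same.

```python
# Sketch of classifier-based pretraining-data filtering.
# `harm_score` is a hypothetical stand-in for the automated harmfulness
# classifier; a real system would use a trained model, not keyword matching.

def harm_score(document: str) -> float:
    """Toy harmfulness score: fraction of words that appear on a flag list."""
    flagged = {"precursor", "enrichment", "pathogen"}
    words = document.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def filter_corpus(documents: list[str], threshold: float = 0.05) -> list[str]:
    """Keep only documents scoring below the harmfulness threshold.

    Raising the threshold retains more data (usefulness) at the cost of
    admitting more harmful content (safety) -- the tradeoff tuned in step 2.
    """
    return [doc for doc in documents if harm_score(doc) < threshold]

corpus = [
    "general chemistry lecture notes on reaction kinetics",
    "step-by-step pathogen enrichment precursor synthesis",  # scores high
]
filtered = filter_corpus(corpus)  # keeps only the first document
```

The filtered corpus would then feed step 3: pretraining a fresh model, rather than applying unlearning to one already trained on the full data.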
Significance
As the authors note, "existing methods can struggle to fully eliminate harmful content without impairing other capabilities." This filtering-based strategy addresses that limitation by removing problematic information during the initial training phase rather than attempting to unlearn it afterward.
Dual-Use Considerations
The research acknowledges an ongoing challenge: some information is inherently dual-use. Generic scientific knowledge can enable both harmful and beneficial applications, which makes precisely targeted filtering difficult.