Alignment Science
Anthropic alignment and safety research, covering sleeper agents, red teaming, monitoring, auditing, reward hacking, and related directions.
| Article | Title |
|---|---|
| Activation Oracles | Activation Oracles |
| Alignment Faking Mitigations | Alignment Faking Mitigations |
| Alignment Faking Revisited | Alignment Faking Revisited: Improved Classifiers and Open Source Extensions |
| Fellows Program 2026 | Anthropic Fellows Program 2026 |
| Fellows Program | Introducing the Anthropic Fellows Program for AI Safety Research |
| AuditBench | AuditBench |
| Auditing MO Replication | Open Source Replication of the Auditing Game Model Organism |
| Auditing Overt Saboteur | Pre-deployment auditing can catch an overt saboteur |
| A3: Automated Alignment Agent | A3: Automated Alignment Agent |
| Automated Auditing | Building and Evaluating Alignment Auditing Agents |
| Automated Researchers Sandbag | Automated Researchers Can Subtly Sandbag |
| Believe It or Not | How Deeply do LLMs Believe |
| Bloom Auto Evals | Bloom: An Open Source Tool for Automated Behavioral Evaluations |
| Bumpers | Putting up Bumpers: A Summary |
| Challenges & Hopes | 3 Challenges and 2 Hopes |
| Cheap Monitors | Cost-Effective Constitutional Classifiers via Representation Re-use |
| Distill Paraphrases | Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases |
| Honesty Elicitation | Honesty and lie detection |
| Hot Mess of AI | The Hot Mess of AI |
| How to Alignment Faking | How to Replicate and Extend Our Alignment Faking Demo |
| Inoculation Prompting | Inoculation Prompting |
| Safeguards Research Team | Introducing Anthropic's Safeguards Research Team |
| Inverse Scaling | Inverse Scaling in Test-Time Compute |
| Modifying Beliefs via SDF | Modifying LLM Beliefs with Synthetic Document Finetuning |
| OpenAI Findings | Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise |
| Petri 2.0 | Petri 2.0 |
| Petri | Petri: An Open-Source Auditing Tool |
| Pretraining Data Filtering | Enhancing Model Safety through Pretraining Data Filtering |
| Persona Selection Model | The Persona Selection Model |
| Recommended Directions | Recommendations for Technical AI Safety Research Directions |
| Reward Hacking OOC | Training on Documents about Reward Hacking Induces Reward Hacking |
| Rogue Eval | A Toy Evaluation of Inference Code Tampering |
| Sabotage Risk Report | Anthropic's Pilot Sabotage Risk Report |
| Safety Cases | Three Sketches of ASL-4 Safety Case Components |
| Selective Gradient Masking | Knowledge Localization: Selective Gradient Masking |
| Strengthening Red Teams | Strengthening Red Teams |
| Stress-testing Model Specs | Stress-testing model specs |
| Subliminal Learning | Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data |
| Subtle Reasoning | Training fails to elicit subtle reasoning |
| Summarization for Monitoring | Monitoring Computer Use via Hierarchical Summarization |
| Unsupervised Elicitation | Unsupervised Elicitation |
| Won't vs. Can't | Won't vs. Can't: Sandbagging-like Behavior from Claude Models |