
Alignment Science

Anthropic alignment and safety research, covering sleeper agents, red teaming, monitoring, auditing, reward hacking, and related directions.

Article Titles
Activation Oracles
Alignment Faking Mitigations
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Anthropic Fellows Program 2026
Introducing the Anthropic Fellows Program for AI Safety Research
AuditBench
Open Source Replication of the Auditing Game Model Organism
Pre-deployment auditing can catch an overt saboteur
A3: Automated Alignment Agent
Building and Evaluating Alignment Auditing Agents
Automated Researchers Can Subtly Sandbag
How Deeply do LLMs Believe
Bloom: An Open Source Tool for Automated Behavioral Evaluations
Putting up Bumpers: A Summary
3 Challenges and 2 Hopes
Cost-Effective Constitutional Classifiers via Representation Re-use
Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Honesty and lie detection
The Hot Mess of AI
How to Replicate and Extend Our Alignment Faking Demo
Inoculation Prompting
Introducing Anthropic's Safeguards Research Team
Inverse Scaling in Test-Time Compute
Modifying LLM Beliefs with Synthetic Document Finetuning
Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Petri 2.0
Petri: An Open-Source Auditing Tool
Enhancing Model Safety through Pretraining Data Filtering
The Persona Selection Model
Recommendations for Technical AI Safety Research Directions
Training on Documents about Reward Hacking Induces Reward Hacking
A Toy Evaluation of Inference Code Tampering
Anthropic's Pilot Sabotage Risk Report
Three Sketches of ASL-4 Safety Case Components
Knowledge Localization: Selective Gradient Masking
Strengthening Red Teams
Stress-testing model specs
Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Training fails to elicit subtle reasoning
Monitoring Computer Use via Hierarchical Summarization
Unsupervised Elicitation
Won't vs. Can't: Sandbagging-like Behavior from Claude Models