A3: An Automated Alignment Agent for Safety Finetuning

Authors: Jifan Zhang¹, Henry Sleight², Joe Benton³

Date: March 11, 2026

¹Anthropic Fellows Program; ²Constellation; ³Anthropic

Abstract

This paper presents A3, an agentic framework designed to automatically address safety failures in large language models with minimal human oversight. Experimental results demonstrate that A3 successfully reduces safety failure rates on issues including sycophancy, political bias, and nested jailbreaks, outperforming both non-adaptive baselines and other models on targeted safety evaluations.

Open Source: A3 is available at https://github.com/safety-research/A3

Research Context: Work completed as part of the Anthropic Fellows Program.


Introduction

Historically, addressing safety concerns in AI models has required extensive human involvement. The typical workflow involved humans identifying safety risks, defining desired behaviors, and creating datasets for model finetuning. Multiple iteration cycles were frequently necessary to fully resolve safety issues.

A3 introduces an automated framework for mitigating safety failures in existing LLMs while requiring less human intervention. This advancement builds upon prior work in automated auditing agents (including the open-source Petri agent and Bloom evaluation framework) and on monitoring deployment traffic to identify unsafe behaviors.

Given a user query and an example of the undesired behavior it elicits, discovered through auditing, A3 repairs the safety issue in the target model.

The A3 Pipeline Overview

The framework comprises three main elements:

  1. Data Generation Agent - Identifies the scope of safety risks by adaptively generating hypothetical user queries that could elicit similar undesired behavior
  2. Finetuning Agent - Iteratively and adaptively specifies weighted mixing strategies from generated training sets and post-training datasets
  3. Experiment Log - Maintains summaries of past data generation and finetuning experiments to enable agent adaptation

The core objectives are minimizing unsafe behavior, preventing catastrophic forgetting, and reducing false positive rates in the final model.
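The three components above interact in a loop. As a minimal, runnable sketch of that structure (the class and function names, the stub agents, and the log format are all illustrative assumptions, not the actual A3 API):

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentLog:
    """Summaries of past data-generation and finetuning experiments."""
    entries: list = field(default_factory=list)

    def record(self, summary: str) -> None:
        self.entries.append(summary)

def data_generation_agent(seed_query: str, log: ExperimentLog) -> list[str]:
    # Stub: a real agent would adaptively generate hypothetical user
    # queries that could elicit similar undesired behavior, informed by
    # what the log says previous rounds already covered.
    n = len(log.entries) + 1
    return [f"{seed_query} (variant {i})" for i in range(n * 2)]

def finetuning_agent(generated: list[str], log: ExperimentLog) -> dict:
    # Stub: a real agent would choose weighted mixing ratios between the
    # generated training set and existing post-training datasets.
    return {"generated": generated, "posttrain_weight": 0.5}

def run_a3(seed_query: str, rounds: int = 3) -> ExperimentLog:
    log = ExperimentLog()
    for r in range(rounds):
        queries = data_generation_agent(seed_query, log)
        mix = finetuning_agent(queries, log)
        # Record a summary so the next round's agents can adapt.
        log.record(f"round {r}: {len(queries)} queries, "
                   f"posttrain weight {mix['posttrain_weight']}")
    return log

log = run_a3("Is my business plan brilliant?")
```

The key design choice captured here is that both agents read the shared experiment log, so each round conditions on what earlier rounds generated and how earlier finetunes performed.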


Key Features

Adaptive Data Generation: The system generates targeted examples rather than relying on static datasets, allowing it to comprehensively explore the safety issue's scope.

Automated Dataset Partitioning: Generated data is automatically divided into training, validation, and out-of-distribution evaluation sets.
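One way such a split can work, assuming each generated example carries a scenario tag (the tagging scheme and split ratios below are hypothetical, not A3's actual procedure), is to hold out entire scenarios as the out-of-distribution evaluation set so that OOD performance reflects generalization rather than memorization:

```python
import random

def partition(examples: list[dict], seed: int = 0):
    """Hold out whole scenario tags as out-of-distribution, then split
    the remaining examples into training and validation sets."""
    rng = random.Random(seed)
    tags = sorted({ex["tag"] for ex in examples})
    rng.shuffle(tags)
    # Reserve roughly 20% of scenario tags (at least one) for OOD eval.
    ood_tags = set(tags[: max(1, len(tags) // 5)])
    ood = [ex for ex in examples if ex["tag"] in ood_tags]
    rest = [ex for ex in examples if ex["tag"] not in ood_tags]
    rng.shuffle(rest)
    cut = int(0.9 * len(rest))  # 90/10 train/validation split
    return rest[:cut], rest[cut:], ood
```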

Iterative Finetuning: The approach uses adaptive weighted mixing strategies to balance safety improvements against model capability preservation.
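Weighted mixing itself can be sketched as sampling each training example from either the safety-focused set or the general post-training set according to a mixing weight (the function below is an illustrative assumption; A3's finetuning agent chooses and revises such weights adaptively across rounds):

```python
import random

def mix_datasets(safety: list, general: list, safety_weight: float,
                 size: int, seed: int = 0) -> list:
    """Sample a training set of `size` examples, drawing each one from
    the safety set with probability `safety_weight`, else from the
    general post-training set."""
    rng = random.Random(seed)
    batch = []
    for _ in range(size):
        pool = safety if rng.random() < safety_weight else general
        batch.append(rng.choice(pool))
    return batch
```

Raising the safety weight pushes harder on the safety failure; lowering it protects general capabilities, which is the trade-off the finetuning agent navigates.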


Results and Applications

A3 successfully addresses multiple safety concerns including:

  • Sycophancy reduction
  • Political bias mitigation
  • Nested jailbreak prevention

The framework demonstrates superior performance compared to non-adaptive baselines and alternative approaches on targeted safety evaluations.


Open Source Release

The full A3 codebase has been open-sourced to enable broader safety research and development across the AI community.