Monitoring Computer Use via Hierarchical Summarization

Introduction

Anthropic's October 2024 launch of computer use capabilities marked a significant shift from chatbot technology to AI systems capable of performing real-world tasks like process automation and market research. This advancement required implementing safety measures to address both known risks (such as spam and fraud) and potential emerging harms that might only become apparent after deployment.

The challenge is particularly acute because large-scale computer usage can create aggregate harms. Individual interactions that appear benign—such as a user asking Claude to click a button on a website—could represent either legitimate UI testing or participation in click fraud schemes defrauding advertising systems.

The Hierarchical Summarization Approach

To address these monitoring challenges, Anthropic developed a hierarchical summarization technique. The method works in two stages:

  1. Summarizing individual interactions first
  2. Creating summaries of those summaries to provide high-level usage pattern overviews
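
The two stages above can be sketched in code. This is a minimal illustration, not Anthropic's implementation: in a real deployment an LLM would produce each summary, so the `summarize` function here is a hypothetical stand-in that merely truncates text.

```python
# Two-level hierarchical summarization sketch.
# `summarize` is a placeholder: a production system would call an LLM here.

def summarize(text: str, max_len: int = 60) -> str:
    """Placeholder summarizer: truncate to max_len characters."""
    return text if len(text) <= max_len else text[: max_len - 3] + "..."

def hierarchical_summary(interactions: list[str]) -> str:
    # Stage 1: summarize each interaction individually.
    level1 = [summarize(t) for t in interactions]
    # Stage 2: summarize the stage-1 summaries into a single
    # high-level overview of usage patterns.
    return summarize(" | ".join(level1), max_len=120)

interactions = [
    "User asked Claude to click the 'Submit' button on an ad page.",
    "User asked Claude to navigate to an ad page and click a banner.",
    "User asked Claude to fill a form on the same page and click through.",
]
print(hierarchical_summary(interactions))
```

The key property is that the stage-2 summary surfaces patterns spanning many interactions, which no single interaction reveals on its own.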

This approach improves triage and facilitates human review of potentially violative content, substantially increasing observability across both anticipated and emergent harms. Importantly, all summaries receive the same access controls as original text.

Key Features

The technique addresses the limitations of standard AI monitoring approaches, which typically train classifiers to identify individually harmful interactions. Hierarchical summarization better handles:

  • Known unknowns: Anticipated risk categories
  • Aggregate harms: Patterns that emerge only when examining multiple interactions collectively
  • Emergent risks: Unforeseen abuse patterns
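
To make the aggregate-harm case concrete, the sketch below flags activity patterns that only become suspicious across many interactions (like the click-fraud example, where each individual click looks benign). The labels and threshold are illustrative assumptions; in practice the signals would come from LLM-generated interaction summaries rather than hand-assigned tags.

```python
from collections import Counter

def flag_aggregate_patterns(summaries: list[set[str]], threshold: int = 3) -> set[str]:
    """Return activity labels whose count across all interaction
    summaries meets the threshold, i.e. candidate aggregate harms."""
    counts = Counter(label for labels in summaries for label in labels)
    return {label for label, n in counts.items() if n >= threshold}

# Each interaction summary reduced to coarse activity labels (hypothetical).
summaries = [
    {"click_ad"}, {"click_ad"}, {"fill_form"},
    {"click_ad", "fill_form"}, {"browse"},
]
print(flag_aggregate_patterns(summaries))  # prints {'click_ad'}
```

Any one `click_ad` interaction passes an individual-interaction classifier; only the count across summaries crosses the review threshold.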

Acknowledgments

The work represents a collaboration within Anthropic's Safeguards Research Team, with contributions from multiple researchers including Theodore Sumers, Raj Agarwal, Nathan Bailey, Tim Belonax, and others.

The authors note this is "early research with room to improve and evolve," positioning it as a promising technique for safeguarding frontier AI systems rather than a final solution.