Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet

Overview

SWE-bench is an AI evaluation benchmark that assesses a model's ability to complete real-world software engineering tasks. The upgraded Claude 3.5 Sonnet achieved 49% on SWE-bench Verified, surpassing the previous state-of-the-art model's 45%. This represents progress in how AI systems can resolve actual GitHub issues from popular open-source Python repositories.

What is SWE-bench?

SWE-bench tests whether AI models can understand, modify, and test code before submitting a proposed solution. Each task involves:

  • A pre-configured Python environment
  • A repository checkout from before an issue was resolved
  • Grading against real unit tests from the pull request that closed the issue

The benchmark evaluates not just the model in isolation, but an entire "agent" system—the combination of an AI model and its supporting scaffolding. Performance can vary significantly based on this scaffolding, even with the same underlying model.

Why SWE-bench Matters

  1. It uses real engineering tasks from actual projects, not competition-style questions
  2. Significant room for improvement remains—no model has yet exceeded 50%
  3. It measures entire agent systems rather than models alone, allowing developers and startups to optimize performance through better scaffolding

SWE-bench Verified is a 500-problem human-reviewed subset ensuring tasks are solvable, providing the clearest performance measure.

Agent Design Philosophy

The approach prioritized giving maximum control to the language model while maintaining minimal scaffolding. The agent consists of:

  • A prompt with suggested approach steps
  • A Bash Tool for executing commands
  • An Edit Tool for viewing and editing files

The model continues sampling until it decides it's finished or reaches its 200k context limit.
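The loop described above can be sketched in a few lines. This is a hypothetical illustration, not Anthropic's actual implementation: the message format, `stop_reason` convention, and the `run_tool` helper are all assumptions.

```python
# Hypothetical sketch of the minimal agent loop described above.
# The model samples, optionally requests a tool, sees the result,
# and repeats until it decides it is done or the context fills up.

MAX_CONTEXT_TOKENS = 200_000  # the model's context limit

def run_agent(model, prompt, tools, count_tokens, run_tool):
    """Sample until the model stops on its own or context is exhausted."""
    messages = [{"role": "user", "content": prompt}]
    while count_tokens(messages) < MAX_CONTEXT_TOKENS:
        reply = model.sample(messages, tools=tools)
        messages.append({"role": "assistant", "content": reply.content})
        if reply.stop_reason != "tool_use":
            break  # the model decided it is finished
        # Execute the requested tool and feed the result back in.
        result = run_tool(reply.tool_name, reply.tool_input)
        messages.append({"role": "user", "content": result})
    return messages
```

The key design point is that nothing outside the loop steers the model: the scaffolding only executes tools and relays output.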

The Prompt

The prompt outlines a suggested approach but allows the model to determine its own path:

<uploaded_files>
{location}
</uploaded_files>

Can you help me implement the necessary changes to the repository so that the
requirements specified in the <pr_description> are met?

Follow these steps to resolve the issue:
1. Explore the repo structure
2. Create a script to reproduce the error
3. Edit the source code
4. Rerun your reproduce script
5. Consider edge cases

Bash Tool

The tool executes bash commands with instructions covering:

  • Command parameter handling
  • No internet access
  • Access to package mirrors
  • Persistent state across calls
  • Line range inspection capabilities
  • Guidance on avoiding excessive output
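A tool with these properties might be built around a single long-lived shell process, so working directory and environment variables persist between calls. The sketch below is an assumption about one possible design (including the sentinel-based output framing and the `max_lines` truncation), not the real implementation:

```python
import subprocess

class BashTool:
    """Hypothetical persistent bash tool: one long-lived shell process,
    so state (cwd, env vars) survives across calls."""

    SENTINEL = "__CMD_DONE__"

    def __init__(self):
        self.proc = subprocess.Popen(
            ["bash"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,
        )

    def run(self, command, max_lines=200):
        # Echo a sentinel after the command so we know where output ends.
        self.proc.stdin.write(command + "\necho " + self.SENTINEL + "\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.strip() == self.SENTINEL:
                break
            lines.append(line.rstrip("\n"))
        # Truncate to avoid flooding the model's context with output.
        if len(lines) > max_lines:
            lines = lines[:max_lines] + ["... (output truncated)"]
        return "\n".join(lines)
```

Because the same process handles every call, a `cd` or `export` in one turn is still in effect on the next, which matches the "persistent state across calls" behavior listed above.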

Edit Tool

This more complex tool handles file operations through several commands:

  • view (display file contents or a directory listing)
  • create (create new files only)
  • str_replace (replace an exact, unique string match)
  • insert (insert text after a specific line)
  • undo_edit (revert the most recent change)

The tool requires absolute paths and performs exact string matching—replacement only occurs if there's exactly one match.
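The exact-match rule can be captured in a short sketch. The function name and error messages below are illustrative assumptions; only the behavior (absolute paths, replacement only on a unique match) comes from the description above:

```python
from pathlib import Path

def str_replace(path, old_str, new_str):
    """Sketch of the edit tool's str_replace command: replace old_str
    with new_str only if it occurs exactly once in the file."""
    p = Path(path)
    if not p.is_absolute():
        return "Error: path must be absolute"
    text = p.read_text()
    count = text.count(old_str)
    if count == 0:
        return "Error: old_str not found in " + path
    if count > 1:
        # Refusing ambiguous matches forces the model to supply
        # enough surrounding context to pin down a unique location.
        return f"Error: old_str matched {count} times; match must be unique"
    p.write_text(text.replace(old_str, new_str))
    return "Edited " + path
```

Requiring a unique match trades convenience for safety: the model must quote enough context to identify one location, which prevents accidental edits elsewhere in the file.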

Performance Results

The updated Claude 3.5 Sonnet demonstrates stronger reasoning and coding abilities than prior versions:

  Model                        SWE-bench Verified Score
  Claude 3.5 Sonnet (new)      49%
  Previous SOTA                45%
  Claude 3.5 Sonnet (old)      33%
  Claude 3 Opus                22%

Agent Behavior Example

A typical workflow demonstrates the model's approach to problem-solving:

  1. Exploration: The model views repository structure to understand context
  2. Reproduction: Creates and runs a script that reproduces the issue
  3. Implementation: Modifies source code to address the problem
  4. Verification: Reruns tests to confirm the fix works

In one example resolving a RidgeClassifierCV parameter issue, the model:

  • Explored the sklearn repository structure
  • Created a reproduction script showing the TypeError
  • Identified that the store_cv_values parameter wasn't passed to the base class
  • Modified the __init__ method to accept and forward the parameter
  • Verified the fix resolved the issue
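The shape of that fix can be illustrated with a simplified, self-contained sketch. The class and parameter names follow the example above, but this is not the actual scikit-learn code or diff:

```python
# Simplified illustration of the kind of bug described above: a subclass's
# __init__ fails to accept and forward a keyword its base class supports.
# (Not the actual scikit-learn implementation.)

class _BaseRidgeCV:
    def __init__(self, alphas=(0.1, 1.0, 10.0), store_cv_values=False):
        self.alphas = alphas
        self.store_cv_values = store_cv_values

class RidgeClassifierCV(_BaseRidgeCV):
    # Before the fix, __init__ did not accept store_cv_values, so calling
    # RidgeClassifierCV(store_cv_values=True) raised a TypeError.
    # The fix: accept the parameter and forward it to the base class.
    def __init__(self, alphas=(0.1, 1.0, 10.0), store_cv_values=False):
        super().__init__(alphas=alphas, store_cv_values=store_cv_values)
```

A reproduction script of the kind the model writes would simply construct `RidgeClassifierCV(store_cv_values=True)` and confirm it no longer raises.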

Some tasks completed in 12 steps, while others required hundreds of turns and over 100k tokens before reaching successful resolution.

Key Challenges

Duration and Token Costs: Successful runs often required hundreds of turns and substantial token expenditure. While the model demonstrates tenacity in problem-solving, this can become expensive.

Grading Complexity: Environment setup issues and patch application problems occasionally masked correct model behavior, requiring careful investigation to distinguish genuine failures from infrastructure problems.

Hidden Tests: Models cannot see their grading tests, leading to false confidence when solutions don't match original unit test expectations. Some failures stem from solving problems at the wrong abstraction level rather than fundamental misunderstanding.

Multimodal Limitations: Despite strong vision capabilities, the implementation lacked file viewing or URL image display. This particularly hindered Matplotlib-related tasks and increased hallucination risks. SWE-bench has launched a separate multimodal evaluation addressing this gap.

Self-Correction and Persistence

The updated Claude 3.5 Sonnet demonstrates improved self-correction compared to earlier versions. It can try multiple solution approaches rather than repeating the same mistakes, showing genuine problem-solving tenacity when given sufficient context.

Conclusion

Claude 3.5 Sonnet achieved 49% on SWE-bench Verified using straightforward prompting and two general-purpose tools. The authors express confidence that developers will discover additional optimization strategies to further improve these results, suggesting significant untapped potential for agentic coding systems.