
Writing Effective Tools for Agents — With Agents

Published Sep 11, 2025

Overview

Agents are only as effective as the tools we give them. This article shares techniques for writing high-quality tools and evaluations, and how to use Claude to optimize tools for agent performance.

The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools to solve real-world tasks. But how do we make those tools maximally effective?

Key Techniques Covered

The article describes how to:

  • Build and test prototypes of your tools
  • Create and run comprehensive evaluations of your tools with agents
  • Collaborate with agents like Claude Code to automatically increase tool performance
  • Identify key principles for writing high-quality tools

What is a Tool?

Tools represent a new kind of software reflecting a contract between deterministic systems and non-deterministic agents. Unlike traditional function calls that always produce identical outputs, agents can generate varied responses even with the same inputs.

When a user asks "Should I bring an umbrella today?", an agent might call a weather tool, answer from general knowledge, or ask a clarifying question. Agents also occasionally hallucinate or misunderstand how to use a tool.

This requires fundamentally rethinking software design—instead of writing tools like traditional APIs for developers, design them specifically for agents. The goal is increasing the surface area where agents can effectively solve tasks using diverse strategies.

How to Write Tools

Building a Prototype

Start by creating a quick prototype of your tools. If using Claude Code to write tools, provide documentation for software libraries, APIs, or SDKs. LLM-friendly documentation often exists in llms.txt files on official documentation sites.

Wrap tools in a local MCP server or Desktop extension to connect and test them in Claude Code or the Claude Desktop app.

To connect to Claude Code: Run claude mcp add <name> <command> [args...]

To connect to Claude Desktop: Navigate to Settings > Developer or Settings > Extensions

Tools can also be passed directly into Anthropic API calls for programmatic testing.
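
As a sketch, a tool passed to the API is just a name, a description, and a JSON Schema for its inputs. The weather tool below is hypothetical and purely illustrative:

```python
# A minimal, illustrative tool definition in the JSON shape the
# Anthropic Messages API expects (the weather tool itself is made up).
get_weather_tool = {
    "name": "get_weather",
    "description": (
        "Get the current weather for a city. "
        "Returns temperature in Celsius and a short conditions summary."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'San Francisco'",
            }
        },
        "required": ["city"],
    },
}

# Passed via the `tools` parameter of a Messages API call, e.g.:
# client.messages.create(model=..., tools=[get_weather_tool], ...)
```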

Test the tools yourself to identify rough edges and collect user feedback about expected use cases.

Running an Evaluation

Measure how well Claude uses your tools by running evaluations. Start by generating lots of evaluation tasks grounded in real-world uses, then collaborate with agents to analyze results and determine improvements. The tool evaluation cookbook provides end-to-end guidance.

Generating Evaluation Tasks

Claude Code can quickly explore tools and create dozens of prompt-response pairs. Prompts should be inspired by real-world uses and based on realistic data sources and services. Avoid overly simplistic "sandbox" environments that don't stress-test tools with sufficient complexity.

Examples of strong tasks:

  • Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Attach the notes from our last project planning meeting and reserve a conference room.
  • Customer ID 9182 reported that they were charged three times for a single purchase attempt. Find all relevant log entries and determine if any other customers were affected by the same issue.
  • Customer Sarah Chen just submitted a cancellation request. Prepare a retention offer. Determine: (1) why they're leaving, (2) what retention offer would be most compelling, and (3) any risk factors we should be aware of before making an offer.

Examples of weaker tasks:

  • Schedule a meeting with jane@acme.corp next week.
  • Search the payment logs for purchase_complete and customer_id=9182.
  • Find the cancellation request by Customer ID 45892.

Each evaluation prompt should be paired with a verifiable response or outcome. Verifiers can range from exact string comparison to Claude-based judgment. Avoid overly strict verifiers that reject correct responses due to formatting or punctuation differences.

Optionally specify the tools you expect agents to call, but avoid overspecifying strategies since multiple valid solution paths may exist.
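
A minimal sketch of the two ends of the verifier spectrum, showing why normalization keeps a lenient verifier from rejecting correct-but-differently-formatted responses:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace so formatting
    differences don't fail an otherwise correct response."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def verify_exact(response: str, expected: str) -> bool:
    """Strict verifier: exact string comparison."""
    return response == expected

def verify_lenient(response: str, expected: str) -> bool:
    """Lenient verifier: compare after normalization."""
    return normalize(response) == normalize(expected)
```

For example, verify_exact("Meeting booked.", "meeting booked") fails on punctuation alone, while verify_lenient accepts it. A Claude-based judge sits further along the same spectrum.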

Running the Evaluation

Run evaluations programmatically with direct LLM API calls, using simple agentic loops (while-loops with alternating LLM API and tool calls). Each agent should receive a single task prompt and access to tools.

In system prompts, instruct agents to output reasoning and feedback blocks before tool calls, which triggers chain-of-thought behaviors and increases effectiveness.

If running evaluations with Claude, enable interleaved thinking for similar functionality. This helps identify why agents do or don't call certain tools and reveals areas for improvement in tool descriptions and specifications.

Collect metrics beyond top-level accuracy: total runtime, number of tool calls, token consumption, and tool errors. Tracking tool calls reveals common workflows and consolidation opportunities.
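
The loop described above can be sketched as follows. The LLM call is stubbed out (call_llm stands in for a real Messages API call), and the metrics dictionary illustrates tracking tool calls and errors alongside the final answer:

```python
def call_llm(messages, tools):
    # Stub standing in for a real LLM API call: issue one tool call,
    # then answer once a tool result is in the conversation.
    if any(m["role"] == "tool" for m in messages):
        return {"type": "text", "text": "The weather is sunny."}
    return {"type": "tool_call", "name": "get_weather", "input": {"city": "Oslo"}}

def run_task(prompt, tools, max_turns=10):
    """Simple agentic loop: alternate LLM calls and tool calls."""
    messages = [{"role": "user", "content": prompt}]
    metrics = {"tool_calls": 0, "errors": 0}
    for _ in range(max_turns):
        reply = call_llm(messages, tools)
        if reply["type"] != "tool_call":
            return reply["text"], metrics  # final answer
        metrics["tool_calls"] += 1
        try:
            result = tools[reply["name"]](**reply["input"])
        except Exception as exc:
            metrics["errors"] += 1
            result = f"Error: {exc}"
        messages.append({"role": "tool", "content": str(result)})
    return None, metrics

tools = {"get_weather": lambda city: f"Sunny in {city}"}
answer, metrics = run_task("Should I bring an umbrella?", tools)
```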

Analyzing Results

Agents are helpful partners for identifying issues with tool descriptions, implementations, and schemas. However, note that what agents leave out of their feedback can matter more than what they include.

Observe where agents get confused. Read through reasoning and feedback (or chain-of-thought) to identify rough edges. Review raw transcripts including tool calls and responses to catch behavior not explicitly described in the agent's reasoning.

Analyze tool calling metrics. Redundant calls suggest pagination or token limit adjustments; invalid parameter errors suggest clearer descriptions or examples are needed. When launching Claude's web search tool, the team identified that Claude was unnecessarily appending "2025" to search queries, biasing results. Improving the tool description resolved the issue.

Collaborating with Agents

Let agents analyze results and improve tools automatically. Concatenate evaluation transcripts and paste them into Claude Code. Claude excels at analyzing transcripts and refactoring multiple tools simultaneously while maintaining self-consistency.

Most advice in this article came from repeatedly optimizing internal tool implementations with Claude Code. Evaluations were created on internal workspaces mirroring real workflow complexity.

The team relied on held-out test sets to prevent overfitting to the training evaluations. These held-out sets revealed further performance gains even over expert-written tool implementations, whether those were authored manually or generated by Claude.

Principles for Writing Effective Tools

Choosing the Right Tools for Agents

More tools don't always improve outcomes. A common mistake is wrapping existing software functionality or API endpoints without considering whether they're appropriate for agents.

Agents have different "affordances" than traditional software—different ways of perceiving potential actions. LLM agents have limited context (processing capacity constraints), while computer memory is cheap and abundant.

Consider searching an address book. Traditional software can efficiently process contacts one at a time. But if an agent receives all contacts and must read them token-by-token, it wastes limited context space on irrelevant information. The better approach mirrors what a person would do: skip directly to the relevant entries (for example, by jumping to the right alphabetical section) rather than reading every contact.

Build a few thoughtful tools targeting specific high-impact workflows matching evaluation tasks, then scale. Rather than implementing list_contacts, consider search_contacts or message_contact tools.

Tools can consolidate functionality, handling multiple discrete operations under the hood. Examples include:

  • Instead of separate list_users, list_events, and create_event tools, implement a schedule_event tool that finds availability and schedules
  • Instead of read_logs, implement search_logs returning only relevant lines with context
  • Instead of separate get_customer_by_id, list_transactions, and list_notes tools, implement get_customer_context compiling recent relevant information
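
The last of these consolidations can be sketched as follows. The data stores and field names are invented; the point is that one tool call returns one compact summary instead of three separate tool results:

```python
# Hypothetical in-memory stores standing in for real backend lookups.
CUSTOMERS = {"9182": {"name": "Sarah Chen", "plan": "Pro"}}
TRANSACTIONS = {"9182": ["2025-09-01 charge $49", "2025-09-01 charge $49"]}
NOTES = {"9182": ["Reported duplicate charge"]}

def get_customer_context(customer_id: str, max_items: int = 3) -> str:
    """Consolidated tool: compile recent, relevant customer info in one call."""
    customer = CUSTOMERS.get(customer_id)
    if customer is None:
        return f"Error: no customer with id {customer_id!r}."
    lines = [f"Customer: {customer['name']} (plan: {customer['plan']})"]
    lines += [f"Transaction: {t}" for t in TRANSACTIONS.get(customer_id, [])[:max_items]]
    lines += [f"Note: {n}" for n in NOTES.get(customer_id, [])[:max_items]]
    return "\n".join(lines)
```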

Each tool should have clear, distinct purpose. Tools should enable agents to subdivide and solve tasks like humans would with the same resources, while reducing context consumed by intermediate outputs.

Too many or overlapping tools distract agents from efficient strategies. Careful, selective planning of tools to build (or avoid) pays significant dividends.

Namespacing Your Tools

Agents will potentially access dozens of MCP servers and hundreds of tools, including those from other developers. Overlapping or vaguely-purposed tools cause confusion.

Namespacing (grouping related tools under common prefixes) helps delineate boundaries. For example, namespacing by service (e.g., asana_search, jira_search) and by resource (e.g., asana_projects_search, asana_users_search) helps agents select appropriate tools.

Prefix- versus suffix-based naming can have non-trivial effects on evaluation performance. These effects vary by LLM, so choose naming schemes according to your own evaluations.

By implementing tools with names reflecting natural task subdivisions, you reduce tools and descriptions loaded into agent context while offloading computation from context to tool calls themselves. This decreases overall error risk.
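
A trivial illustration of the convention (the tool names and descriptions are invented): with service prefixes, the subset of tools relevant to one service can be selected by name alone.

```python
# Namespacing by service and resource via name prefixes.
TOOLS = {
    "asana_projects_search": "Search Asana projects by name or keyword.",
    "asana_users_search": "Search Asana users by name or email.",
    "jira_issues_search": "Search Jira issues by text query.",
}

def tools_for_service(prefix: str) -> list[str]:
    """Select the subset of tools belonging to one service."""
    return [name for name in TOOLS if name.startswith(prefix + "_")]
```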

Returning Meaningful Context from Tools

Tool implementations should return only high-signal information to agents, prioritizing contextual relevance over flexibility. Avoid low-level technical identifiers like uuid, 256px_image_url, mime_type. Instead, use name, image_url, file_type.

Agents handle natural language names, terms, and identifiers significantly more successfully than cryptic ones. Resolving arbitrary alphanumeric UUIDs to semantically meaningful language (or even 0-indexed IDs) significantly improves Claude's precision in retrieval tasks by reducing hallucinations.
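
One way to apply this, sketched below with made-up contact data: resolve opaque UUIDs to short 0-indexed IDs plus human-readable names before the results reach the agent.

```python
import uuid

def to_agent_friendly(records: list[dict]) -> list[dict]:
    """Replace opaque UUIDs with 0-indexed ids, keeping the readable name."""
    return [{"id": i, "name": r["name"]} for i, r in enumerate(records)]

# Hypothetical raw records as a backend might return them.
records = [
    {"uuid": str(uuid.uuid4()), "name": "Jane Doe"},
    {"uuid": str(uuid.uuid4()), "name": "Acme Corp"},
]
friendly = to_agent_friendly(records)
```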

Sometimes agents need flexibility to interact with both natural language and technical identifiers for downstream calls (e.g., search_user(name='jane') -> send_message(id=12345)). Enable this by exposing a response_format enum parameter allowing agents to control whether tools return "concise" or "detailed" responses.

ResponseFormat enum example:

from enum import Enum

class ResponseFormat(str, Enum):
    DETAILED = "detailed"
    CONCISE = "concise"

Detailed responses provide comprehensive information with technical identifiers (206 tokens), while concise responses return only essential content (72 tokens). This allows agents to retrieve IDs for further tool calls while conserving tokens when only content is needed.
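
As a sketch, a search tool honoring such a parameter might look like the following. The message store and IDs are invented, and the enum is included again to keep the snippet self-contained:

```python
from enum import Enum

class ResponseFormat(str, Enum):
    DETAILED = "detailed"
    CONCISE = "concise"

# Hypothetical message store used only for illustration.
MESSAGES = [{"id": 12345, "author": "Jane", "text": "Ship it Friday."}]

def search_messages(
    query: str,
    response_format: ResponseFormat = ResponseFormat.CONCISE,
) -> str:
    """Return matching messages, with or without technical identifiers."""
    hits = [m for m in MESSAGES if query.lower() in m["text"].lower()]
    if response_format is ResponseFormat.DETAILED:
        # Include ids so the agent can make downstream tool calls.
        return "\n".join(f"[id={m['id']}] {m['author']}: {m['text']}" for m in hits)
    return "\n".join(f"{m['author']}: {m['text']}" for m in hits)
```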

Tool response structure—XML, JSON, or Markdown—also impacts evaluation performance. No one-size-fits-all solution exists since LLMs perform better with formats matching training data. Select response structures based on your own evaluations.

Optimizing Tool Responses for Token Efficiency

While context quality matters, so does context quantity. Implement pagination, range selection, filtering, and/or truncation with sensible default parameters for responses consuming significant context. Claude Code restricts responses to 25,000 tokens by default.

As effective agent context lengths grow, context-efficient tools remain essential.

When truncating responses, steer agents with helpful instructions. Directly encourage more token-efficient strategies like making multiple targeted searches rather than single broad searches. When errors occur, prompt-engineer responses to communicate specific, actionable improvements rather than opaque error codes or tracebacks.

Example of a truncated response:

Returns: Returned top 10 results. Use the offset parameter for more results.
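
A minimal sketch of a truncation helper that attaches this kind of guidance, so the agent is steered toward pagination instead of silently losing data (the limit and wording are illustrative):

```python
def truncate_results(results: list[str], limit: int = 10) -> str:
    """Return at most `limit` results, plus pagination guidance when truncated."""
    body = "\n".join(results[:limit])
    if len(results) > limit:
        body += (
            f"\n[Returned top {limit} of {len(results)} results. "
            "Use the offset parameter to retrieve more.]"
        )
    return body
```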

Example of unhelpful error response:

Returns: Error: 400 Bad Request

Example of helpful error response:

Returns: Error: Invalid customer_id format. Expected numeric ID (e.g., customer_id="12345"), received: "jane". Try searching by name first: search_customers(name="jane")

Truncation and error responses steer agents toward more token-efficient behaviors or provide correctly-formatted input examples.
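
The helpful error response above can be produced by a small formatting helper in the tool implementation. This is a sketch, assuming a hypothetical get_customer tool and a companion search_customers tool:

```python
def invalid_customer_id_error(received: str) -> str:
    """Build an actionable error message instead of an opaque status code."""
    return (
        f"Error: Invalid customer_id format. Expected numeric ID "
        f'(e.g., customer_id="12345"), received: "{received}". '
        f'Try searching by name first: search_customers(name="{received}")'
    )

def get_customer(customer_id: str) -> str:
    if not customer_id.isdigit():
        return invalid_customer_id_error(customer_id)
    return f"Customer {customer_id}: ..."  # actual lookup elided
```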

Prompt-Engineering Your Tool Descriptions

One of the most effective improvement methods is prompt-engineering tool descriptions and specifications. These load into agent context and collectively steer agents toward effective tool-calling behaviors.

When writing descriptions and specs, consider how you'd describe tools to new team members. Make implicit context explicit—specialized query formats, niche terminology definitions, resource relationships.

Avoid ambiguity by clearly describing expected inputs and outputs with strict data models. Input parameters should be unambiguously named: use user_id rather than user.
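
As an illustration of a strict, unambiguous input spec (the tool name and fields are invented), JSON Schema can pin down exactly what user_id means and reject anything extra:

```python
# Hypothetical tool spec demonstrating strict, unambiguous input naming.
create_ticket_spec = {
    "name": "support_tickets_create",
    "description": "Create a support ticket for an existing user.",
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {
                "type": "string",
                "description": "Numeric user ID, e.g. '12345' (not an email or name).",
            },
            "summary": {
                "type": "string",
                "description": "One-line summary of the issue.",
            },
        },
        "required": ["user_id", "summary"],
        "additionalProperties": False,
    },
}
```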

Evaluate the impact of prompt-engineering refinements. Even small changes to tool descriptions can yield dramatic gains. Claude Sonnet 3.5 achieved state-of-the-art performance on SWE-bench Verified after precise tool description refinements, dramatically reducing errors and improving task completion.

Additional best practices appear in the Developer Guide. If building tools for Claude, read about how tools load dynamically into Claude's system prompt. For MCP servers, tool annotations disclose which tools require open-world access or make destructive changes.

Looking Ahead

Building effective agent tools requires reorienting software development from predictable, deterministic patterns to non-deterministic ones.

Through the iterative, evaluation-driven process described, consistent patterns emerge for successful tools: they are intentionally and clearly defined, use agent context judiciously, combine together in diverse workflows, and enable agents to intuitively solve real-world tasks.

As mechanisms through which agents interact evolve—from MCP protocol updates to underlying LLM improvements—systematic, evaluation-driven approaches to improving tools ensure they evolve alongside increasingly capable agents.

Acknowledgements

Written by Ken Aizawa with contributions from colleagues across Research, MCP, Product Engineering, Marketing, Design, and Applied AI teams.