Identifying & auto-correcting agent failures: findings from TAU-bench

Nina
April 29, 2025
t-bench

When LLM agents fail in production, determining why and how to fix them is a significant challenge. At Atla, we've developed an approach to address this problem.

We've been diving deep into why agents fail and how to prevent those failures before they happen. Our latest research on τ-bench (a benchmark for tool-agent-user interactions) reveals patterns in agent error types and demonstrates how our EvalToolbox can improve agent performance by using evaluations in a real-time feedback system.

Explore a preview of our Atla EvalToolbox (launching soon) here, and sign up to join our user community.

Introduction

Traditional agent evals focus on aggregate success rates that obscure critical information about failure patterns. This approach leaves engineering teams manually reviewing thousands of lines of agent traces to diagnose issues, which is an unsustainable process as deployments scale. For production systems, knowing that an agent has a 40% success rate isn't sufficient. You need to efficiently understand and mitigate the 60% of cases where it fails.

Our EvalToolbox addresses this gap by automatically identifying, categorizing, and correcting common error types in agent workflows. By embedding real-time evaluation into the agent loop, we transform the traditional manual post-production analysis into a proactive system that not only diagnoses failures but automatically implements improvements to prevent them. This approach both improves reliability and dramatically reduces the engineering time spent reviewing agent traces, allowing teams to focus on core product development rather than failure remediation.

If you're building agent systems and struggling with reliability issues, we invite you to get in touch to explore how our agent eval approach can address your unique challenges.

Categorizing agent error types

Unlike traditional benchmarks that only report success rates, we dug into τ-retail, a subset focused on retail customer service, and cataloged why agents break. Our analysis reveals a taxonomy of distinct failure patterns:

  • Workflow Errors dominate the failure landscape, with “Wrong Action” making up the most failures in this category
  • User Interaction Errors also occur frequently, and “Wrong Information” given to the user is the most common failure mode overall
  • Tool Errors where a tool is called with “Wrong Arguments” make up the third major category

See Appendix for a full list and explanation of the error types.

What's particularly notable is the split between terminal failures (where the agent crashes and burns) and recovered errors (where the agent stumbles at a step but eventually finds its way). In nearly every category, terminal failures significantly outnumber recoveries. This highlights how difficult it is for agents to recover from errors without external intervention. This is especially true for error types where terminal failures are more likely to occur, such as “wrong information” and “wrong action.”

The evaluation effect: auto-correction in action

After identifying these error types, we integrated Selene, our purpose-built evaluation model, directly into agent workflows to see if we could self-correct the agent in action. 

Rather than merely measuring performance, Selene actively monitors each step of execution. When the correct action is taken, the agent proceeds; when the wrong action is taken, Selene gives feedback which guides the agent to correct its action. 

Here’s an example where an agent makes a “wrong information” failure:

The side-by-side comparison shows:

  • Without Selene: The agent persistently fails after its first “wrong information” error. It is ultimately unable to perform the customer’s desired action. ⭐⭐
  • With Selene: The agent gains the ability to detect its errors and implements corrections before cascading errors occur. It correctly fulfills the customer’s desired action. ⭐⭐⭐⭐⭐

This approach fundamentally changes how agents operate. Our EvalToolbox acts as a real-time quality control system, catching errors at each step and providing actionable feedback that helps the agent course-correct. 

From manual review to automated detection

Manually reviewing agent errors is prohibitively time-consuming. For companies deploying agents at scale, this creates a trade-off between reliability and development velocity.

Our EvalToolbox eliminates this trade-off by:

  1. Automating error detection based on common error types
  2. Providing real-time feedback with critiques that explain what went wrong
  3. Enabling self-correction by injecting feedback into the agent's workflow for immediate improvement

Current capabilities & next steps

Our τ-bench research is an early confirmation that agent errors follow predictable patterns that can be detected and addressed systematically. Our toolbox can detect common error types across diverse agent architectures, provide specific actionable feedback for self-correction, and integrate into existing agent workflows.

We're continuing to refine our understanding of agent error types and how to address them more effectively. Our upcoming work focuses on:

  • Extending benchmark coverage to more diverse task types, such as coding agents
  • Domain-specific failure detection for specialized industries
  • Creating a standardized protocol for agent evaluation-in-the-loop

Sign up to join our community, or get in touch to explore how we can help make your agents more reliable.

Appendix

Automatically detect errors in your AI with Atla.
Download our OSS model
Start for free
Start for free