When LLM agents fail in production, determining why and how to fix them is a significant challenge. At Atla, we've developed an approach to address this problem.
We've been diving deep into why agents fail and how to prevent those failures before they happen. Our latest research on τ-bench (a benchmark for tool-agent-user interactions) reveals patterns in agent failure modes and demonstrates how our EvalToolbox can improve agent performance by using evaluations in a real-time feedback system.
Explore a preview of our Atla EvalToolbox (launching soon) here, and sign up to join our user community.
Introduction
Traditional agent evals focus on aggregate success rates that obscure critical information about failure patterns. This approach leaves engineering teams manually reviewing thousands of lines of agent traces to diagnose issues, which is an unsustainable process as deployments scale. For production systems, knowing that an agent has a 50% success rate isn't sufficient. You need to efficiently understand and mitigate the 50% of cases where it fails.
Our EvalToolbox addresses this gap by automatically identifying, categorizing, and correcting common failure modes in agent workflows. By embedding real-time evaluation into the agent loop, we transform the traditional manual post-production analysis into a proactive system that not only diagnoses failures but automatically implements improvements to prevent them. This approach both improves reliability and dramatically reduces the engineering time spent reviewing agent traces, allowing teams to focus on core product development rather than failure remediation.
If you're building agent systems and struggling with reliability issues, we invite you to get in touch to explore how our agent eval approach can address your unique challenges.
Categorizing agent failure modes
Rather than stopping at success rates, we dug into τ-retail, the τ-bench subset focused on retail customer service, and cataloged why agents fail. Our analysis reveals a taxonomy of distinct failure patterns:

- Workflow Errors dominate the failure landscape, with “Wrong Action” accounting for the most failures in this category
- User Interaction Errors also occur frequently, and “Wrong Information” given to the user is the most common failure mode overall
- Tool Errors where a tool is called with “Wrong Arguments” make up the third major category
See Appendix for a full list and explanation of the failure modes.
What's particularly notable is the split between terminal failures (where the agent never recovers and the task fails) and recovered failures (where the agent stumbles but eventually finds its way). In nearly every category, terminal failures significantly outnumber recoveries, which highlights how difficult it is for agents to recover from errors without external intervention. Recovery is especially rare for “wrong information” and “wrong action” failures.
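To make the taxonomy concrete, here is a minimal sketch of how failures like these could be recorded when annotating a trace. The labels, field names, and example annotation are illustrative assumptions, not the exact schema used in our analysis.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(str, Enum):
    """Illustrative labels for the failure modes discussed above."""
    WRONG_ACTION = "wrong_action"            # Workflow Error
    WRONG_INFORMATION = "wrong_information"  # User Interaction Error
    WRONG_ARGUMENTS = "wrong_arguments"      # Tool Error


@dataclass
class FailureAnnotation:
    """A single failure observed in an agent trace."""
    mode: FailureMode
    step_index: int   # step of the trace at which the failure occurred
    terminal: bool    # True if the agent never recovered from the error
    critique: str     # natural-language explanation of what went wrong


# Hypothetical annotation: a terminal "wrong information" failure at step 3.
example = FailureAnnotation(
    mode=FailureMode.WRONG_INFORMATION,
    step_index=3,
    terminal=True,
    critique="Agent quoted a return policy that does not apply to the customer's order.",
)
```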
The evaluation effect: auto-correction in action
After identifying these failure modes, we integrated Selene, our purpose-built evaluation model, directly into agent workflows to see whether it could help agents self-correct in action.
Rather than merely measuring performance, Selene actively monitors each step of execution. When the correct action is taken, the agent proceeds; when the wrong action is taken, Selene provides feedback that guides the agent to correct its action.
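As a rough illustration of this step-level loop, the sketch below checks each proposed action before it executes and feeds any critique back to the agent. The `agent` and `evaluator` interfaces (`next_action`, `is_done`, `execute`, `evaluate_step`, and the `.passed`/`.critique` verdict shape) are assumptions made for the example, not Selene's actual API.

```python
def run_agent_with_eval(agent, evaluator, task, max_retries=2):
    """Run an agent step by step, checking each proposed action with an evaluator.

    `agent` and `evaluator` are placeholders for your own components; the
    evaluator is assumed to return an object with `.passed` and `.critique`.
    """
    history = [{"role": "user", "content": task}]
    while not agent.is_done(history):
        action = agent.next_action(history)           # propose the next step
        for _ in range(max_retries):
            verdict = evaluator.evaluate_step(history, action)
            if verdict.passed:
                break
            # Inject the critique so the agent can self-correct before executing.
            history.append({
                "role": "system",
                "content": f"Evaluator feedback: {verdict.critique}",
            })
            action = agent.next_action(history)       # retry with the feedback in context
        history.append(agent.execute(action))         # execute the accepted (or final) action
    return history
```

The key design choice is that evaluation happens before the action is executed, so a “wrong action” or “wrong arguments” error can be caught before it propagates into later steps.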
Here’s an example where an agent makes a “wrong information” failure:
The side-by-side comparison shows:
- Without Selene: The agent persistently fails after its first “wrong information” error. It is ultimately unable to perform the customer’s desired action. ⭐⭐
- With Selene: The agent detects its errors and applies corrections before failures cascade. It correctly fulfills the customer’s desired action. ⭐⭐⭐⭐⭐
This approach fundamentally changes how agents operate. Our EvalToolbox acts as a real-time quality control system, catching errors at each step and providing actionable feedback that helps the agent course-correct.
From manual review to automated detection
Manually reviewing agent failures is prohibitively time-consuming. For companies deploying agents at scale, this creates a trade-off between reliability and development velocity.
Our EvalToolbox eliminates this trade-off by:
- Automating failure detection based on common failure modes
- Providing real-time feedback with critiques that explain what went wrong
- Enabling self-correction by injecting feedback into the agent's workflow for immediate improvement (a minimal integration sketch follows below)
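One way to picture this integration is a thin wrapper around an existing step function, so detection and feedback injection happen without restructuring the agent. The decorator below and the evaluator interface it assumes are a hypothetical sketch, not the EvalToolbox API.

```python
from functools import wraps


def with_eval_feedback(evaluator, max_retries=1):
    """Wrap a step function so each proposed step is checked before it is used.

    `evaluator` is assumed to expose `evaluate_step(messages, step)` returning
    an object with `.passed` and `.critique`; this interface is illustrative only.
    """
    def decorator(step_fn):
        @wraps(step_fn)
        def wrapped(messages, *args, **kwargs):
            step = step_fn(messages, *args, **kwargs)
            for _ in range(max_retries):
                verdict = evaluator.evaluate_step(messages, step)
                if verdict.passed:
                    break
                # Append the critique so the next attempt sees what went wrong.
                messages.append({
                    "role": "system",
                    "content": f"Evaluator feedback: {verdict.critique}",
                })
                step = step_fn(messages, *args, **kwargs)
            return step
        return wrapped
    return decorator
```

In this pattern the existing agent code stays untouched; the wrapper only adds a check-and-retry pass around each step.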
Current capabilities & next steps
Our τ-bench research is an early confirmation that agent failures follow predictable patterns that can be detected and addressed systematically. Our toolbox can detect common failure modes across diverse agent architectures, provide specific actionable feedback for self-correction, and integrate into existing agent workflows.
We're continuing to refine our understanding of agent failure modes and how to address them more effectively. Our upcoming work focuses on:
- Extending benchmark coverage to more diverse task types, such as coding agents
- Domain-specific failure detection for specialized industries
- Creating a standardized protocol for agent evaluation-in-the-loop
Sign up to join our community, or get in touch to explore how we can help make your agents more reliable.
Appendix
