Autonomous AI agents are becoming increasingly central to applications in coding, customer support, and task automation, but a recurring challenge persists: they often fail. These failures are frequently opaque and can take hours of manual review and vibe checks to debug. We explored this in more detail in a previous blogpost on why agents continue to fail.
Our recent work aims to address this problem by offering a methodology to identify agent errors at step-level granularity. Building on our prior work with customer support agents in TAU-Bench, we have extended our analysis of error types into a new domain: coding agents. We analyzed traces from DA-Code, a benchmark for agent-based data science tasks, to expand our taxonomy of agent error types across domains.
Still, does understanding where and how agents fail actually help us fix them? Through an actor-critic framework, we explore promising ways to intervene at the step level to reliably improve agent outcomes.
🤖 If you’re building agents and want to cut time spent manually debugging, we’ve developed a tool to automatically identify agent error types. Reach out to learn more.
What kinds of errors do agents make?
When we analyzed how customer support agents were failing in TAU-Bench, we found that the errors fell into three broad categories: workflow errors, tool errors, and user interaction errors.
In DA-Code, agents are expected to solve complex programming tasks autonomously, requiring them to wrangle data, perform exploratory data analysis, and apply machine learning techniques. This context revealed a new class of errors: reasoning errors. Within reasoning errors, we identify three new fine-grained error types (a small annotation sketch follows the list):
- Incorrect logic: producing flawed logic (formal logic, coding logic, etc.).
- Hallucinated information: introducing fabricated information.
- Not following instructions: failing to adhere precisely to the given instructions.
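For teams who want to annotate their own traces against this taxonomy, here is a minimal sketch of the labels as Python enums. The schema is an assumption for illustration, not our internal annotation format; the full taxonomy is in the Appendix.

```python
from enum import Enum

class ErrorCategory(Enum):
    """Coarse-grained categories observed across TAU-Bench and DA-Code."""
    WORKFLOW = "workflow_error"
    TOOL = "tool_error"
    USER_INTERACTION = "user_interaction_error"
    REASONING = "reasoning_error"  # category surfaced by DA-Code

class ReasoningErrorType(Enum):
    """Fine-grained reasoning error types (see the Appendix for the full taxonomy)."""
    INCORRECT_LOGIC = "incorrect_logic"                        # flawed formal or coding logic
    HALLUCINATED_INFORMATION = "hallucinated_information"      # fabricated information
    NOT_FOLLOWING_INSTRUCTIONS = "not_following_instructions"  # deviates from the task spec

# A step-level annotation could then be as simple as:
annotation = {"step": 7, "category": ErrorCategory.REASONING,
              "type": ReasoningErrorType.INCORRECT_LOGIC, "recoverable": False}
```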
Error distributions across benchmarks
Using our expanded taxonomy of error types, we manually annotated errors in DA-Code (see Appendix for the full taxonomy). Among the broad categories, reasoning errors accounted for the majority of non-recoverable errors, and within reasoning errors, “incorrect logic” was the most common error type. Tool errors and workflow errors were also present in DA-Code, with the remaining errors split roughly evenly between those two categories.
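To give a sense of the tally behind these distributions, here is a rough sketch; the record format and example rows are hypothetical, not our actual annotation data.

```python
from collections import Counter

# Hypothetical step-level annotation records: (benchmark, coarse category, recoverable?)
annotations = [
    ("DA-Code", "reasoning_error", False),
    ("DA-Code", "tool_error", False),
    ("DA-Code", "workflow_error", False),
    ("TAU-Bench", "user_interaction_error", True),
    # ... one record per annotated error
]

# Count non-recoverable (terminal) errors per benchmark and category.
terminal_counts = Counter(
    (benchmark, category)
    for benchmark, category, recoverable in annotations
    if not recoverable
)
for (benchmark, category), count in terminal_counts.most_common():
    print(f"{benchmark:10s} {category:25s} {count}")
```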

We also found that the majority of errors across both benchmarks are terminal, meaning they result in task failure. This motivated us to consider corrective interventions.

To understand how failure modes shift between domains, we analyzed distributions of both the coarse-grained error categories and fine-grained error types across TAU-Bench and DA-Code. The patterns revealed by these distributions highlight where each domain places distinct pressure on the agent.
TAU-Bench errors were fairly balanced, with significant representation across user interaction, workflow, and tool errors. DA-Code, by contrast, was dominated by reasoning failures. Among fine-grained error types, “missing tool calls” was a recurrent issue across both benchmarks.


As we expand our research into more domains, we aim to build out a taxonomy that captures a comprehensive understanding of agent errors to robustly support targeted debugging across workflows.
🤖 If you’re building AI agents and want to identify how they fail, we’ve developed a tool based on these findings. Reach out to test it on your workflows.
The actor-critic framework for agent improvement
Once we examined how agents made errors, we wanted to test if this knowledge could actually help improve agents’ performance. Would finding step-level errors and then fixing them allow agents to successfully complete their tasks?
To test this, we adopted an “actor-critic” approach from reinforcement learning, framing the problem as a partially observable Markov decision process (similar to [1], [2], [3]). In this setup, the “actor” is the agent in question and executes a sequence of steps to complete a task. A “critic” evaluates each step and provides feedback to the agent to help it course-correct. We used GPT-4o as our agent, and experimented with using frontier models as well as humans as the critics.
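To make the setup concrete, here is a minimal sketch of the loop, assuming placeholder actor, critic, and environment wrappers (the propose/review/execute method names are ours); it is not our exact evaluation harness.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str        # tool call, code, or message proposed by the actor
    observation: str   # what the environment returned after executing the action
    critique: str      # critic feedback; empty string if the step looks fine

@dataclass
class Episode:
    task: str
    steps: list[Step] = field(default_factory=list)

def run_episode(task: str, actor, critic, env, max_steps: int = 20) -> Episode:
    """Actor proposes a step, the environment executes it, the critic reviews it.

    `actor`, `critic`, and `env` stand in for the GPT-4o agent, a frontier-model
    or human critic, and the benchmark environment. The only change from a plain
    agent loop is that critiques are fed back as extra context on the next step.
    """
    episode = Episode(task=task)
    for _ in range(max_steps):
        action = actor.propose(task, episode.steps)
        observation = env.execute(action)
        critique = critic.review(task, episode.steps, action, observation)
        episode.steps.append(Step(action, observation, critique))
        if env.is_done():
            break
    return episode
```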
We applied this paradigm to both TAU-Bench and DA-Code using a subset of problems. We selected problems that 1) had manually verified ground truths and 2) were “solvable,” meaning their instructions were specified well enough to arrive at the ground truth, and we chose the subset to contain a balanced mix of tasks the agent reliably got right or wrong. This setup let us test whether different kinds of critics, models or humans, could improve task success rates through mid-task intervention.
The results were surprising. Even strong models such as Claude 3.7 Sonnet and o4-mini struggled to meaningfully improve performance when inserted as critics. However, when humans played the role of critic by offering real-time, natural language feedback, the completion rate increased dramatically, by as much as 30 percentage points, reaching 80-90% completion rates on the subsets of TAU-Bench and DA-Code.


Here’s an example of how we inserted humans-in-the-loop to guide agent behavior in DA-Code. In this example, the agent was tasked with classifying fighters by weight class using structured data. It initially failed to categorize certain entries because the predefined weight class boundaries lacked sufficient decimal precision. A human intervened with a targeted critique—pointing out that the weight values in the dataset used two decimal places—and the agent adjusted its logic accordingly.
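To illustrate the kind of fix that critique triggers (the boundaries and values below are hypothetical, not the actual DA-Code data), the agent's logic before and after looks roughly like this:

```python
# Hypothetical before/after; the real DA-Code boundaries and values differ.

# Before: integer range endpoints leave gaps, so an entry at 135.45 lbs
# matches neither bantamweight (126-135) nor featherweight (136-145).
def classify_before(weight_lbs: float) -> str:
    ranges = {
        "flyweight": (116, 125),
        "bantamweight": (126, 135),
        "featherweight": (136, 145),
    }
    for name, (low, high) in ranges.items():
        if low <= weight_lbs <= high:
            return name
    return "uncategorized"

# After the critique ("weights carry two decimal places"): contiguous upper
# bounds ensure every decimal-valued weight lands in exactly one class.
def classify_after(weight_lbs: float) -> str:
    bounds = [("flyweight", 125.0), ("bantamweight", 135.0), ("featherweight", 145.0)]
    for name, upper in bounds:
        if weight_lbs <= upper:
            return name
    return "uncategorized"

print(classify_before(135.45))  # uncategorized -> the failure the human spotted
print(classify_after(135.45))   # featherweight
```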
This reflects a broader pattern we observed: critiques that are concrete and extremely precise lead to meaningful improvements in task completion. In our experiments, interventions like this enabled agents to recover from otherwise non-recoverable failures.

Notably, these improvements required no re-planning or re-prompting. The agent’s core capabilities were sufficient; it merely needed guidance at key failure points. The human-in-the-loop results serve as proof that there exists a set of gold-standard critiques that can enable a GPT-4o agent to reliably complete a subset of solvable and verifiable tasks. That is, a well-constructed critique can significantly boost agent performance by correcting step-level errors.
Toward automated critiques
Given the efficacy of human critics, we are working on improving automated methods to both effectively identify errors and provide actionable critiques to fix them.
In our actor-critic setup, adding a naive evaluator, such as a frontier model like o4-mini with a generic prompt, led to only marginal gains. By contrast, human-in-the-loop critics boosted completion rates by nearly 30 percentage points. This demonstrated that well-written critiques can significantly improve agent performance by helping the model recover from otherwise terminal errors.
We have started investigating how structured prompts might enable models to become better “critics”. By leveraging our error type taxonomy to guide prompt design, we observed measurable gains (close to 10%) over baseline critique strategies. Naive prompting alone won’t close the gap, but our early findings suggest that including error types in prompts can create precise critiques that nudge agents toward correction. We are continuing to experiment with this internally, and are working to bring this capability into future tooling.
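As a sketch of what a taxonomy-guided critic prompt might look like (the wording and helper below are illustrative assumptions, not our production prompt):

```python
# Error types from the taxonomy discussed above; extend with the remaining
# fine-grained types from the full taxonomy in the Appendix.
ERROR_TYPES = [
    "incorrect logic",
    "hallucinated information",
    "not following instructions",
    "missing tool calls",
]

CRITIC_TEMPLATE = """You are reviewing step {step_index} of an agent working on this task:
{task}

Agent's latest action:
{action}

Environment response:
{observation}

Check whether this step contains one of the following error types:
{error_types}

If it does, name the error type and give one concrete, actionable critique that
tells the agent how to correct this specific step. If the step is fine, reply
with "NO ERROR".
"""

def build_critic_prompt(task: str, action: str, observation: str, step_index: int) -> str:
    """Assemble a taxonomy-guided prompt for a critic model."""
    return CRITIC_TEMPLATE.format(
        step_index=step_index,
        task=task,
        action=action,
        observation=observation,
        error_types="\n".join(f"- {e}" for e in ERROR_TYPES),
    )
```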
Conclusion: from vibes to structure
Our work across TAU-Bench and DA-Code illustrates an approach to identifying AI agent errors, as well as promising directions for correcting errors. The key contributions include:
- An expanded error taxonomy that generalizes across domains
- A quantitative analysis of error distributions that informs debugging priorities
- An actor-critic framework for targeted interventions
- Early investigations into structured critic prompting to improve agent outcomes
Our methodology trades vibe checks for a principled path to understanding and fixing failures. We are currently working with leading teams building agents to help them automatically identify their agent failures. If you would like to explore this with us, get in touch.
Appendix
[Image: full taxonomy of agent error types]