Despite recent months of headline-grabbing advances, LLM agents still feel fragile. They can find and summarize research papers, write executable code, and complete multi-step reasoning tasks, but give agents slightly unfamiliar inputs or open-ended objectives, and they often fail in frustrating ways.
If you’re building with LLM agents today, you’ve likely felt this gap between what they promise and what they actually deliver. This post dives into why LLM agent performance continues to lag behind expectations, and why debugging agents is one of the most challenging problems in AI development.
The Performance Illusion
Agents look capable until they don’t
The latest generation of LLMs shows impressive performance on many popularly cited agent benchmarks. Claude 3.7 Sonnet achieves top performance on SWE-bench Verified, an agentic coding benchmark, with a resolution rate of 62.3%, and 70.3% with Anthropic’s own custom scaffolding over a reduced subset of problems [1]. On the retail agentic tool-use benchmark, TAU-bench (retail), Claude 3.7 Sonnet achieves an 81.2% pass^1 completion score [2], while o1 also scores an impressive 73.5%. Among open-source models, Qwen 3 and Mixtral also show significant improvements in tool integration and planning tasks.

Yet benchmark excellence, particularly on these popularly cited benchmarks, does not guarantee agent robustness in the wild.
[1] https://www.anthropic.com/news/claude-3-7-sonnet
[2] Pass^k measures reliability: whether the agent can complete the same task successfully across k independent trials. https://sierra.ai/blog/benchmarking-ai-agents
Real-world deployments surface fragility
On benchmarks that test more complex, multi-step tasks, agents begin to break down. Researchers at Carnegie Mellon University and Duke University tested frontier models on TheAgentCompany, a benchmark mimicking a range of real-world software developer tasks. Unlike SWE-bench, it goes beyond GitHub issues to include browsing the web, writing code, running programs, and communicating with coworkers. They found average success rates for all models were below 31%. Inspecting Claude 3.5 Sonnet and Llama 3.1 405B more closely, they found per-task completion rates fall under 41%, with some tasks failing outright at 0%.

In production settings, these shortcomings become more than just percentages. AI customer service agents hallucinate policies. Coding assistants generate boilerplate correctly but miss edge cases, producing failures that slip past testing and surface for users in production. In one anecdotal example, Cursor’s support agent confidently told a user that using the software on multiple machines was against company policy. In actuality, no such policy existed. The hallucination led to confusion, downstream complaints requiring human escalation, and a public response from the co-founder.

So what’s holding LLM agents back? Let’s unpack the key structural weaknesses.
The Limits of Current LLM Agent Architectures
1. No Long-Term Memory or Stable World Model
Despite larger context windows (up to 1M tokens in Gemini 2.5) and retrieval mechanisms, current agents still operate with short-term, prompt-bound memory. They forget goals, repeat actions, or contradict themselves over time. External memory systems and vector databases help but remain error-prone and require careful orchestration.
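To make that orchestration concrete, here is a minimal sketch of an external memory layer an agent could consult before each step. The class, its `embed` callable, and the retrieval parameters are illustrative assumptions, not any particular vendor’s API; production systems typically swap this layer out for a vector database.

```python
import math

class AgentMemory:
    """Minimal external memory: store notes with embeddings, retrieve by similarity.

    `embed` is a placeholder for whatever embedding model the agent uses
    (a callable mapping str -> list[float]); this is a sketch, not a product API.
    """

    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (text, vector) pairs

    def remember(self, text: str) -> None:
        self.entries.append((text, self.embed(text)))

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Rank stored notes by cosine similarity to the query and return the top k.
        qv = self.embed(query)
        scored = [(self._cosine(qv, v), t) for t, v in self.entries]
        scored.sort(reverse=True)
        return [t for _, t in scored[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb or 1.0)
```

The agent would call `recall()` at the start of each step and prepend the results to its prompt; the hard parts in practice are deciding what to store, when to evict stale memories, and how to keep retrieved notes from contradicting the current context.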
2. Lack of Determinism and Control
LLMs are inherently probabilistic generators. The same prompt can yield different results across runs. Approached in an unstructured way, this makes debugging and reliability work challenging: a fix that works in one instance might fail the next.
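One way to make this tangible is to measure run-to-run variance before trusting a fix. The sketch below assumes a `call_model(prompt)` helper that wraps whatever client you use (prompt in, text out); it is not tied to any specific SDK.

```python
from collections import Counter

def variance_probe(call_model, prompt: str, runs: int = 10) -> Counter:
    """Call the model repeatedly with an identical prompt and count distinct outputs.

    `call_model` is assumed to be a thin wrapper around your LLM client.
    A wide spread of outputs is a warning that a single-run "fix" may not hold.
    """
    outputs = Counter()
    for _ in range(runs):
        outputs[call_model(prompt).strip()] += 1
    return outputs
```

If 10 runs produce 7 distinct answers, the step is effectively non-deterministic even at low temperature, and tests should assert on structure (valid JSON, correct tool name) rather than exact strings.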
3. Error Types Are Repetitive but Hard to Detect
Our research on TAU-bench and DA Code (forthcoming) shows that most agent errors fall into a small set of categories: incorrect logic, missing tool calls, instruction-following errors, and tool misuse. In addition, many of these errors result in terminal failures, wherein agents can’t recover once they go off-track. These patterns highlight the need for structured error taxonomies and better detection tools to surface impactful errors quickly.
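As a rough illustration of what a structured taxonomy can look like in code, here is a minimal sketch; the category names mirror the list above, and the record fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto

class AgentErrorType(Enum):
    INCORRECT_LOGIC = auto()        # flawed reasoning or plan
    MISSING_TOOL_CALL = auto()      # a required tool was never invoked
    INSTRUCTION_FOLLOWING = auto()  # agent ignored or misread instructions
    TOOL_MISUSE = auto()            # wrong tool, wrong arguments, or bad sequencing

@dataclass
class AnnotatedError:
    step_index: int                 # where in the trace the error occurred
    error_type: AgentErrorType
    terminal: bool                  # did the agent fail to recover afterwards?
    note: str = ""
```

Tagging traces with records like these lets you aggregate failures by category and prioritize the error types that most often end runs.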
4. Underdeveloped Tool Use
Agents interface with tools like web search, calculators, and APIs. While models like Claude and Gemini support tool calling, execution remains fragile, and our research suggests that tool-related errors are among the most frequent error types. In benchmarks designed for tool use, such as the m&m’s benchmark, models often fail to produce valid multi-step tool plans, struggling with argument prediction, sequencing, and execution reliability.
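A common mitigation is to validate a proposed tool call against the tool’s declared argument schema before executing it. The sketch below uses plain Python type checks as a stand-in; production systems often rely on JSON Schema or similar, and the schema format here is an assumption for illustration.

```python
def validate_tool_call(tool_schema: dict, call: dict) -> list[str]:
    """Check a model-proposed tool call against a simple schema before running it.

    `tool_schema` maps argument names to expected Python types,
    e.g. {"query": str, "max_results": int}. Returns a list of problems;
    an empty list means the call looks well-formed.
    """
    problems = []
    args = call.get("arguments", {})
    for name, expected_type in tool_schema.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}, "
                            f"got {type(args[name]).__name__}")
    for name in args:
        if name not in tool_schema:
            problems.append(f"unexpected argument: {name}")
    return problems
```

If problems are found, the agent can be re-prompted with the error list instead of executing a malformed call and failing several steps later.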
Fixing LLM Agent Failures Is Challenging
For developers, the most frustrating challenge might not actually be that agents fail in the first place, but rather how difficult it is to understand and debug those failures.
As LLM agents scale in complexity, across multi-step planning, tool chaining, and dynamic code generation, they introduce a fundamental challenge: credit assignment. That is, determining which steps in an execution trace contributed to eventual success or failure. In simple agents, this may be obvious. In complex agents with branching logic and tool use, localized errors may lead to cascading failures, while visible missteps might not affect outcomes at all.
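One practical, if expensive, way to probe credit assignment is counterfactual replay: re-run the task with a single step’s output replaced by a known-good value and check whether the outcome flips. A minimal sketch, assuming a deterministic `run_task` replay function and a set of hand-corrected step outputs (both hypothetical):

```python
def localize_failure(run_task, trace: list[dict], gold_outputs: dict) -> int | None:
    """Counterfactual replay: patch one step at a time with a known-good output.

    `run_task(patched_trace)` is assumed to replay the agent deterministically
    (e.g. from cached model calls) and return True on task success.
    `gold_outputs` maps step indices to corrected outputs for those steps.
    Returns the earliest step whose correction flips the run to success.
    """
    for i, fixed in sorted(gold_outputs.items()):
        patched = [dict(step) for step in trace]  # shallow copy of each step
        patched[i]["output"] = fixed
        if run_task(patched):
            return i    # this step is at least partly responsible for the failure
    return None         # no single-step fix explains the failure
```

This only works when replay is deterministic, which is exactly why the reproduction tooling discussed below matters.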

This disconnect between local actions and global results makes debugging especially difficult. Developers are often forced to review entire execution traces manually to debug their applications. Recent surveys from Microsoft and Cisco have shown that developers spend close to 60% of their time debugging code rather than building new features, pointing to a need for better observability into failures. Think about that: some of the most seasoned engineers and researchers are spending the bulk of their time combing through logs just to figure out where an agent tripped up, rather than developing and improving the agents themselves.

Let’s lay out the key challenges.
1. The More Autonomy, the More Complexity
Modern LLM agents generate and execute code, plan across multiple tools, and even revise their own goals. Debugging these systems means debugging a dynamic, self-altering program.
2. Failure Is Often Distributed and Delayed
A failure may appear at step 10 but stem from a subtle error at step 2. Tracing the root cause across a multi-step plan is difficult. Local fixes may not change outcomes.
3. Non-Reproducibility Makes Bugs Elusive
Because model outputs vary run to run, bugs can be transient. One fix might appear to work by chance. Another might fail only under specific inputs.
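A simple mitigation is to record model responses keyed by a hash of the request, so a failing run can be replayed exactly while you investigate. A minimal sketch, with `call_model` again standing in for your client rather than any real SDK:

```python
import hashlib
import json

class RecordingCache:
    """Record-and-replay wrapper: cache each model response keyed by its request.

    On the first (live) run, responses are recorded; replaying the same run
    later returns the recorded responses, making the failure reproducible.
    """

    def __init__(self, call_model):
        self.call_model = call_model
        self.records: dict[str, str] = {}

    def _key(self, prompt: str, params: dict) -> str:
        # Stable hash over the full request so identical calls map to one record.
        blob = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def __call__(self, prompt: str, **params) -> str:
        key = self._key(prompt, params)
        if key not in self.records:
            self.records[key] = self.call_model(prompt, **params)
        return self.records[key]
```

Persisting the recorded responses to disk alongside the trace turns a flaky failure into a fixed test case.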
4. Traditional Tooling Doesn’t Scale
Current observability tools mostly offer logging and traces, which scale poorly as agents grow more complex. As agents produce longer chains, use more tools, and gain autonomy, a smarter, more automated way to extract insights from those logs becomes necessary.
Better Observability and Control

To build robust LLM agents, developers need more than raw model capability. They need clarity.
Here’s what’s becoming essential:
- Execution Trace Analysis: Rich, interpretable traces of thought-action-observation sequences.
- Structured Failure Taxonomies: Categorized, repeatable understanding of why failures occur (e.g., planning failures, context drift, tool misuse).
- Failure Reproduction Aids: Tools that capture execution traces, inputs, and random seeds to approximate failure conditions.
- Long-Term Memory Architectures: Persistent memory structures that outlive the context window.
We’re starting to see promising early work in these areas, from open-source evaluation frameworks to research into critic models and observability systems.
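To make the first two items above concrete, here is a sketch of a structured trace record that pairs each thought-action-observation step with an optional failure tag; the field names are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    thought: str                    # model's reasoning for this step
    action: str                     # tool name, or "respond" for a final answer
    action_input: dict              # arguments passed to the tool
    observation: str                # what came back from the tool or environment
    failure_tag: str | None = None  # e.g. "tool_misuse", filled in during review

@dataclass
class ExecutionTrace:
    task_id: str
    steps: list[TraceStep] = field(default_factory=list)

    def failures(self) -> list[tuple[int, str]]:
        """Return (step index, tag) pairs for every tagged failure in the trace."""
        return [(i, s.failure_tag) for i, s in enumerate(self.steps) if s.failure_tag]
```

Even this small amount of structure makes it possible to aggregate failures across runs instead of rereading raw logs.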
The Future of LLM Agents Depends on Better Debugging
LLM agents are becoming more powerful, but also more opaque, less predictable, and more failure-prone. As they take on more autonomy and real-world responsibility, the cost of failure rises dramatically.
Yet our tools for understanding, debugging, and guiding these agents are still rudimentary. We can’t scale agent capabilities without also scaling oversight. If we want LLM agents to act reliably in open-ended environments, we need to start treating them as systems that require clear diagnostics, not just vibe checks.
Breakthroughs in memory, planning, tool use, and interpretability will help. But so will better debugging frameworks: ones that reduce developer burden, surface root causes, and make AI behavior legible.
We’re working closely with a few leading teams to refine our agent debugging tools. If you’re building and want to help shape what comes next, we’d love to talk. Get in touch.