Autonomous AI agents are increasingly being deployed to tackle cognitively demanding, multi-step tasks. A great example of this is the “browser” or “deep research” agentic pattern. In this pattern, a user asks a question that typically involves extensive online research (e.g. “what can I see on a 7-day trip to London in July?”) in response to which an agent might search the internet for promising leads, visit websites, collate material/references/links, and distill these into a comprehensive and up-to-date summary.
Yet even the most sophisticated deep research agents remain brittle. In this study we investigate why, using GAIA, a popular benchmark for general-purpose AI assistants in which over 75% of tasks require web browsing capabilities. As our agent, we used Open Deep Research, an open-source multi-agent system built on Hugging Face's smolagents library that achieves competitive performance on GAIA. Our goal was to understand the reasons for failures and identify opportunities for structured intervention.
This research builds directly on our prior work with DA-Code, a coding agent benchmark, and TAU-Bench, a customer service agent benchmark. GAIA introduces new challenges—most notably through its open-ended scope, planning requirements, and multi-agent communication—that shed new light on where and why research agents break down.
We are building an evaluation and improvement platform for agents based on this research. To use it to find and fix your agent failures, book a demo.
A New Benchmark, A New Class of Errors
The GAIA benchmark emphasizes tasks that require information retrieval and reasoning in open-ended internet environments. To make the problem tractable, we focus on a text-only subset of GAIA level 1 tasks—similar to prior work like TRAIL. Even within this constrained set, Open Deep Research fails on roughly 40% of tasks, consistent with its overall performance on the entire set of level 1 tasks.
We took a principled approach to annotating errors in GAIA: we first open-coded errors and then organized them into a taxonomy, closely aligned with the error analysis approach of Shankar & Husain. We performed an initial analysis of Open Deep Research traces and refined the taxonomy with additional error types as needed.
The resulting error taxonomy includes familiar categories from our prior work—such as reasoning and tool-use errors—but also introduces new classes specific to planning and multi-agent coordination. Two new types of errors stood out:
- Wrong information to agent: a communication error where one sub-agent relays incorrect information to another, a failure mode unique to multi-agent systems.
- Planning errors: a category of errors unique to agents that generate explicit plans about future steps, and which include errors such as “inadequate plan generation.”
These new types reflect GAIA’s unique demands on agent autonomy and coordination.
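To make the taxonomy concrete, here is a minimal sketch of how these categories might be encoded in annotation tooling. The `ErrorType` enum and its member names are hypothetical stand-ins that paraphrase our labels, not the exact schema we used.

```python
from enum import Enum

class ErrorType(Enum):
    # Categories carried over from our earlier DA-Code and TAU-Bench taxonomies
    INCORRECT_LOGIC = "reasoning: incorrect logic"
    HALLUCINATED_INFORMATION = "reasoning: hallucinated information"
    TOOL_USE_ERROR = "tool use: incorrect or failed tool call"
    # New categories surfaced by GAIA's planning and multi-agent demands
    INADEQUATE_PLAN_GENERATION = "planning: inadequate plan generation"
    MISSING_TOOL_CALL = "planning: missing tool call"
    WRONG_INFORMATION_TO_AGENT = "communication: wrong information relayed to another sub-agent"
    OTHER = "other"
```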
Why Browser Agents Break: Distribution of Error Types
After our initial analysis, we asked a group of annotators to go through failed traces step by step. Annotators first identified whether a step contained an error, described the nature of that error, and then categorized it using our fine-grained error taxonomy. We also gave annotators the option of picking an "other" category if their description did not match any of the existing categories.

To understand why browser agents fail, we compiled the distribution of error types across the annotated GAIA traces. The dominant sources of failure are:
- Planning Errors — especially missing tool calls, where the agent skips a critical action (e.g., failing to scroll down a page to find information).
- Reasoning Errors — including incorrect logic and hallucinated information, which reflect flawed inference or premature conclusions about page content.
Notably, the “other” category—used when annotators felt none of the predefined error types applied—accounted for only ~10% of annotations. This suggests that the taxonomy, while not exhaustive, captures the overwhelming majority of real failure modes.
Planning-related errors—especially omitted tool calls—are the most common and consequential form of failure in Open Deep Research. Designing agents that can plan and follow through remains an unsolved problem.
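For readers who want to reproduce this kind of breakdown on their own traces, here is a minimal sketch of the aggregation. The `StepAnnotation` record and its field names are hypothetical stand-ins for our internal annotation format.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    # One annotator's judgment about one step of a failed trace (hypothetical schema)
    trace_id: str
    step_index: int
    has_error: bool
    error_type: str   # e.g. "planning: missing tool call" or "other"
    description: str

def error_type_distribution(annotations: list[StepAnnotation]) -> dict[str, float]:
    """Return the share of each error type among steps annotated as erroneous."""
    counts = Counter(a.error_type for a in annotations if a.has_error)
    total = sum(counts.values())
    return {error_type: count / total for error_type, count in counts.items()}
```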
Finding the Breakpoint: Terminal Error Localization
Understanding where an agent fails in a task is just as important as understanding why. To this end, we evaluated inter-annotator agreement on the location of terminal errors—defined as the step in the trace from which the agent could no longer recover and ultimately failed the task.
[Figure 3: How far apart independent annotators placed the terminal error, in steps]
[Figure 4: Error locations chosen by individual annotators in an example trace]
Annotators consistently agreed on these locations, with Krippendorff's alpha reaching 0.8 for terminal error spans. Figure 3 shows that, most of the time, independent annotators agree on the location of the terminal error to within 0 steps, i.e. they converge on exactly the same step. Figure 4 further shows where individual annotators located errors in an example trace, with most selecting "terminal error" at step 11.
This finding suggests that terminal error locations can be identified reliably. We are excited about this because it could give developers high-leverage targets for prioritizing and fixing the errors that cause agents to break.
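As a rough illustration of how agreement like this can be measured, the sketch below pools pairwise distances between annotators' terminal-error locations across traces. The input format is hypothetical, and this complements rather than replaces the Krippendorff's alpha we report above.

```python
from itertools import combinations

def terminal_error_agreement(labels: dict[str, dict[str, int]]) -> dict[int, float]:
    """Distribution of pairwise distances (in steps) between annotators'
    terminal-error locations, pooled over all traces.

    `labels` maps trace_id -> {annotator_id: terminal_error_step}.
    A distance of 0 means two annotators picked exactly the same step.
    """
    distances = [
        abs(a - b)
        for per_trace in labels.values()
        for a, b in combinations(per_trace.values(), 2)
    ]
    total = len(distances)
    return {d: distances.count(d) / total for d in sorted(set(distances))}
```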
Can We Fix These Errors? Evaluator-in-the-Loop
To test whether these failures are correctable, we inserted humans and models as evaluators into the agent’s loop using smolagents’ step-level callbacks. These evaluators could intervene mid-trace, inspecting and potentially correcting the agent’s behavior before it failed.
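As a rough sketch of the wiring, the snippet below registers an evaluator as a smolagents step callback that inspects the step the agent just finished and appends feedback to its observations. The `human_critique` helper is a hypothetical stand-in for our evaluators, the tool and model choices are arbitrary examples, and callback signatures vary across smolagents versions, so treat this as illustrative rather than as our exact setup.

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, VisitWebpageTool, LiteLLMModel

def human_critique(observations: str) -> str:
    # Hypothetical human-in-the-loop evaluator: show the latest observations
    # and collect optional corrective feedback.
    print("--- latest step observations ---")
    print(observations[:2000])
    return input("Feedback for the agent (leave blank to continue): ").strip()

def evaluator_callback(memory_step, agent=None):
    # Inspect the step the agent just completed; if the evaluator spots a
    # problem (e.g. a skipped tool call), append feedback to the step's
    # observations so the agent sees it on its next turn.
    observations = getattr(memory_step, "observations", None)
    if observations is None:
        return  # e.g. planning steps may not carry observations
    feedback = human_critique(observations)
    if feedback:
        memory_step.observations = observations + f"\n[Evaluator feedback] {feedback}"

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
    model=LiteLLMModel(model_id="gpt-4o"),  # example model id; use whatever your setup supports
    step_callbacks=[evaluator_callback],
)
```

Appending the critique to the step's observations is one simple way to make it visible to the agent on its next turn; a model evaluator can be substituted for the human prompt in the same hook.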
We focused on a subset of "reliably failing" GAIA tasks: those the agent failed in all 10 of our runs, but which we confirmed were solvable. The results were striking:
- Human evaluators improved success rates from 58.6% to 78.6%.
- Model evaluators (Claude 3.7 Sonnet and o4-mini) had negligible impact, performing on par with baseline.
[Figure: task success rates with human and model evaluators in the loop vs. baseline]
This mirrors our previous findings in DA-Code, where human critiques—especially when targeted and precise—helped agents recover from otherwise terminal errors. Even though our attempts to use large language models as evaluators were less successful, we are excited to continue researching ways to close this gap by building evaluators that give more precise and actionable feedback. We are already starting to see promising results (10-20%) on TAU-Bench with more focused evaluators in the loop.
To stay informed, follow Atla on LinkedIn and X.
Lessons for Developers and Researchers
GAIA reaffirms a pattern we observed in DA-Code and TAU-Bench, but at a more ambitious scale. As agents take on more complex tasks—planning, coordination, and reasoning—their errors become both richer and more difficult to anticipate. However, several clear lessons emerge for practitioners:
- Prioritize planning and tool use: These are the most common error types in deep research agents. Addressing skipped steps or vague plans yields high leverage.
- Focus on terminal errors: Locating the exact error that caused a failure can be done reliably, and it is a useful signal for prioritizing fixes.
- Use human feedback to close the loop: Structured mid-execution critiques from humans can dramatically improve outcomes.
We are continuing to build tooling that leverages our research for teams building real-world agents. If you’re working on browser or deep research agents and want to explore better debugging and evaluation tools, we’d love to hear from you.
Book a demo to see how Atla can identify your agent’s errors, surface patterns, and suggest improvements.