Beyond Basic Observability: How Fieldy uses Atla alongside LangSmith to ship agent improvements twice as fast

Atla team
September 1, 2025
"I think the top factor would be for me how much time I save and also stress.
Half of our improvements would not be done if not for Atla...”

Karolis Mariunas — AI Engineer at Fieldy

Background

Fieldy is an AI note-taker trusted by professionals to capture and organize their daily conversations, with over $1.5M in sales. It securely captures conversations and transforms them into reminders, summaries, and tasks, building a comprehensive personal memory system with short- and long-term memory about users and their connections.

Supporting 100+ languages with features like smart memories, chat functionality, and tailored reminders, Fieldy’s AI assistant operates a sophisticated multi-agent system:

  • Reminder agents that extract tasks and commitments from long conversations with precise time handling
  • Chat agents that answer questions about past meeting notes using semantic search tools to retrieve relevant conversation memories
  • Customer support agents that handle device troubleshooting and app functionality questions
  • Summarization agents that generate customized summaries for different meeting types as directed by users

The Challenge

As Fieldy enhanced its conversational AI assistant to provide more personalized responses with longer memory windows and improved user experiences, the development team encountered unique obstacles:

1) Agent consistency issues creating user friction:

Fieldy’s Chat agent showed inconsistent tool usage patterns, sometimes entering query loops when information didn't exist in the database or making multiple redundant tool calls instead of asking users for clarification. With tens of thousands of traces logged to LangSmith each month, manually analyzing these behaviors to identify patterns was overwhelming, even with the help of custom LLM judge metrics.

"Otherwise I would need to do this manually, analyze the chat conversation traces, which is very hard to do when you're going at it from scratch."

2) Context mixing degrading response quality:

Users of the wearable note-taker sometimes experienced frustrating delays and less accurate responses when the agent retrieved large chunks of irrelevant conversations, filling the context window and diluting the relevance of answers about their personal history.

3) Lack of systematic evaluation for real world usage:

While the team had static test sets for evaluation, they lacked visibility into how changes performed against real-world edge cases in production. This made deployments feel uncertain.

The Result: Streamlined Agent Development with Atla

1) Effortless integration into existing workflow:

Fieldy integrated Atla seamlessly, and within a couple of days of processing production traffic gained clear visibility into error patterns across their entire system.

2) High-impact improvements identified and shipped:

Atla's error pattern detection enabled Fieldy to quickly identify and resolve critical issues:

  • Tool call consistency: Reduced redundant tool calls from frequent occurrences to <0.5% of traces by setting up custom metrics in Atla to track multiple-tool-call patterns and refining system prompts accordingly
  • Agent routing efficiency: Eliminated unnecessary agent handoffs and reduced latency by removing the routing selector agent that was causing confusion about appropriate transfer timing
  • Context optimization: Resolved mixing problems by refining their reranker and switching from memory summaries to full conversation transcripts for specific information queries

These improvements led to more consistent agent behavior and enabled users to reliably access information from their expanding personal conversation archive, captured by the voice recorder.

"[The new agent architecture] is faster, cleaner. A lot less errors."

3) Confident development workflow with growing agent complexity:

Rather than getting lost in individual traces, Karolis now uses Atla's error patterns to systematically understand which issues occur most frequently in production. After shipping changes, he uses the Compare tab to visualize performance differences across different versions.

"I like conceptually being able to compare them on a very specific level, but also seeing the performance of the whole system, V1 versus V2.”

By providing systematic error detection in production, Atla transformed Fieldy's development approach from chaotic, uncertain iterations into agent improvements with clear before/after comparisons. Fieldy shifted from reactive debugging to error pattern analysis, maintaining quality while confidently shipping advanced personalization capabilities like extended memory windows.

Ready to accelerate your agent development? Start for free below.

Find and fix agent failures with Atla.
Download our OSS model
Start for free