Processing millions of product queries monthly, JOSEPHA helps users find the same product cheaper across different retailers. But their sophisticated multi-agent system would silently fail in production, creating patterns that were nearly impossible to detect manually.
Background
JOSEPHA revolutionizes price comparison by helping users find the same product cheaper across different retailers. Instead of presenting users with ad-heavy search results and blue links, their agents automatically scour the internet to aggregate all relevant product information: video tutorials, test reviews, price comparisons across retailers, and detailed specifications—all focused on helping users make informed purchasing decisions and find the best deals.
Their system operates through a sophisticated multi-stage pipeline:
- Specialized agents analyze user requests and identify relevant products across the web
- Multiple agents work in parallel to find and match better offers for comparison
- Dedicated agents gather comprehensive product information from multiple retailer sources
- Verification agents ensure the correctness of the aggregated information
This architecture processes approximately 70K product queries daily, extracting pricing, specifications, and availability data from retailer websites across multiple languages and markets.
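The four stages above could be orchestrated roughly as follows. This is a minimal sketch: every function is an illustrative stub standing in for an agent, and none of the names reflect JOSEPHA's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the four-stage pipeline; all functions are
# illustrative stubs, not JOSEPHA's actual code.

def analyze_request(query):
    """Stage 1: identify candidate products for the user query (stubbed)."""
    return [{"query": query, "product": "example-product"}]

def find_better_offers(product):
    """Stage 2: match cheaper offers across retailers (stubbed)."""
    return {**product, "offers": [{"retailer": "shop-a", "price": 19.99}]}

def gather_product_info(offer):
    """Stage 3: enrich with specs, reviews, and availability (stubbed)."""
    return {**offer, "specs": {}}

def verify(record):
    """Stage 4: keep only records that pass verification (stubbed)."""
    return bool(record.get("offers"))

def process_query(query):
    products = analyze_request(query)
    # Offer matching runs in parallel, mirroring the parallel agents above
    with ThreadPoolExecutor() as pool:
        offers = list(pool.map(find_better_offers, products))
    enriched = [gather_product_info(o) for o in offers]
    return [r for r in enriched if verify(r)]
```

The key structural point is that stage 2 fans out in parallel while stages 3 and 4 consume its merged results, which is what makes failures in any one stage hard to spot at this scale.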
The Challenge
As JOSEPHA's multi-agent system processed millions of product queries monthly, the team faced critical visibility gaps that threatened their core value proposition:
1) Production failures buried in overwhelming scale:
With millions of execution traces generated monthly across their multi-stage pipeline, manually analyzing agent behavior was practically impossible. Despite extensive testing in development, the team lacked visibility into how their agents performed against real-world edge cases on diverse retailer websites.
2) Unable to prioritize fixes based on user impact:
Without systematic error detection, the team couldn't distinguish between one-off failures and patterns affecting thousands of users. This made it impossible to prioritize which agent improvements would have the highest impact on helping users find products cheaper—their core mission.
The Result: Three Critical Failure Patterns Discovered
Within hours of integrating Atla into their production workflow, three distinct failure patterns surfaced—each directly affecting JOSEPHA's core value proposition of helping users find identical products at lower prices.
Pattern 1: Numerical Computation Failures
The Issue: The extraction agent would consistently skip VAT calculations despite explicit instructions. On sites showing only net prices and the note “excluding VAT”, the system extracted the base price but failed to add the required 19% VAT. As a result, products were compared on inconsistent bases (net vs. gross), producing misleading results and undermining JOSEPHA’s promise to help users find the same product cheaper.
Atla's Recommendations:
- Refine the instruction to explicitly state: "The price must always be the gross price, including VAT. If only the net price is available, you must calculate the gross price using a 19% VAT rate."
- Consider implementing a dedicated `vat_calculator` function or tool that takes `net_price` and `vat_rate` as parameters and returns the gross price. The system should be explicitly instructed or constrained to call this tool when a VAT-inclusive price is required and only a net price is found.
- Implement a `vat_rate_search()` function that takes the product category and the country as input and returns the VAT percentage.
The Impact: Users received non-standardized price information, making accurate cross-retailer comparisons difficult.
Pattern 2: Price Extraction Failures
The Issue: Beyond calculation errors, the system would sometimes fail to extract any price information, leaving price and currency fields empty even when this data was available in the webpage content.
Digging into the highlighted traces showed that these failures occurred most often on pages with complex pricing layouts or non-standard price formats, yet the pattern was consistent enough to affect a meaningful portion of product searches.
Atla's Recommendation:
- Consider adding a fallback step: if no price is found, trigger a separate extraction focused solely on the price
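A sketch of what that fallback step might look like, assuming the page content arrives as stripped text. Both the function name and the regex are illustrative; the pattern covers only common "€12.34" / "12,34 EUR" style formats, not JOSEPHA's production logic.

```python
import re

# Illustrative fallback extractor; the regex handles only common euro price
# formats and is an assumption, not JOSEPHA's production pattern.
PRICE_RE = re.compile(
    r"(?:€|EUR)\s*(\d+(?:[.,]\d{2})?)"   # currency marker before the number
    r"|(\d+(?:[.,]\d{2})?)\s*(?:€|EUR)"  # or after it
)

def fallback_price_extraction(page_text):
    """Second-pass extraction focused solely on finding a price."""
    match = PRICE_RE.search(page_text)
    if match is None:
        return None                      # genuinely no price on the page
    raw = match.group(1) or match.group(2)
    return float(raw.replace(",", "."))  # normalize the decimal separator
```

In the pipeline, this would run only when the primary extraction leaves the price field empty, so the main extraction prompt stays focused on full product data.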
The Impact: Products appeared without pricing information, forcing users to manually visit retailer websites—directly undermining JOSEPHA's value proposition.
Atla Tip: Add these examples to your eval dataset for the web data extraction microservice and refine your prompt to ensure correct price extraction.
Pattern 3: Neglecting Product Discounts
The Issue: When products displayed both original and discounted prices, the extraction agent would often capture the crossed-out original price instead of the current selling price. This occurred because website content had been stripped of HTML structure during preprocessing, leaving the agent with two undifferentiated numbers and no signal to identify which one reflected the true selling price.
This pattern was particularly problematic during promotional periods where accurate sale price extraction directly impacts user purchasing decisions.
Atla's Recommendations:
- Preserve HTML tags so the agent can distinguish between original and sale prices
- Provide few-shot examples that explicitly demonstrate how to identify and prioritize the current selling price when both an original and discounted price are present on a webpage
- Update the prompt to clearly instruct that the lowest visible price, typically bolded or prominently displayed, should be selected as the current selling price, especially when multiple prices are shown
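One way to act on the first recommendation: with HTML structure preserved, crossed-out original prices are typically wrapped in `<del>`, `<s>`, or `<strike>` tags, so a pre-filter can remove them before the agent ever sees two undifferentiated numbers. This is a sketch built on that common markup convention, not JOSEPHA's actual preprocessing.

```python
from html.parser import HTMLParser

# Illustrative pre-filter; assumes the common convention that crossed-out
# prices sit inside <del>/<s>/<strike> tags. Not JOSEPHA's actual code.
STRIKE_TAGS = {"del", "s", "strike"}

class StrikethroughStripper(HTMLParser):
    """Drop text inside strikethrough tags, keeping everything else."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level of strikethrough tags
        self.kept = []   # text fragments outside strikethrough regions

    def handle_starttag(self, tag, attrs):
        if tag in STRIKE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in STRIKE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.kept.append(data)

def strip_crossed_out_prices(html):
    parser = StrikethroughStripper()
    parser.feed(html)
    return " ".join(part.strip() for part in parser.kept if part.strip())

# Only the current selling price survives the filter:
print(strip_crossed_out_prices('<del>€49.99</del> <b>€39.99</b>'))  # €39.99
```

Filtering at preprocessing time is cheaper and more reliable than asking the model to disambiguate two bare numbers after the structural signal has been thrown away.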
The Impact: Users saw inflated pricing that could cause them to miss genuine deals or make decisions based on outdated information.
Atla's comparison feature allowed the JOSEPHA team to validate improvements by examining agent behavior before and after prompt refinements, ensuring that fixes for one pattern didn't inadvertently create new failure modes elsewhere in the system.
Ready to accelerate your agent development? Start for free below.