Iterating Quickly == Success
The core principle of effective product development is fast iteration cycles. In the world of GenAI, however, traditional software development principles fall short. Unlike deterministic systems, language models operate in a probabilistic space, making it challenging to:
- Develop effective unit tests
- Define success metrics
- Evaluate system-level improvements
- Iterate quickly and effectively
The key to unlocking the full potential of language models lies in the ability to swiftly and accurately make informed decisions about system changes, such as updating your base model, your retriever, or your prompts. This accelerated feedback loop is what distinguishes products that generate millions in revenue from prototypes that never make it into production.
The Challenge of Evaluation
As AI applications are designed to interact with and serve real users, effective feedback should represent human preferences. While having domain experts review log data remains the gold standard for robust evaluations, this approach comes with challenges:
- High Costs: Hiring experts to label data can be prohibitively expensive, especially in fields that require highly skilled workers
- Time Constraints: Human evaluations can significantly slow down iteration speed
- Capability Gap: It is already hard for non-experts to evaluate AI systems for accuracy, and for difficult tasks this can extend to domain experts. As AI continues to advance, this gap will likely widen. We wrote a blog post on the near-future implications of this
LLM-as-a-Judge
To overcome these challenges, the ‘LLM-as-a-Judge’ framework incorporates an LLM into the evaluation workflow. Specifically, great AI-assisted evals help developers with:
- Speed: Rapidly determine whether an update represents an improvement or a regression
- Scale: Assess large volumes efficiently and compare specific examples vs. prior runs
- Consistency: Apply uniform eval criteria across iterations ⇒ avoid playing whack-a-mole
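To make this concrete, here is a minimal sketch of what an AI judge call can look like in code. The prompt, the 1-5 accuracy scale, and the choice of GPT-4o are illustrative assumptions, not a recommended setup:

```python
# A minimal AI-judge sketch using the OpenAI Python SDK.
# The judge model, rubric, and integer score format are placeholder choices.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the QUESTION for factual accuracy on a scale of 1 (poor) to 5 (excellent).
Return only the integer score.

QUESTION: {question}
RESPONSE: {response}"""


def judge(question: str, response: str, model: str = "gpt-4o") -> int:
    """Ask the judge model to score a single question/response pair."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variance
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return int(completion.choices[0].message.content.strip())
```

In practice you would also log the raw judge output and handle malformed scores rather than casting directly to an integer.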
Selecting the right model as your AI judge is crucial. Here are three key factors to consider:
- Evaluation Capabilities: Look for a model that closely matches your human labels. Most developers start by using a strong general-purpose model as their AI judge; GPT-4o or Claude 3.5 Sonnet, for example, can be a good starting point. There is also a growing collection of evaluation-specific models
- Reproducibility: To avoid optimizing towards a moving target, you will want to get the same evaluation outcomes over the same test samples (a simple way to check this is sketched after this list). During our tests, GPT-4 / GPT-4-Turbo yielded consistent results only about 45% of the time, whereas Claude 3.5 Sonnet was more reliable, with 70% reproducibility
- Rate-limits: Often, you will want to run large batches of evaluations. Depending on your rate limits, you might not want to use the same model that you use in production
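One way to check reproducibility is to score the same test samples several times with the same judge and count how often the verdicts agree across runs. A rough sketch, assuming a `judge(question, response)` helper like the hypothetical one sketched earlier:

```python
def reproducibility_rate(samples, judge_fn, runs: int = 3) -> float:
    """Fraction of samples that receive the identical score on every repeated run."""
    stable = 0
    for question, response in samples:
        scores = [judge_fn(question, response) for _ in range(runs)]
        if len(set(scores)) == 1:  # all runs agreed
            stable += 1
    return stable / len(samples)


# e.g. reproducibility_rate(test_samples, judge, runs=3) == 0.70 would mean
# 70% of samples got the same verdict on every run.
```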
At Atla, we are building frontier evaluation-specific models to act as AI judges. We are designing our models with these considerations in mind. Please sign up here to access our most powerful models.
Making your LLM evals great
- Identify success metrics: Examine your outputs (or have domain experts do so) to determine what will make your product great. You might start with standard metrics like hallucination and recall before moving on to custom metrics that address your use case.
- Align your evals: Bring the model-based evaluator in line with your judgements. This meta-evaluation step is crucial to ensure the quality of your model-based evals. To achieve this, we can:
- Test various eval prompts
- Fine-tune the eval metric with challenging few-shot examples
- Validate the alignment by measuring agreement between your human annotators and the AI judge (e.g., with Cohen's kappa, as sketched below)
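For the agreement measurement, scikit-learn's `cohen_kappa_score` is one straightforward option. A minimal sketch, assuming you have collected parallel pass/fail labels from humans and from your AI judge over the same test samples (the label values below are placeholders for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Pass/fail labels for the same set of test samples (1 = pass, 0 = fail).
# These values are made up for illustration.
human_labels = [1, 0, 1, 1, 0, 1, 0, 1]
judge_labels = [1, 0, 1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.75 here; values closer to 1.0 mean better alignment
```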
We’ve built an AI judge alignment app to support this. It is currently in private beta. Please get in touch if you think this would be helpful for your team.
Improving your application
With a reliable quality indicator in place, you can now iterate on various aspects of your GenAI application, e.g. experimenting with different LLMs, testing different prompts, or assessing your RAG setup.
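One simple pattern is to score a fixed test set under each candidate configuration and compare average judge scores before promoting a change. A sketch, again assuming the hypothetical `judge` helper from earlier and placeholder pipeline functions:

```python
def mean_judge_score(test_questions, generate_fn, judge_fn) -> float:
    """Average judge score of one system variant over a fixed test set."""
    scores = [judge_fn(q, generate_fn(q)) for q in test_questions]
    return sum(scores) / len(scores)


# baseline = mean_judge_score(test_questions, current_pipeline, judge)
# candidate = mean_judge_score(test_questions, new_prompt_pipeline, judge)
# Promote the candidate only if it clearly beats the baseline on the same samples.
```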
Pro tip: If you find yourself limited by prompt engineering, consider synthetically generating data and using the scores from your AI judge to filter out low-quality samples. This process involves creating a large volume of synthetic examples that cover a wide range of potential inputs and desired outputs for your application. You can then use your calibrated AI evaluator to score each example, keeping only those that meet a pre-defined quality threshold. The resulting high-quality dataset can be used to fine-tune your model and improve its responses, e.g. aligning them to a desired syntax or grammatical style.
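A sketch of the filtering step, assuming your calibrated judge scores each synthetic (prompt, response) pair on a 1-5 scale; the threshold is a placeholder you would tune against your own quality bar:

```python
QUALITY_THRESHOLD = 4  # placeholder; tune against your own quality bar


def filter_synthetic_data(examples, judge_fn, threshold: int = QUALITY_THRESHOLD):
    """Keep only synthetic (prompt, response) pairs the judge rates at or above the threshold."""
    return [
        (prompt, response)
        for prompt, response in examples
        if judge_fn(prompt, response) >= threshold
    ]


# fine_tuning_set = filter_synthetic_data(synthetic_examples, judge)
```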
Some takeaways
- Model-based evals are powerful, but they shouldn't completely replace human judgment. Use them to augment and scale human insights
- Choose and align your AI judge carefully. The closer the agreement between your human evaluations and your AI-assisted evals, the happier your users will be.
- Your evaluation framework should evolve alongside your AI. Regularly revisit and refine your LLM judge's prompts and criteria