Judge or Jury: What’s the right approach for LLM evaluation?

Maurice
November 19, 2024

LLM-based evaluations are rapidly becoming the de facto method for evaluating LLM systems. 

Most engineering teams who build rigorous AI evaluations today use a single prompted LLM – typically one of the large proprietary models – to run their evals in the “LLM-as-a-judge” framework. But there’s been increasing interest in using a panel of smaller, fine-tuned models to evaluate different aspects of AI-generated content – “LLM-as-a-jury”.

What’s the best approach?

We’ll admit, we have a dog in this fight: we opted to train one large, general-purpose evaluator rather than a jury of smaller models, and we’ll get to why later. But the jury is still out on whether we’re right (pun intended). Let’s take a look at both approaches, explore their benefits and drawbacks, and consider why you might use one over the other, depending on your use case.

LLM-as-a-Judge

This approach uses a single LLM to evaluate AI-generated content. It’s popular because it’s easy to set up and highly flexible: with few-shot prompting, you can quickly tailor the model to evaluate whatever criteria you’re interested in, on the fly.
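To make that concrete, here’s a rough sketch of what a prompted judge can look like, assuming the OpenAI Python SDK. The model name, the “faithfulness” criterion, and the 1–5 rubric are illustrative choices, not a recommendation.

```python
# A minimal LLM-as-a-judge sketch. Assumes the OpenAI Python SDK; the model name,
# the "faithfulness" criterion, and the 1-5 rubric are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an evaluator. Score the RESPONSE for faithfulness to the CONTEXT
on a scale of 1-5, then explain your score in one sentence.

Example:
CONTEXT: The Eiffel Tower is 330 m tall.
RESPONSE: The Eiffel Tower is roughly 300 m tall.
SCORE: 4 - Close to the stated height, but slightly understated.

CONTEXT: {context}
RESPONSE: {response}
SCORE:"""

def judge(context: str, response: str) -> str:
    # One call to a large general-purpose model; the few-shot example in the
    # prompt is what tailors it to the criterion you care about.
    completion = client.chat.completions.create(
        model="gpt-4o",  # any capable general-purpose model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(context=context, response=response)}],
        temperature=0,
    )
    return completion.choices[0].message.content
```

Swapping in a new criterion is just a prompt edit, which is exactly why this setup is so convenient during development.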

But it does have drawbacks. Large models are more resource-intensive and tend to have slower inference times, which makes them less suitable as guardrails in latency-sensitive applications. And, as your application scales, using a large model to monitor much or all of your traffic can be prohibitively expensive.

LLM-as-a-Jury

The jury approach uses multiple smaller, specialized models, each typically evaluating a specific criterion. This can be a good way to capture narrow criteria – and a faster, cheaper way of doing it. Ensembling multiple models can also, in theory, reduce bias by offering a more diverse set of evaluative perspectives.
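Here’s a rough sketch of the pattern, assuming one prompted juror per criterion and a naive average as the overall verdict. The model names and rubrics are placeholders; in practice each juror might be a different small, fine-tuned model, possibly from a different family.

```python
# A jury sketch: one small prompted juror per criterion, plus simple score pooling.
# Model names and rubrics are placeholders, not a recommendation.
from statistics import mean
from openai import OpenAI

client = OpenAI()

# criterion -> (juror model, rubric); in practice these might be distinct fine-tuned specialists
JURY = {
    "relevance":    ("gpt-4o-mini", "Does the RESPONSE address the CONTEXT? Score 1-5."),
    "faithfulness": ("gpt-4o-mini", "Is the RESPONSE supported by the CONTEXT? Score 1-5."),
    "tone":         ("gpt-4o-mini", "Is the RESPONSE polite and professional? Score 1-5."),
}

def juror_score(model: str, rubric: str, context: str, response: str) -> int:
    # Each juror only has to answer one narrow question.
    out = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"{rubric}\n\nCONTEXT: {context}\nRESPONSE: {response}\n\nReply with a single integer.",
        }],
        temperature=0,
    )
    return int(out.choices[0].message.content.strip())

def jury_verdict(context: str, response: str) -> dict[str, float]:
    scores = {criterion: juror_score(model, rubric, context, response)
              for criterion, (model, rubric) in JURY.items()}
    scores["overall"] = mean(scores.values())  # naive pooling; weighted or majority votes also work
    return scores
```

Even in this toy form, you can see where the overhead comes from: every juror needs its own prompt, its own output parsing, and its own alignment to the criterion it owns.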

In some ways, this approach is nothing new. Ensemble learning, where you combine multiple algorithms to improve the accuracy and reliability of predictions, has long been an approach in machine learning. The winning team of the 2009 Netflix Prize – offering $1m to whoever improved the accuracy of Netflix’s recommendation engine by 10% – used an ensemble of over 100 different algorithms. Yet Netflix never implemented their approach, because “the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment”. More on this later.

The downside is that these small models lack the reasoning capabilities of larger models, so their critiques can be misleading. Building the tooling to call and coordinate multiple models is complicated, and small models can be unwieldy when you need them to return output in a specific format – which somewhat dilutes the potential benefits of this approach.

Judge v jury: Weighing cost, performance, and bias

Cohere published a paper earlier this year, testing a panel of LLM evaluators across three judge settings and six different datasets. They found that this jury approach outperforms a single large judge, exhibits less bias, and is less expensive. Let’s examine these claims.

Cost

At the time the paper was published, cost was likely a more compelling factor in favor of the jury approach, but the cost of inference has dropped significantly since then. While the newest frontier models, like o1-preview, remain relatively expensive – $15 per million input tokens and $60 per million output tokens – the gap in cost per token between smaller, specialist models and powerful, general-purpose models is shrinking. And as the cost of inference continues to drop, the cost savings alone might not justify the added complexity of a jury of evaluators for many applications.

(Source: https://arxiv.org/pdf/2404.18796)

Performance

The Cohere paper also claims that a panel of LLMs outperforms a single large judge, but I’m skeptical of the results. The authors struggled to get GPT-4 to do well on the benchmarks, and the model displayed some counterintuitive behaviors – I was surprised to see GPT-3.5-turbo doing better than GPT-4! Also, it’s questionable whether a jury of smaller models would outperform a single large judge when evaluating more complex or nuanced content.

Bias

Finally, the paper found reduced evaluation bias when using a panel of LLMs. This makes sense, because you essentially average out the bias or overfitting of any particular model in your panel. But this effect assumes the smaller models were trained on diverse data distributions, which isn’t always the case. If the models in your panel were trained on similar data, the bias-mitigation benefit will be weaker – so panel selection has a big effect.

Implementation challenges

Regardless of the theoretical advantages, the biggest practical hurdle for a jury evaluation system is coordinating multiple models. Setting up and aligning multiple models to specific evaluation criteria, again and again, can be a lot of overhead.

We know, because we’ve been there. For our first customers, we fine-tuned models for specific use cases, similar to the jury approach. Our evaluators performed well, but customer needs evolve. By the time we had built and aligned a model to one evaluation task, customers had often shifted their focus to some other feature or new criteria.

That’s why we’re building one large, general-purpose evaluator. Users can configure it on the fly through prompts, depending on the product and criteria they’re evaluating.

This becomes harder with the jury approach, because aligning one model to a task doesn’t guarantee consistency across the others. Each model can respond differently to the same prompt, so you end up engineering prompts individually for, say, three different models, rather than aligning a single, more capable one. And this effect is exacerbated if the models come from different families.

The jury’s not out

That said, there are applications where I can see an advantage for a panel of smaller models. 

During development, the LLM-as-a-judge approach shines. It gives you the flexibility to experiment with new criteria by tweaking prompts, without needing to re-align multiple models. And latency is less of a concern in this phase, making the slower inference speed of a large model an acceptable trade-off for its versatility.

In production, however, latency becomes more important – especially in applications like chatbots, where real-time interactions are key. Here, a jury of smaller, narrowly focused models could work well as guardrails at inference. These models could quickly filter out the worst AI responses or ensure compliance with specific rules, prioritizing speed over nuanced reasoning.
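As a rough sketch of that guardrail pattern, you might run several narrow checks in parallel and block a response if any of them fails. The checks below are trivial stand-ins for small, fast classifiers, and the latency budget is an illustrative number.

```python
# A guardrail sketch for the production case: run narrow checks concurrently and
# fail closed if any check fails or the latency budget is blown. The check bodies
# are trivial stand-ins for small classifiers; the 200 ms budget is an assumption.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as PoolTimeout

def check_toxicity(response: str) -> bool:
    # Stand-in for a small toxicity classifier; True means "safe to send".
    return "you idiot" not in response.lower()

def check_pii(response: str) -> bool:
    # Stand-in for a PII detector; crude heuristic for illustration only.
    return "@" not in response

def check_policy(response: str) -> bool:
    # Stand-in for a compliance / policy model.
    return "guaranteed returns" not in response.lower()

CHECKS = [check_toxicity, check_pii, check_policy]

def passes_guardrails(response: str, budget_s: float = 0.2) -> bool:
    with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
        try:
            results = pool.map(lambda check: check(response), CHECKS, timeout=budget_s)
            return all(results)
        except PoolTimeout:
            # A check blew the latency budget: block the response rather than wait.
            return False
```

The point isn’t the specific checks; it’s that each one is narrow and fast enough to sit in the request path, which is where a single large judge struggles.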

Scaling for the future

But as AI applications become more complex, the value of a single judge that is capable of generalized reasoning grows. The future of LLM evaluation – I’m thinking particularly of agentic systems – will involve interactive, back-and-forth assessments, rather than zero-shot prompts, which will require models with deeper reasoning abilities.

In this context, vertical scaling – that is, scaling the performance of one model, rather than scaling horizontally with a larger panel of smaller models – makes more sense. A state-of-the-art evaluation model will out-reason a panel of smaller models on tasks that demand deep reasoning. To put it crudely: would you rather have one great mathematician think deeply about a problem for an hour, or several good mathematicians answer in five minutes and take the average? For complex tasks, depth of reasoning often wins.

The jury approach has its applications, and I’m interested to see what other use cases open up, but I think a general-purpose evaluator is the more robust, scalable solution for most development and production needs. And as the cost of inference continues to drop, the advantages of a single, comprehensive evaluator will only become more pronounced.