Motivation
How do we align AI systems on tasks that are difficult for humans to evaluate?
AI systems will probably become very powerful and perform tasks that are difficult for us to verify. When that happens, we will want a way to check their work. To achieve this, we need to enable humans to oversee AI systems that are solving tasks too difficult for a single human to evaluate.
This need became apparent when we developed our first GenAI application as part of Y Combinator last year. Our AI-generated responses were plausible, yet even domain experts found them hard to verify for accuracy.
Using an ‘LLM-as-a-Judge’ has become a common way to have one large language model (LLM) evaluate another’s output. This approach is useful but raises a new, interesting problem: you now have to align the evaluating LLM as well, which involves careful prompt engineering and verification. We call this meta-evaluation.
In this post we will talk about training AI evaluators, and some motivating research that led us to our solution. If you want to use our evaluation models to run more accurate evals of your GenAI app, you can sign up to Atla for free here.
Training an LLM evaluator vs. training an LLM
When debugging models, it helps to understand the evaluator's reasoning. Knowing why an output meets certain criteria helps engineers understand how their AI fails. Thus, AI evaluators should provide both scores and free-text critiques in their evaluations. This requires a robust integration of both language generation and classification/regression capabilities. This dual focus makes the training of LLM evaluators unique.
We implement two main differences when training an LLM evaluator compared to a standard transformer-based language model:
- Auxiliary heads: A 'head' refers to one or more final layers in the architecture that are specialized for a particular task. Decoder-only transformers are trained to predict the next token in a sequence, and standard LLM training focuses on minimizing the loss associated with predicting these tokens accurately. An LLM evaluator, however, is equipped not only with the language model head but also with classification or regression heads (i.e. auxiliary heads). To reach higher agreement between our AI evaluator’s scores and those given by human evaluators, we train evaluators on a compound loss that combines the individual losses from both kinds of head. The relative weight of each loss is set by a hyperparameter. Below is an illustration of our evaluator’s architecture:
- Training data curation: Evaluator models are trained heavily on 'preference data', which includes annotations of human preferences (e.g. upvotes). This data helps the model learn to prioritize outputs that align with human judgments. Compared with a general-purpose model, our evaluation models are trained on a disproportionately large share of text data that includes an indication of human preference.
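To make the compound loss concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not our training code: the function names, the use of mean squared error for the auxiliary regression head, and the `lm_weight` hyperparameter that mixes the two losses are all hypothetical choices made for this example.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def language_model_loss(token_logits, target_tokens):
    """Mean next-token cross-entropy over a sequence (the language model head)."""
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, target_tokens)]
    return sum(losses) / len(losses)


def score_loss(predicted_score, human_score):
    """Squared error between the auxiliary head's score and the human score.

    Assumption: a regression head trained with squared error. A classification
    head would use cross-entropy over score buckets instead.
    """
    return (predicted_score - human_score) ** 2


def compound_loss(token_logits, target_tokens, predicted_score, human_score,
                  lm_weight=0.5):
    """Weighted sum of both head losses; lm_weight is the mixing hyperparameter."""
    if not 0.0 <= lm_weight <= 1.0:
        raise ValueError("lm_weight must be between 0 and 1")
    lm = language_model_loss(token_logits, target_tokens)
    aux = score_loss(predicted_score, human_score)
    return lm_weight * lm + (1.0 - lm_weight) * aux


# Toy example: a two-token critique over a four-token vocabulary,
# plus a predicted score of 3.0 against a human score of 5.0.
toy_logits = [[2.0, 0.1, 0.1, 0.1], [0.1, 0.1, 3.0, 0.1]]
toy_targets = [0, 2]
print(compound_loss(toy_logits, toy_targets, 3.0, 5.0, lm_weight=0.7))
```

Setting `lm_weight` closer to 1 favors fluent critiques, while a value closer to 0 favors scores that track the human labels. In practice the value would be tuned like any other hyperparameter.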
Case Study: Elicit, Aligning an AI Research Assistant
A few months ago, we started working with the team at Elicit to improve their AI research assistant. Our goal was to enhance the accuracy and reliability of Elicit’s ‘paragraph summary’ and ‘chat with papers’ tasks, two crucial components of their application.
We began with a series of evaluations comparing our baseline model with OpenAI’s GPT-4, which had been carefully prompt-engineered to act as an evaluator of Elicit’s AI-generated answers. At the start, the baseline performance of our model was inferior to that of GPT-4, reaching only slight agreement with the human annotators (0.14), whereas GPT-4 showed at least fair agreement with humans (0.30).
Our research team improved the training data by generating high-quality synthetic datasets for our evaluation model. We employed a multi-stage process to generate and filter this synthetic data, investing heavily in tooling to ensure it matched the original distribution. This approach significantly improved performance, bringing our model's agreement close to that of human raters.
The table below shows the agreement between our baseline model, Atla-1, GPT-4, and human annotators on Elicit’s ‘paragraph summary’ task, expressed as inter-rater reliability measured with Cohen’s Kappa.
Cohen's Kappa ranges from -1 to 1:
- Values < 0 indicate no agreement
- 0 indicates agreement by chance
- 0.01 to 0.20 indicates slight agreement
- 0.21 to 0.40 indicates fair agreement
- 0.41 to 0.60 indicates moderate agreement
- 0.61 to 0.80 indicates substantial agreement
- 0.81 to 1.00 indicates almost perfect agreement
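For readers who want to compute this metric on their own annotations, the following is a minimal sketch of Cohen's Kappa for two raters, together with a helper that maps a value onto the bands listed above. The function names and the toy labels are illustrative assumptions and are not taken from the Elicit evaluation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Map a kappa value onto the agreement bands listed above."""
    if kappa < 0:
        return "no agreement"
    if kappa == 0:
        return "agreement by chance"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy example: a human annotator and an evaluator model label four answers.
human = ["good", "good", "bad", "bad"]
model = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(human, model)
print(kappa, interpret_kappa(kappa))
```

Unlike raw percentage agreement, Kappa subtracts the agreement two raters would reach by chance given how often each uses each label, which is why it is the more conservative number to report.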
Preparing for Advanced AI: Bootstrapping Evaluations
Our work on AI evaluators aims to contribute to a safer future with advanced AI. As models get more powerful, they will handle tasks which humans cannot understand or evaluate easily. We believe AI evaluators will play a key role in ensuring future AI generations work as intended.
There are two streams of motivating research that led us in this direction:
The first is a recent paper from UC Berkeley by Shreya Shankar et al., titled "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." It tackles a problem similar to ours and explores how to improve the reliability of LLM evaluators. The authors suggest a mixed-initiative approach that combines automated suggestions with human input, so that the evaluation criteria stay well aligned with human expectations.
The second is a paper from OpenAI’s Superalignment team by Collin Burns et al., titled "Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision." It explores how supervision from weak models can enhance the performance of stronger models. The authors find that techniques such as an auxiliary confidence loss and bootstrapping can significantly improve generalization, which points to potential ways of aligning superhuman AI models with human oversight.
As AI takes on more roles in critical areas, having evaluators that can operate with minimal human guidance while still maintaining alignment will be essential. We're building these systems now to ensure that as AI becomes more powerful, it remains safe and beneficial, guided by evaluations we can trust.
Conclusion
As AI advances, automated evaluators will be crucial in bridging the gap between machine capabilities and human expectations. ‘LLM-as-a-Judge’ evaluators are powerful tools for assessing generative AI systems but pose new challenges in prompt engineering and human preference alignment. Atla’s evaluators address this problem with an improved model architecture, training algorithms, and data curation that together enable more accurate AI-assisted evaluations. With Atla, we're enabling teams to build, assess, and refine their AI applications with greater confidence and efficiency. If you haven’t already, sign up for Atla for free here.