Use Selene with LangWatch’s Evaluation Wizard

Atla team
May 6, 2025

Today we’re announcing our integration with LangWatch, an all-in-one LLMOps platform. This integration allows developers to run Atla’s evaluation model Selene 1 as an “LLM-as-a-Judge” within LangWatch’s Evaluation Wizard feature. We’re excited to bring Selene to a wider community of AI developers through LangWatch’s platform!

LangWatch empowers AI teams to monitor, evaluate, and optimize LLM performance. Their Evaluation Wizard feature is particularly powerful, allowing developers to run evaluations that fit their needs: whether offline or real-time, against expected answers, with default RAG evaluators, or with an “LLM-as-a-Judge.” 

When are “LLM-as-a-Judge” evaluations most useful?

Automated LLM-as-a-Judge evaluations use one language model to evaluate and score the outputs of another. This method is commonly integrated into evaluation workflows to efficiently score large volumes of LLM outputs. It also lets developers customize their evaluation metric by specifying an evaluation prompt.
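
To make the pattern concrete, here is a minimal sketch of an LLM-as-a-Judge loop, using the OpenAI SDK as a stand-in judge. The prompt and model name are illustrative placeholders; in LangWatch you simply select Selene from the Evaluation Wizard rather than wiring this up yourself.

```python
# Minimal LLM-as-a-Judge sketch. The judge model and prompt here are
# illustrative placeholders, not the Selene integration itself.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

JUDGE_PROMPT = """Evaluate how helpful the response is to the user query.
Reply with a single integer from 1 (not helpful) to 5 (exceptionally helpful).

Query: {query}
Response: {response}"""

def judge(query: str, response: str, model: str = "gpt-4o-mini") -> int:
    """Score one model's output with another model acting as the judge."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, response=response),
        }],
    )
    return int(completion.choices[0].message.content.strip())

print(judge("How do I reset my password?", "Click 'Forgot password' on the login page."))
```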

Using a dedicated evaluation model, like Selene, as an LLM Judge can offer significant benefits, such as increased accuracy. Selene, for example, outperforms frontier models, including OpenAI’s o-series and Anthropic’s Claude, across 11 commonly used evaluator benchmarks.

Running evals with Selene is free for LangWatch users for the next two months. Sign up for LangWatch to get started.

Use cases

You can use Selene as an LLM Judge in LangWatch to run offline evaluations over datasets, or to run real-time evaluations for live monitoring or as a guardrail. We provide demo videos for both use cases below. To learn more about these features, head to LangWatch’s docs. 

1. Offline evals over a dataset

Let’s say we have a general-purpose chatbot. We created a test dataset of sample questions, then generated answers using a model (here, GPT-4o mini) and a simple system prompt.
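
For illustration, generating those answers might look like the sketch below. The questions and system prompt are invented examples; only the GPT-4o mini call reflects the setup described above.

```python
# Sketch: generating test-dataset answers with GPT-4o mini and a simple
# system prompt. Questions and system prompt are invented for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

SYSTEM_PROMPT = "You are a helpful general-purpose assistant."
questions = [
    "How do I brew pour-over coffee?",
    "What's a good beginner workout routine?",
]

dataset = []
for question in questions:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    dataset.append({"question": question,
                    "answer": completion.choices[0].message.content})
```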

We evaluate the answers using a score evaluator. We select Atla Selene as our LLM Judge, and copy-paste the following evaluation prompt to evaluate for “helpfulness,” since we want our chatbot to be helpful. You can find more evaluation prompts by signing up for Atla and logging in to the Eval Copilot.

Evaluate how helpful the response is to address the user query.
Score 1: The response is not at all useful, failing to address the instruction or provide any valuable information.
Score 2: The response has minimal usefulness, addressing the instruction only superficially or providing mostly irrelevant information.
Score 3: The response is moderately useful, addressing some aspects of the instruction effectively but lacking in others.
Score 4: The response is very useful, effectively addressing most aspects of the instruction and providing valuable information.
Score 5: The response is exceptionally useful, fully addressing the instruction and providing highly valuable information.
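
If you want to sanity-check this rubric outside LangWatch, a rough sketch of running it against Atla’s open-weights Selene Mini via Hugging Face transformers follows. The model ID is our assumption; check Atla’s Hugging Face page for the exact name.

```python
# Sketch: scoring one answer against the helpfulness rubric with the
# open-weights Selene Mini model. The model ID is an assumption; verify
# it on Atla's Hugging Face page before running.
from transformers import pipeline

judge = pipeline("text-generation", model="AtlaAI/Selene-1-Mini-Llama-3.1-8B")

# Paste the full five-point rubric from above into `rubric`.
rubric = """Evaluate how helpful the response is to address the user query.
Score 1: ...
Score 5: ..."""

query = "How do I reset my password?"
answer = "Click 'Forgot password' on the login page."

messages = [{"role": "user",
             "content": f"{rubric}\n\nQuery: {query}\nResponse: {answer}"}]
output = judge(messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])  # Selene's score and critique
```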

As a result, we see how Selene scores and critiques each answer. You can then use these critiques to refine your system prompt and make your chatbot more helpful!

2. Real-time evals as guardrails

Let’s say we have a customer support chatbot. For this example we uploaded a sample dataset of customer support questions and answers, though normally you would use this feature for real-time evals.

We evaluate the answers using a boolean (true/false) evaluator. We select Atla Selene as our LLM Judge, and copy-paste the following evaluation prompt to evaluate for “relevance,” since we want guardrails against irrelevant answers. If our chatbot produces an irrelevant answer, we want Selene to flag it as FALSE so the answer is never shown to the user.

Evaluate if the response fulfills the requirements of the instruction by providing relevant information. This includes responding in accordance with the explicit and implicit purpose of the given instruction. If the response fulfills the requirements, mark TRUE. If not, mark FALSE.
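
In application code, the gating around that TRUE/FALSE verdict might look like the sketch below. The `evaluate_relevance` callable is a stand-in for however you invoke Selene (e.g., via LangWatch’s real-time evaluations); only the guardrail pattern is shown.

```python
# Sketch of a guardrail around a boolean relevance verdict. The judge
# callable is a stand-in for the actual Selene call.
from typing import Callable

FALLBACK = "Sorry, I couldn't find a relevant answer. Let me connect you with support."

def guard_answer(question: str, answer: str,
                 evaluate_relevance: Callable[[str, str], str]) -> str:
    """Show the answer only if the judge marks it relevant (TRUE)."""
    verdict = evaluate_relevance(question, answer)  # expected: "TRUE" or "FALSE"
    return answer if verdict.strip().upper() == "TRUE" else FALLBACK

# Example with a dummy judge that flags the off-topic "healthcare" answer:
dummy_judge = lambda q, a: "FALSE" if "healthcare" in a.lower() else "TRUE"
print(guard_answer("Do you offer gift wrapping?",
                   "Our healthcare plan covers annual checkups.", dummy_judge))
```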

In this example, we see how Selene accurately identifies an irrelevant answer that talks about “healthcare” in response to a question about “gift wrapping.”

We’d love to hear from you

Follow us on X and LinkedIn for more announcements. Join our discussion on Discord.

Try our AI evaluation models — available through our API and the Eval Copilot (beta)
Download our OSS model
Start for free