Frontier AI evaluation models

Get precise judgments on your AI app’s performance. Use the links below to get started with our open-source LLM Judge models.
Selene Models

Explore the right size for your evaluation needs.

Selene 1
The best evaluation model on the market.
- Wide variety of eval tasks
- Suitable for pre-production evals
- Industry-leading accuracy

Selene 1 Mini
A lean version of Selene 1.
- Wide variety of eval tasks
- Perfect for running evals at inference time
- Optimised for speed

Use the Selene models through Hugging Face Transformers.

Example
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to move the tokenized inputs onto
model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"

# Load the evaluation model and its tokenizer from Hugging Face
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "I heard you can evaluate my responses?"  # replace with your eval prompt

# Format the prompt with the model's chat template and tokenize it
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Generate the judgment, then strip the prompt tokens from the output
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
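
The placeholder prompt above is only a smoke test. In practice you would replace it with a full evaluation prompt that contains the response to be judged and your scoring criteria; a sketch of one such prompt is shown below.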

Run evals with our LLM-as-a-Judge

Need to build trust with customers that your generative AI app is reliable? Judge your AI responses with our evaluation models and receive scores and actionable critiques.
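
To get a score and a critique back, the evaluation prompt itself needs to spell out the response to be judged, the criteria, and the output format you want. Below is a minimal sketch: the prompt wording, criteria, question/response pair, and the "Score:" output convention are illustrative assumptions rather than an official Atla prompt template, so adapt them to your own eval task.

Example
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
# Reuse the model and tokenizer loaded in the example above, or load them here
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative evaluation prompt (an assumption for this sketch, not an official template):
# it contains the criteria, the question, the response to judge, and the requested output format.
eval_prompt = """You are an expert evaluator. Assess the response below against the criteria.

Criteria: The response must answer the question accurately and concisely (score 1-5).

Question: What is the capital of France?
Response to evaluate: The capital of France is Paris.

Reply with a short critique, followed by a final line of the form "Score: <1-5>"."""

messages = [{"role": "user", "content": eval_prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Generate the judgment and strip the prompt tokens from the output
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
judgment = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Pull the numeric score out of the judgment text (format requested in the prompt above)
score_line = next((line for line in judgment.splitlines() if line.strip().startswith("Score:")), None)
print(judgment)
print("Parsed score:", score_line)

Asking for the critique first and the score on a final line keeps the numeric result easy to parse from the end of the judgment, while the critique gives you the actionable feedback.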

Building agents? Evaluate them with Atla
- Install the Atla package
- Track your agents
- Understand errors instantly

Book a demo