Frontier AI evaluation models

Get precise judgments on your AI app’s performance. Use the links below to get started with our open-source LLM Judge models.

Selene 1

The best model for evaluation on the market.

Wide variety of eval tasks

Suitable for pre-production evals

Industry-leading accuracy

Use Selene 1

Selene 1 Mini

A lean version of Selene 1

Wide variety of eval tasks

Perfect for running evals at inference time

Optimised for speed

Use Selene 1 Mini

Use the Selene models through Hugging Face Transformers.

Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
# device_map="auto" (requires the accelerate package) places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "I heard you can evaluate my responses?" # replace with your eval prompt

# Wrap the prompt in the model's chat template
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move the inputs to whichever device the model was loaded onto
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Pass input_ids and attention_mask together
generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True)
# Drop the prompt tokens so only the newly generated tokens remain
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
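
The example above uses a placeholder prompt. As a rough sketch of what "your eval prompt" might look like, the helpers below assemble a prompt from a user input, a response, and a criterion, then pull a numeric score out of the judge's reply. The function names `build_eval_prompt` and `parse_score`, the prompt wording, and the 1 to 5 "Score:" convention are all assumptions made for illustration. They are not an official Selene prompt template, so check the model card for the recommended format.

```python
import re
from typing import Optional


def build_eval_prompt(user_input: str, response: str, criteria: str) -> str:
    """Assemble a hypothetical eval prompt (not an official Selene template)."""
    return (
        "You are evaluating an AI assistant's response.\n\n"
        f"Evaluation criteria: {criteria}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        "Give a short critique, then a final line of the form "
        "'Score: <integer from 1 to 5>'."
    )


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last 'Score: N' value (1 to 5) in the judge output, or None."""
    matches = re.findall(r"Score:\s*([1-5])\b", judge_output)
    return int(matches[-1]) if matches else None


# Build a prompt to pass to the model in place of the placeholder prompt
prompt = build_eval_prompt(
    user_input="What is the capital of France?",
    response="The capital of France is Paris.",
    criteria="Is the response factually correct?",
)
```

Under these assumptions, `prompt` would replace the placeholder string in the Transformers example, and `parse_score(response)` would be applied to the decoded output. Because sampling is enabled there, a reply may occasionally omit the score line, which is why `parse_score` returns `None` instead of raising.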