As LLM applications become more prevalent, ensuring reliability in AI outputs is critical. Enter Selene Mini, our state-of-the-art evaluation model, which scores AI outputs against criteria you choose.
To get started with Selene Mini, use our cookbooks for these popular evaluation use cases:
- RAG Hallucination detection: identify and measure hallucinations in your RAG pipeline
- Absolute Scoring: use a structured 1-5 scoring rubric to evaluate AI-generated responses
Below, we break down how you can use these cookbooks step by step.
Selene Mini’s capabilities
Selene Mini is the world’s best small language model-as-a-judge:
- Beats the top small models on eval tasks across 11 benchmarks
- Outperforms GPT-4o on RewardBench, EvalBiasBench, and Auto-J
- Excels in real-world domains, such as finance and medicine
- Customizable - prompt Selene Mini to fit the evaluation criteria for your specific use case
For more details, check out the Selene 1 Mini blog post.
🧑‍🍳 Detecting Hallucinations in Your RAG Pipeline
Retrieval-augmented generation (RAG) pipelines aim to generate responses based on retrieved documents, but hallucinations—when the model fabricates information—can undermine trust and accuracy.
In our cookbook, we check AI responses for hallucination, i.e. 'Is the information provided in the response directly supported by the context given in the related passages?'
Example Use Case: Imagine you’re building a legal research chatbot that retrieves case law and summarizes rulings. You want to ensure the chatbot only generates claims supported by retrieved documents.
Step-by-Step Guide
- Set-up - Load the model, Selene Mini.
- Load the sample dataset. To simulate this use case, we evaluate on the public benchmark RAGTruth, a large-scale corpus of naturally generated hallucinations designed for RAG scenarios.
- Define your hallucination detection evaluation prompt. We use our ‘classification’ prompt template in the cookbook. See our prompt templates for other scoring formats.
 
- Run evals - Run the evals to get structured hallucination scores and critiques.
 
- Analyze the output - If the model hallucinates, Selene Mini highlights unsupported claims and explains why. Use these insights to adjust hyperparameters or prompts, or to experiment with different LLMs.
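The prompt and parsing steps above can be sketched in a few lines of Python. This is a minimal illustration and not the cookbook's code. The template wording, the `**Reasoning:**` / `**Result:**` output format, and the names `HALLUCINATION_PROMPT`, `parse_verdict`, and `check_hallucination` are all assumptions. `judge` stands for any callable that sends a prompt to Selene Mini and returns its raw text.

```python
import re
from typing import Callable, NamedTuple

# Hypothetical template for illustration; the cookbook's own
# 'classification' template may be worded differently.
HALLUCINATION_PROMPT = """You are evaluating a response for hallucination.

Question: {question}

Retrieved passages:
{context}

Response to evaluate:
{response}

Criterion: Is the information provided in the response directly supported by the context given in the related passages?

Answer in exactly this format:
**Reasoning:** <your critique>
**Result:** <Yes or No>"""


class Verdict(NamedTuple):
    supported: bool  # True when the judge answered "Yes"
    critique: str    # the judge's explanation


def parse_verdict(raw: str) -> Verdict:
    """Extract the critique and the Yes/No label from the judge's raw text."""
    match = re.search(
        r"\*\*Reasoning:\*\*\s*(.*?)\s*\*\*Result:\*\*\s*(Yes|No)\b",
        raw,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        # Assumed output format was not followed; surface it rather than guess.
        raise ValueError(f"Unparseable judge output: {raw!r}")
    return Verdict(supported=match.group(2).lower() == "yes",
                   critique=match.group(1))


def check_hallucination(judge: Callable[[str], str], question: str,
                        context: str, response: str) -> Verdict:
    """Fill the template, call the judge, and parse its answer."""
    prompt = HALLUCINATION_PROMPT.format(
        question=question, context=context, response=response)
    return parse_verdict(judge(prompt))
```

Keeping the judge behind a plain callable means the same parsing code works whether Selene Mini runs locally or behind an endpoint, and it can be unit-tested with a stub that returns canned text.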
 
Try it yourself: Follow our RAG Hallucination Cookbook.

👩‍🍳 Absolute Scoring: Evaluating AI Outputs on a 1-5 Scale
Developers often need a consistent way to score AI-generated responses beyond binary “good” or “bad” labels. A 1-5 scoring system allows fine-grained evaluation with explainability.
In our cookbook, we evaluate the completeness of AI responses, i.e. 'Does the response provide a sufficient explanation?'
Example Use Case: Let’s say you run a customer service chatbot. You might need to evaluate responses on completeness, as well as on metrics you define, such as clarity and helpfulness.
Step-by-Step Guide
- Set-up - Load the model, Selene Mini.
- Load the sample dataset. To simulate this use case, we evaluate on the public benchmark FLASK, a dataset with human-annotated scores from 1-5.
- Define your completeness evaluation prompt. We use our ‘absolute scoring with reference’ prompt template in the cookbook. See our prompt templates for other scoring formats.
 
- Run evals - Run the evals to receive a structured numeric score (1-5), along with a critique explaining the reasoning behind the score.
 
- Use the scores for optimization - Identify weaknesses in AI responses, then fine-tune model behavior or retrain based on low-scoring outputs.
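As a rough sketch of the scoring steps above, the snippet below fills a prompt for each row, parses the 1-5 score and critique, and collects the low scorers for review. It is illustrative only. The template text, the output format, the row keys (`instruction`, `response`, `reference`), the default threshold, and the function names are assumptions and do not come from the cookbook. `judge` is again any callable that returns Selene Mini's raw text for a prompt.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical template for illustration; the cookbook's own
# 'absolute scoring with reference' template may be worded differently.
SCORING_PROMPT = """Score the response for completeness on a 1-5 scale.

Instruction: {instruction}

Response to evaluate:
{response}

Reference answer:
{reference}

Criterion: Does the response provide a sufficient explanation?

Answer in exactly this format:
**Reasoning:** <your critique>
**Result:** <integer from 1 to 5>"""


def parse_score(raw: str) -> Tuple[int, str]:
    """Return (score, critique) from the judge's raw text."""
    match = re.search(
        r"\*\*Reasoning:\*\*\s*(.*?)\s*\*\*Result:\*\*\s*([1-5])\b",
        raw,
        flags=re.DOTALL,
    )
    if match is None:
        raise ValueError(f"Unparseable judge output: {raw!r}")
    return int(match.group(2)), match.group(1)


def score_dataset(judge: Callable[[str], str],
                  rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Score every row and attach the score and critique to a copy of it."""
    results = []
    for row in rows:
        score, critique = parse_score(judge(SCORING_PROMPT.format(**row)))
        results.append({**row, "score": score, "critique": critique})
    return results


def low_scoring(results: List[Dict[str, object]],
                threshold: int = 2) -> List[Dict[str, object]]:
    """Rows at or below the (assumed) threshold, for manual review."""
    return [r for r in results if r["score"] <= threshold]
```

Reading the critiques of the low-scoring rows side by side is usually the quickest way to see whether the weakness lies in the prompt, the retrieval step, or the model itself.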
 
Try it yourself: Follow our Absolute Scoring Cookbook.

Suggested Next Steps
We’ve simulated both use cases with sample datasets. If these metrics suit your needs, we suggest adding evals to your production monitoring. Load your own RAG pipeline or dataset, tweak the prompt to your needs, and monitor the hallucination rate and completeness of your application in production.
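One possible shape for the monitoring side is a rolling hallucination rate with an alert threshold. The sketch below is an assumption about how you might wire this up, not part of the cookbooks. The class name, window size, and threshold are hypothetical and should be tuned to your own traffic.

```python
from collections import deque


class HallucinationMonitor:
    """Tracks the share of unsupported responses over a rolling window."""

    def __init__(self, window: int = 100, alert_rate: float = 0.1) -> None:
        # Both defaults are illustrative values, not recommendations.
        self._recent = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, supported: bool) -> None:
        """Store one judge verdict; True means the response was supported."""
        self._recent.append(not supported)

    @property
    def rate(self) -> float:
        """Fraction of hallucinated responses in the current window."""
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def should_alert(self) -> bool:
        """Alert only once the window is full, to avoid noisy early alarms."""
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and self.rate > self.alert_rate
```

Each production response would be judged as in the RAG walkthrough, with the Yes/No verdict passed to `record`. The same pattern applies to completeness if you track the share of scores below a chosen cutoff.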
For questions or feedback, join our community on Discord!




