Frequently Asked Questions (And Answers) About AI Evals
This document curates the most common questions Shreya and I received while teaching 700+ engineers & PMs AI Evals. Warning: These are sharp opinions about what works in most cases. They are not universal truths. Use your judgment.
👉 If you want to learn more about AI Evals, check out our AI Evals course. Here is a 35% discount code for readers. 👈
Getting Started & Fundamentals
Q: What are LLM Evals?
If you are completely new to product-specific LLM evals (not foundation model benchmarks), see these posts: part 1, part 2 and part 3. Otherwise, keep reading.
Your AI Product Needs Evals (Evaluation Systems)
Contents:
- Motivation
- Iterating Quickly == Success
- Case Study: Lucy, A Real Estate AI Assistant
- The Types Of Evaluation
- Level 1: Unit Tests
- Level 2: Human & Model Eval
- Level 3: A/B Testing
- Evaluating RAG
- Eval Systems Unlock Superpowers For Free
- Fine-Tuning
- Data Synthesis & Curation
- Debugging
Creating a LLM-as-a-Judge That Drives Business Results
Contents:
- The Problem: AI Teams Are Drowning in Data
- Step 1: Find The Principal Domain Expert
- Step 2: Create a Dataset
- Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques
- Step 4: Fix Errors
- Step 5: Build Your LLM as A Judge, Iteratively
- Step 6: Perform Error Analysis
- Step 7: Create More Specialized LLM Judges (if needed)
- Recap of Critique Shadowing
- Resources
A Field Guide to Rapidly Improving AI Products
Contents:
- How error analysis consistently reveals the highest-ROI improvements
- Why a simple data viewer is your most important AI investment
- How to empower domain experts (not just engineers) to improve your AI
- Why synthetic data is more effective than you think
- How to maintain trust in your evaluation system
- Why your AI roadmap should count experiments, not features
Q: What is a trace?
A trace is the complete record of all actions, messages, tool calls, and data retrievals from a single initial user query through to the final response. It includes every step across all agents, tools, and system components in a session: multiple user messages, assistant responses, retrieved documents, and intermediate tool interactions.
Note on terminology: Different observability vendors use varying definitions of traces and spans. Alex Strick van Linschoten’s analysis highlights these differences.
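To make this concrete, here is a minimal sketch of one way a single-turn trace might be represented. The field names are illustrative assumptions, not any vendor’s schema:

```python
# A minimal, illustrative trace record; real observability platforms use their
# own schemas, so treat these field names as assumptions.
trace = {
    "trace_id": "tr_123",
    "user_query": "Find 3-bedroom homes in Berkeley under $1M",
    "steps": [
        {"type": "tool_call", "name": "search_listings", "args": {"beds": 3, "max_price": 1_000_000}},
        {"type": "tool_result", "name": "search_listings", "output": ["listing_42", "listing_91"]},
        {"type": "llm_response", "content": "I found 2 matching homes..."},
    ],
    "metadata": {"model": "gpt-4o", "latency_ms": 2400, "user_feedback": None},
}
```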
Q: What’s a minimum viable evaluation setup?
Start with error analysis, not infrastructure. Spend 30 minutes manually reviewing 20-50 LLM outputs whenever you make significant changes. Use one domain expert who understands your users as your quality decision maker (a “benevolent dictator”).
If possible, use notebooks to help you review traces and analyze data. In our opinion, this is the single most effective tool for evals because you can write arbitrary code, visualize data, and iterate quickly. You can even build your own custom annotation interface right inside notebooks, as shown in this video.
Q: How much of my development budget should I allocate to evals?
It’s important to recognize that evaluation is part of the development process rather than a distinct line item, similar to how debugging is part of software development.
You should always be doing error analysis. When you discover issues through error analysis, many will be straightforward bugs you’ll fix immediately. These fixes don’t require separate evaluation infrastructure as they’re just part of development.
The decision to build automated evaluators comes down to cost-benefit analysis. If you can catch an error with a simple assertion or regex check, the cost is minimal and probably worth it. But if you need to align an LLM-as-judge evaluator, consider whether the failure mode warrants that investment.
In the projects we’ve worked on, we’ve spent 60-80% of our development time on error analysis and evaluation. Expect most of your effort to go toward understanding failures (i.e. looking at data) rather than building automated checks.
Be wary of optimizing for high eval pass rates. If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate a more meaningful evaluation that’s actually stress-testing your application. Focus on evals that help you catch real issues, not ones that make your metrics look good.
Q: Will today’s evaluation methods still be relevant in 5-10 years given how fast AI is changing?
Yes. Even with perfect models, you still need to verify they’re solving the right problem. Systematic error analysis, domain-specific testing, and monitoring will still be important.
Today’s prompt engineering tricks might become obsolete, but you’ll still need to understand failure modes. Additionally, an LLM cannot read your mind, and research shows that people need to observe the LLM’s behavior in order to properly externalize their requirements.
For deeper perspective on this debate, see these two viewpoints: “The model is the product” versus “The model is NOT the product”.
Error Analysis & Data Collection
Q: Why is "error analysis" so important in LLM evals, and how is it performed?
Error analysis is the most important activity in evals. Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data. The process involves:
1. Creating a Dataset
Gathering representative traces of user interactions with the LLM. If you do not have any data, you can generate synthetic data to get started.
2. Open Coding
Human annotator(s) (ideally a benevolent dictator) review and write open-ended notes about traces, noting any issues. This process is akin to “journaling” and is adapted from qualitative research methodologies. When beginning, it is recommended to focus on noting the first failure observed in a trace, as upstream errors can cause downstream issues, though you can also tag all independent failures if feasible. A domain expert should be performing this step.
3. Axial Coding
Categorize the open-ended notes into a “failure taxonomy.” In other words, group similar failures into distinct categories. This is the most important step. At the end, count the number of failures in each category. You can use an LLM to help with this step.
4. Iterative Refinement
Keep iterating on more traces until you reach theoretical saturation, meaning new traces do not seem to reveal new failure modes or information to you. As a rule of thumb, you should aim to review at least 100 traces.
You should frequently revisit this process. There are advanced ways to sample data more efficiently, like clustering, sorting by user feedback, and sorting by high probability failure patterns. Over time, you’ll develop a “nose” for where to look for failures in your data.
Do not skip error analysis. It ensures that the evaluation metrics you develop are supported by real application behaviors instead of counter-productive generic metrics (which most platforms nudge you to use). For examples of how error analysis can be helpful, see this video, or this blog post.
Here is a visualization of the error analysis process by one of our students, Pawel Huryn - including how it fits into the overall evaluation process:
Q: How do I surface problematic traces for review beyond user feedback?
While user feedback is a good way to narrow in on problematic traces, other methods are also useful. Here are three complementary approaches:
Start with random sampling
The simplest approach is reviewing a random sample of traces. If you find few issues, escalate to stress testing: create queries that deliberately test your prompt constraints to see if the AI follows your rules.
Use evals for initial screening
Use existing evals to find problematic traces and potential issues. Once you’ve identified these, you can proceed with the typical evaluation process starting with error analysis.
Leverage efficient sampling strategies
For more sophisticated trace discovery, use outlier detection, metric-based sorting, and stratified sampling to find interesting traces. Generic metrics can serve as exploration signals to identify traces worth reviewing, even if they don’t directly measure quality.
Q: How often should I re-run error analysis on my production system?
Re-run error analysis when making significant changes: new features, prompt updates, model switches, or major bug fixes. A useful heuristic is to set a goal of reviewing at least 100 fresh traces each review cycle. Typical review cycles we’ve seen range from 2-4 weeks. See this FAQ on how to sample traces effectively.
Between major analyses, review 10-20 traces weekly, focusing on outliers: unusually long conversations, sessions with multiple retries, or traces flagged by automated monitoring. Adjust frequency based on system stability and usage growth. New systems need weekly analysis until failure patterns stabilize. Mature systems might need only monthly analysis unless usage patterns change. Always analyze after incidents, user complaint spikes, or metric drift. Scaling usage introduces new edge cases.
Q: What is the best approach for generating synthetic data?
A common mistake is prompting an LLM to "give me test queries" without structure, resulting in generic, repetitive outputs. A structured approach using dimensions produces far better synthetic data for testing LLM applications.
Start by defining dimensions: categories that describe different aspects of user queries. Each dimension captures one type of variation in user behavior. For example:
- For a recipe app, dimensions might include Dietary Restriction (vegan, gluten-free, none), Cuisine Type (Italian, Asian, comfort food), and Query Complexity (simple request, multi-step, edge case).
- For a customer support bot, dimensions could be Issue Type (billing, technical, general), Customer Mood (frustrated, neutral, happy), and Prior Context (new issue, follow-up, resolved).
Start with failure hypotheses. If you lack intuition about failure modes, use your application extensively or recruit friends to use it. Then choose dimensions targeting those likely failures.
Create tuples manually first: Write 20 tuples by hand—specific combinations selecting one value from each dimension. Example: (Vegan, Italian, Multi-step). This manual work helps you understand your problem space.
Scale with two-step generation:
- Generate structured tuples: Have the LLM create more combinations like (Gluten-free, Asian, Simple)
- Convert tuples to queries: In a separate prompt, transform each tuple into natural language
This separation avoids repetitive phrasing. The (Vegan, Italian, Multi-step) tuple becomes: "I need a dairy-free lasagna recipe that I can prep the day before."
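Here is a minimal sketch of this two-step generation for the recipe-app dimensions above. It assumes the OpenAI Python SDK purely for illustration; any LLM client works, and the prompts are starting points to iterate on, not a definitive implementation:

```python
import itertools
import json
import random

from openai import OpenAI

client = OpenAI()

DIMENSIONS = {
    "dietary_restriction": ["vegan", "gluten-free", "none"],
    "cuisine_type": ["Italian", "Asian", "comfort food"],
    "query_complexity": ["simple request", "multi-step", "edge case"],
}

def call_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Step 1: generate structured tuples (here via cross product, sampled down).
tuples = list(itertools.product(*DIMENSIONS.values()))
selected = random.sample(tuples, k=20)

# Step 2: convert each tuple into a natural-language query in a separate prompt.
queries = []
for combo in selected:
    spec = dict(zip(DIMENSIONS, combo))
    queries.append(call_llm(
        "Write one realistic user query for a recipe app that fits this spec. "
        "Do not mention the spec explicitly.\n" + json.dumps(spec)
    ))
```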
Generation approaches
You can generate tuples two ways:
Cross product then filter: Generate all dimension combinations, then filter with an LLM. Guarantees coverage including edge cases. Use when most combinations are valid.
Direct LLM generation: Ask the LLM to generate tuples directly. More realistic but tends toward generic outputs and misses rare scenarios. Use when many dimension combinations are invalid.
Fix obvious problems first: Don’t generate synthetic data for issues you can fix immediately. If your prompt doesn’t mention dietary restrictions, fix the prompt rather than generating specialized test queries.
After iterating on your tuples and prompts, run these synthetic queries through your actual system to capture full traces. Sample 100 traces for error analysis. This number provides enough traces to manually review and identify failure patterns without being overwhelming.
Q: Are there scenarios where synthetic data may not be reliable?
Yes: synthetic data can mislead or mask issues. For guidance on generating synthetic data when appropriate, see What is the best approach for generating synthetic data?
Common scenarios where synthetic data fails:
Complex domain-specific content: LLMs often miss the structure, nuance, or quirks of specialized documents (e.g., legal filings, medical records, technical forms). Without real examples, critical edge cases are missed.
Low-resource languages or dialects: For low-resource languages or dialects, LLM-generated samples are often unrealistic. Evaluations based on them won’t reflect actual performance.
When validation is impossible: If you can’t verify synthetic sample realism (due to domain complexity or lack of ground truth), real data is important for accurate evaluation.
High-stakes domains: In high-stakes domains (medicine, law, emergency response), synthetic data often lacks subtlety and edge cases. Errors here have serious consequences, and manual validation is difficult.
Underrepresented user groups: For underrepresented user groups, LLMs may misrepresent context, values, or challenges. Synthetic data can reinforce biases in the training data of the LLM.
Q: How do I approach evaluation when my system handles diverse user queries?
Complex applications often support vastly different query patterns—from “What’s the return policy?” to “Compare pricing trends across regions for products matching these criteria.” Each query type exercises different system capabilities, leading to confusion on how to design eval criteria.
Error Analysis is all you need. Your evaluation strategy should emerge from observed failure patterns (i.e., error analysis), not predetermined query classifications. Rather than creating a massive evaluation matrix covering every query type you can imagine, let your system’s actual behavior guide where you invest evaluation effort.
During error analysis, you’ll likely discover that certain query categories share failure patterns. For instance, all queries requiring temporal reasoning might struggle regardless of whether they’re simple lookups or complex aggregations. Similarly, queries that need to combine information from multiple sources might fail in consistent ways. These patterns discovered through error analysis should drive your evaluation priorities. It could be that query category is a fine way to group failures, but you don’t know that until you’ve analyzed your data.
To see an example of basic error analysis in action, see this video.
Q: How can I efficiently sample production traces for review?
It can be cumbersome to review traces randomly, especially when most traces don’t have an error. These sampling strategies help you find traces more likely to reveal problems:
- Outlier detection: Sort by any metric (response length, latency, tool calls) and review extremes.
- User feedback signals: Prioritize traces with negative feedback, support tickets, or escalations.
- Metric-based sorting: Generic metrics can serve as exploration signals to find interesting traces. Review both high and low scores and treat them as exploration clues. Based on what you learn, you can build custom evaluators for the failure modes you find.
- Stratified sampling: Group traces by key dimensions (user type, feature, query category) and sample from each group.
- Embedding clustering: Generate embeddings of queries and cluster them to reveal natural groupings. Sample proportionally from each cluster, but oversample small clusters for edge cases. There’s no right answer for clustering—it’s an exploration technique to surface patterns you might miss manually.
As you get more sophisticated with how you sample, you can incorporate these tactics into the design of your annotation tools.
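As a concrete illustration of the outlier and stratified strategies above, here is a minimal sketch, assuming your traces are exported into a pandas DataFrame (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical export: trace_id, user_type, latency_ms, n_tool_calls, response, ...
traces = pd.read_json("traces.jsonl", lines=True)

# Outlier detection: review the extremes of cheap, already-logged metrics.
extremes = pd.concat([
    traces.nlargest(10, "latency_ms"),
    traces.nlargest(10, "n_tool_calls"),
])

# Stratified sampling: a few traces from each user type so small segments aren't missed.
strata = traces.groupby("user_type", group_keys=False).apply(
    lambda g: g.sample(min(len(g), 5), random_state=0)
)

review_queue = pd.concat([extremes, strata]).drop_duplicates("trace_id")
```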
Evaluation Design & Methodology
Q: Why do you recommend binary (pass/fail) evaluations instead of 1-5 ratings (Likert scales)?
Engineers often believe that Likert scales (1-5 ratings) provide more information than binary evaluations, allowing them to track gradual improvements. However, this added complexity often creates more problems than it solves in practice.
Binary evaluations force clearer thinking and more consistent labeling. Likert scales introduce significant challenges: the difference between adjacent points (like 3 vs 4) is subjective and inconsistent across annotators, detecting statistical differences requires larger sample sizes, and annotators often default to middle values to avoid making hard decisions.
Having binary options forces people to make a decision rather than hiding uncertainty in middle values. Binary decisions are also faster to make during error analysis - you don’t waste time debating whether something is a 3 or 4.
For tracking gradual improvements, consider measuring specific sub-components with their own binary checks rather than using a scale. For example, instead of rating factual accuracy 1-5, you could track “4 out of 5 expected facts included” as separate binary checks. This preserves the ability to measure progress while maintaining clear, objective criteria.
Start with binary labels to understand what ‘bad’ looks like. Numeric labels are advanced and usually not necessary.
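For instance, the “expected facts” idea above can be a handful of binary checks rather than one 1-5 score. A minimal sketch; the facts and the substring check are placeholders, and each check could just as easily be a scoped LLM-as-judge:

```python
EXPECTED_FACTS = [
    "30-day return window",
    "free return shipping",
    "refund to original payment method",
]

def fact_checks(response: str) -> dict[str, bool]:
    # Naive substring matching as a placeholder for per-fact pass/fail checks.
    return {fact: fact.lower() in response.lower() for fact in EXPECTED_FACTS}

results = fact_checks("Returns are accepted within a 30-day return window.")
print(f"{sum(results.values())} of {len(results)} expected facts included")
```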
Q: Should I practice eval-driven development?
Generally no. Eval-driven development (writing evaluators before implementing features) sounds appealing but creates more problems than it solves. Unlike traditional software where failure modes are predictable, LLMs have infinite surface area for potential failures. You can’t anticipate what will break.
A better approach is to start with error analysis. Write evaluators for errors you discover, not errors you imagine. This avoids getting blocked on what to evaluate and prevents wasted effort on metrics that have no impact on actual system quality.
Exception: Eval-driven development may work for specific constraints where you know exactly what success looks like. If you’re adding a rule like “never mention competitors,” writing that evaluator early may be acceptable.
Most importantly, always do a cost-benefit analysis before implementing an eval. Ask whether the failure mode justifies the investment. Error analysis reveals which failures actually matter for your users.
Q: Should I build automated evaluators for every failure mode I find?
Focus automated evaluators on failures that persist after fixing your prompts. Many teams discover their LLM doesn’t meet preferences they never actually specified - like wanting short responses, specific formatting, or step-by-step reasoning. Fix these obvious gaps first before building complex evaluation infrastructure.
Consider the cost hierarchy of different evaluator types. Simple assertions and reference-based checks (comparing against known correct answers) are cheap to build and maintain. LLM-as-Judge evaluators require 100+ labeled examples, ongoing weekly maintenance, and coordination between developers, PMs, and domain experts. This cost difference should shape your evaluation strategy.
Only build expensive evaluators for problems you’ll iterate on repeatedly. Since LLM-as-Judge comes with significant overhead, save it for persistent generalization failures - not issues you can fix trivially. Start with cheap code-based checks where possible: regex patterns, structural validation, or execution tests. Reserve complex evaluation for subjective qualities that can’t be captured by simple rules.
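A minimal sketch of what such cheap, code-based checks can look like (the competitor names and required JSON keys are hypothetical placeholders):

```python
import json
import re

COMPETITORS = re.compile(r"\b(AcmeCRM|RivalSoft)\b", re.IGNORECASE)  # hypothetical names

def mentions_competitor(text: str) -> bool:
    """Regex check for a 'never mention competitors' constraint."""
    return bool(COMPETITORS.search(text))

def is_valid_structured_reply(text: str, required_keys=("answer", "sources")) -> bool:
    """Structural validation: output must be JSON containing the expected keys."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(k in payload for k in required_keys)
```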
Q: Should I use "ready-to-use" evaluation metrics?
No. Generic evaluations waste time and create false confidence. (Unless you’re using them for exploration).
One instructor noted:
“All you get from using these prefab evals is you don’t know what they actually do and in the best case they waste your time and in the worst case they create an illusion of confidence that is unjustified.”1
Generic evaluation metrics are everywhere. Eval libraries contain scores like helpfulness, coherence, quality, etc. promising easy evaluation. These metrics measure abstract qualities that may not matter for your use case. Good scores on them don’t mean your system works.
Instead, conduct error analysis to understand failures. Define binary failure modes based on real problems. Create custom evaluators for those failures and validate them against human judgment. Essentially, the entire evals process.
Experienced practitioners may still use these metrics, just not how you’d expect. As Picasso said: “Learn the rules like a pro, so you can break them like an artist.” Once you understand why generic metrics fail as evaluations, you can repurpose them as exploration tools to find interesting traces (explained in the next FAQ).
Q: Are similarity metrics (BERTScore, ROUGE, etc.) useful for evaluating LLM outputs?
Generic metrics like BERTScore, ROUGE, cosine similarity, etc. are not useful for evaluating LLM outputs in most AI applications. Instead, we recommend using error analysis to identify metrics specific to your application’s behavior, and designing binary pass/fail evals (using LLM-as-judge) or code-based assertions.
As an example, consider a real estate CRM assistant. Suggesting showings that aren’t available (which can be tested with an assertion) or confusing client personas (which can be tested with an LLM-as-judge) is problematic. Generic metrics like similarity or verbosity won’t catch this. A relevant quote from the course:
“The abuse of generic metrics is endemic. Many eval vendors promote off the shelf metrics, which ensnare engineers into superfluous tasks.”
Similarity metrics aren’t always useless. They have utility in domains like search and recommendation (and therefore can be useful for optimizing and debugging retrieval for RAG). For example, cosine similarity between embeddings can measure semantic closeness in retrieval systems, and average pairwise similarity can assess output diversity (where lower similarity indicates higher diversity).
Q: Can I use the same model for both the main task and evaluation?
For LLM-as-Judge selection, using the same model is usually fine because the judge is doing a different task than your main LLM pipeline. The judges we recommend building perform scoped binary classification tasks. Focus on achieving a high True Positive Rate (TPR) and True Negative Rate (TNR) with your judge on a held-out labeled test set rather than avoiding the same model family. You can use these metrics on the test set to understand how well your judge is doing.
When selecting judge models, start with the most capable models available to establish strong alignment with human judgments. You can optimize for cost later once you’ve established reliable evaluation criteria. We do not recommend using the same model for open ended preferences or response quality (but we don’t recommend building judges this way in the first place!).
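A minimal sketch of that TPR/TNR measurement, assuming parallel lists of human and judge pass/fail labels on a held-out set:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    tp = sum(h and j for h, j in zip(human, judge))          # both say pass
    tn = sum(not h and not j for h, j in zip(human, judge))  # both say fail
    fp = sum(not h and j for h, j in zip(human, judge))      # judge passes a true fail
    fn = sum(h and not j for h, j in zip(human, judge))      # judge fails a true pass
    return {
        "TPR": tp / (tp + fn) if (tp + fn) else float("nan"),
        "TNR": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```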
👉 If you want to learn more about AI Evals, check out our AI Evals course. Here is a 35% discount code for readers. 👈
Human Annotation & Process
Q: How many people should annotate my LLM outputs?
For most small to medium-sized companies, appointing a single domain expert as a “benevolent dictator” is the most effective approach. This person—whether it’s a psychologist for a mental health chatbot, a lawyer for legal document analysis, or a customer service director for support automation—becomes the definitive voice on quality standards.
A single expert eliminates annotation conflicts and prevents the paralysis that comes from “too many cooks in the kitchen”. The benevolent dictator can incorporate input and feedback from others, but they drive the process. If you feel like you need five subject matter experts to judge a single interaction, it’s a sign your product scope might be too broad.
However, larger organizations or those operating across multiple domains (like a multinational company with different cultural contexts) may need multiple annotators. When you do use multiple people, you’ll need to measure their agreement using metrics like Cohen’s Kappa, which accounts for agreement beyond chance. However, use your judgment. Even in larger companies, a single expert is often enough.
Start with a benevolent dictator whenever feasible. Only add complexity when your domain demands it.
Q: Should product managers and engineers collaborate on error analysis? How?
At the outset, collaborate to establish shared context. Engineers catch technical issues like retrieval failures and tool errors. PMs identify product failures like unmet user expectations, confusing responses, or missing features users expect.
As time goes on you should lean towards a benevolent dictator for error analysis: a domain expert or PM who understands user needs. Empower domain experts to evaluate actual outcomes rather than technical implementation. Ask “Has an appointment been made?” not “Did the tool call succeed?” The best way to empower the domain expert is to give them custom annotation tools that display system outcomes alongside traces. Show the confirmation, generated email, or database update that validates goal completion. Keep all context on one screen so non-technical reviewers focus on results.
Q: Should I outsource annotation & labeling to a third party?
Outsourcing error analysis is usually a big mistake (with some exceptions). The core of evaluation is building the product intuition that only comes from systematically analyzing your system’s failures. You should be extremely skeptical of this process being delegated.
The Dangers of Outsourcing
When you outsource annotation, you often break the feedback loop between observing a failure and understanding how to improve the product. Problems with outsourcing include:
- Superficial Labeling: Even well-defined metrics require nuanced judgment that external teams lack. A critical misstep in error analysis is excluding domain experts from the labeling process. Outsourcing this task to those without domain expertise, like general developers or IT staff, often leads to superficial or incorrect labeling.
- Loss of Unspoken Knowledge: A principal domain expert possesses tacit knowledge and user understanding that cannot be fully captured in a rubric. Involving these experts helps uncover their preferences and expectations, which they might not be able to fully articulate upfront.
- Annotation Conflicts and Misalignment: Without a shared context, external annotators can create more disagreement than they resolve. Achieving alignment is a challenge even for internal teams, which means you will spend even more time on this process.
The Recommended Approach: Build Internal Capability
Instead of outsourcing, focus on building an efficient internal evaluation process.
1. Appoint a “Benevolent Dictator”. For most teams, the most effective strategy is to appoint a single, internal domain expert as the final decision-maker on quality. This individual sets the standard, ensures consistency, and develops a sense of ownership.
2. Use a collaborative workflow for multiple annotators. If multiple annotators are necessary, follow a structured process to ensure alignment:
- Draft an initial rubric with clear Pass/Fail definitions and examples.
- Have each annotator label a shared set of traces independently to surface differences in interpretation.
- Measure Inter-Annotator Agreement (IAA) using a chance-corrected metric like Cohen’s Kappa.
- Facilitate alignment sessions to discuss disagreements and refine the rubric.
- Iterate on this process until agreement is consistently high.
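A minimal sketch of the IAA measurement step, using scikit-learn’s implementation of Cohen’s Kappa on two annotators’ pass/fail labels for the same shared traces (the labels here are toy values):

```python
from sklearn.metrics import cohen_kappa_score

# 1 = pass, 0 = fail, on the same shared set of traces.
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # review disagreements and refine the rubric regardless
```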
How to Handle Capacity Constraints
Building internal capacity does not mean you have to label every trace. Use these strategies to manage the workload:
- Smart Sampling: Review a small, representative sample of traces thoroughly. It is more effective to analyze 100 diverse traces to find patterns than to superficially label thousands.
- The “Think-Aloud” Protocol: To make the most of limited expert time, use this technique from usability testing. Ask an expert to verbalize their thought process while reviewing a handful of traces. This method can uncover deep insights in a single one-hour session.
- Build Lightweight Custom Tools: Build custom annotation tools to streamline the review process, increasing throughput.
Exceptions for External Help
While outsourcing the core error analysis process is not recommended, there are some scenarios where external help is appropriate:
- Purely Mechanical Tasks: For highly objective, unambiguous tasks like identifying a phone number or validating an email address, external annotators can be used after a rigorous internal process has defined the rubric.
- Tasks Without Product Context: Well-defined tasks that don’t require understanding your product’s specific requirements can be outsourced. Translation is a good example: it requires linguistic expertise but not deep product knowledge.
- Engaging Subject Matter Experts: Hiring external SMEs to act as your internal domain experts is not outsourcing; it is bringing the necessary expertise into your evaluation process. For example, AnkiHub hired 4th-year medical students to evaluate their RAG systems for medical content rather than outsourcing to generic annotators.
Q: What parts of evals can be automated with LLMs?
LLMs can speed up parts of your eval workflow, but they can’t replace human judgment where your expertise is essential. For example, if you let an LLM handle all of error analysis (i.e., reviewing and annotating traces), you might overlook failure cases that matter for your product. Suppose users keep mentioning “lag” in feedback, but the LLM lumps these under generic “performance issues” instead of creating a “latency” category. You’d miss a recurring complaint about slow response times and fail to prioritize a fix.
That said, LLMs are valuable tools for accelerating certain parts of the evaluation workflow when used with oversight.
Here are some areas where LLMs can help:
- First-pass axial coding: After you’ve open coded 30–50 traces yourself, use an LLM to organize your raw failure notes into proposed groupings. This helps you quickly spot patterns, but always review and refine the clusters yourself. Note: If you aren’t familiar with axial and open coding, see this FAQ.
- Mapping annotations to failure modes: Once you’ve defined failure categories, you can ask an LLM to suggest which categories apply to each new trace (e.g., “Given this annotation: [open_annotation] and these failure modes: [list_of_failure_modes], which apply?”).
- Suggesting prompt improvements: When you notice recurring problems, have the LLM propose concrete changes to your prompts. Review these suggestions before adopting any changes.
- Analyzing annotation data: Use LLMs or AI-powered notebooks to find patterns in your labels, such as “reports of lag increase 3x during peak usage hours” or “slow response times are mostly reported from users on mobile devices.”
However, you shouldn’t outsource these activities to an LLM:
- Initial open coding: Always read through the raw traces yourself at the start. This is how you discover new types of failures, understand user pain points, and build intuition about your data. Never skip this or delegate it.
- Validating failure taxonomies: LLM-generated groupings need your review. For example, an LLM might group both “app crashes after login” and “login takes too long” under a single “login issues” category, even though one is a stability problem and the other is a performance problem. Without your intervention, you’d miss that these issues require different fixes.
- Ground truth labeling: For any data used for testing/validating LLM-as-Judge evaluators, hand-validate each label. LLMs can make mistakes that lead to unreliable benchmarks.
- Root cause analysis: LLMs may point out obvious issues, but only human review will catch patterns like errors that occur in specific workflows or edge cases—such as bugs that happen only when users paste data from Excel.
In conclusion, start by examining data manually to understand what’s actually going wrong. Use LLMs to scale what you’ve learned, not to avoid looking at data.
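As one illustration of the “mapping annotations to failure modes” prompt mentioned above, here is a minimal sketch; the failure modes and prompt wording are placeholders to adapt to your own taxonomy:

```python
import json

FAILURE_MODES = ["latency complaint", "wrong persona/tone", "hallucinated availability"]

def map_annotation_prompt(open_annotation: str) -> str:
    return (
        "Given this annotator note about a trace:\n"
        f"{open_annotation}\n\n"
        "And this list of known failure modes:\n"
        f"{json.dumps(FAILURE_MODES)}\n\n"
        "Return a JSON list of the failure modes that apply (possibly empty). "
        "Do not invent new categories."
    )

# Feed map_annotation_prompt(note) to your LLM client, then spot-check the outputs yourself.
```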
Q: Should I stop writing prompts manually in favor of automated tools?
Automating prompt engineering can be tempting, but you should be skeptical of tools that promise to optimize prompts for you, especially in early stages of development. When you write a prompt, you are forced to clarify your assumptions and externalize your requirements. Good writing is good thinking 2. If you delegate this task to an automated tool too early, you risk never fully understanding your own requirements or the model’s failure modes.
This is because automated prompt optimization typically hill-climbs a predefined evaluation metric. It can refine a prompt to perform better on known failures, but it cannot discover new ones. Discovering new errors requires error analysis. Furthermore, research shows that evaluation criteria tend to shift after reviewing a model’s outputs, a phenomenon known as “criteria drift” 3. This means that evaluation is an iterative, human-driven sensemaking process, not a static target that can be set once and handed off to an optimizer.
A pragmatic approach is to use LLMs to improve your prompt based on open coding (open-ended notes about traces). This way, you maintain a human in the loop who is looking at the data and externalizing their requirements. Once you have a high-quality set of evals, prompt optimization can be effective for that last mile of performance.
Tools & Infrastructure
Q: Should I build a custom annotation tool or use something off-the-shelf?
Build a custom annotation tool. This is the single most impactful investment you can make for your AI evaluation workflow. With AI-assisted development tools like Cursor or Lovable, you can build a tailored interface in hours. I often find that teams with custom annotation tools iterate ~10x faster.
Custom tools excel because:
- They show all your context from multiple systems in one place
- They can render your data in a product specific way (images, widgets, markdown, buttons, etc.)
- They’re designed for your specific workflow (custom filters, sorting, progress bars, etc.)
Off-the-shelf tools may be justified when you need to coordinate dozens of distributed annotators with enterprise access controls. Even then, many teams find the configuration overhead and limitations aren’t worth it.
Isaac’s Anki flashcard annotation app shows the power of custom tools—handling 400+ results per query with keyboard navigation and domain-specific evaluation criteria that would be nearly impossible to configure in a generic tool.
Q: What makes a good custom interface for reviewing LLM outputs?
Great interfaces make human review fast, clear, and motivating. We recommend building your own annotation tool customized to your domain. The following features are possible enhancements we’ve seen work well, but you don’t need all of them. The screenshots shown are illustrative examples to clarify concepts. In practice, I rarely implement all these features in a single app. It’s ultimately a judgment call based on your specific needs and constraints.
1. Render Traces Intelligently, Not Generically:
Present the trace in a way that’s intuitive for the domain. If you’re evaluating generated emails, render them to look like emails. If the output is code, use syntax highlighting. Allow the reviewer to see the full trace (user input, tool calls, and LLM reasoning), but keep less important details in collapsed sections that can be expanded. Here is an example of a custom annotation tool for reviewing real estate assistant emails:
2. Prioritize labeling traces you think might be problematic:
Surface traces flagged by guardrails, CI failures, or automated evaluators for review. Provide buttons to take actions like adding to datasets, filing bugs, or re-running pipeline tests. Display relevant context (pipeline version, eval scores, reviewer info) directly in the interface to minimize context switching. Below is an illustration of these ideas:
General Principle: Keep it minimal
Keep your annotation interface minimal. Only incorporate these ideas if they provide a benefit that outweighs the additional complexity and maintenance overhead.
Q: What gaps in eval tooling should I be prepared to fill myself?
Most eval tools handle the basics well: logging complete traces, tracking metrics, prompt playgrounds, and annotation queues. These are table stakes. Here are four areas where you’ll likely need to supplement existing tools.
Watch for vendors addressing these gaps: it’s a strong signal they understand practitioner needs.
1. Error Analysis and Pattern Discovery
After reviewing traces where your AI fails, can your tooling automatically cluster similar issues? For instance, if multiple traces show the assistant using casual language for luxury clients, you need something that recognizes this broader “persona-tone mismatch” pattern. We recommend building capabilities that use AI to suggest groupings, rewrite your observations into clearer failure taxonomies, help find similar cases through semantic search, etc.
2. AI-Powered Assistance Throughout the Workflow
The most effective workflows use AI to accelerate every stage of evaluation. During error analysis, you want an LLM helping categorize your open-ended observations into coherent failure modes. For example, you might annotate several traces with notes like “wrong tone for investor,” “too casual for luxury buyer,” etc. Your tooling should recognize these as the same underlying pattern and suggest a unified “persona-tone mismatch” category.
You’ll also want AI assistance in proposing fixes. After identifying 20 cases where your assistant omits pet policies from property summaries, can your workflow analyze these failures and suggest specific prompt modifications? Can it draft refinements to your SQL generation instructions when it notices patterns of missing WHERE clauses?
Additionally, good workflows help you conduct data analysis of your annotations and traces. I like using notebooks with AI in-the-loop like Julius, Hex, or SolveIt. These help me discover insights like “location ambiguity errors spike 3x when users mention neighborhood names” or “tone mismatches occur 80% more often in email generation than other modalities.”
3. Custom Evaluators Over Generic Metrics
Be prepared to build most of your evaluators from scratch. Generic metrics like “hallucination score” or “helpfulness rating” rarely capture what actually matters for your application—like proposing unavailable showing times or omitting budget constraints from emails. In our experience, successful teams spend most of their effort on application-specific metrics.
4. APIs That Support Custom Annotation Apps
Custom annotation interfaces work best for most teams. This requires observability platforms with thoughtful APIs. I often have to build my own libraries and abstractions just to make bulk data export manageable. You shouldn’t have to paginate through thousands of requests or handle timeout-prone endpoints just to get your data. Look for platforms that provide true bulk export capabilities and, crucially, APIs that let you write annotations back efficiently.
Q: Seriously Hamel. Stop the bullshit. What’s your favorite eval vendor?
Eval tools are in an intensely competitive space. It would be futile to compare their features. If I tried to do such an analysis, it would be invalidated in a week! Vendors I encounter the most organically in my work are: Langsmith, Arize and Braintrust.
When I help clients with vendor selection, the decision weighs heavily towards who can offer the best support, as opposed to purely features. This changes depending on size of client, use case, etc. Yes - it’s mainly the human factor that matters, and dare I say, vibes.
I have no favorite vendor. At the core, their features are very similar - and I often build custom tools on top of them to fit my needs.
My suggestion is to explore the vendors and see which one you like the most.
Production & Deployment
Q: How are evaluations used differently in CI/CD vs. monitoring production?
The most important difference between CI and production evaluation is the data used for testing.
Test datasets for CI are small (in many cases 100+ examples) and purpose-built. Examples cover core features, regression tests for past bugs, and known edge cases. Since CI tests are run frequently, the cost of each test has to be carefully considered (that’s why you carefully curate the dataset). Favor assertions or other deterministic checks over LLM-as-judge evaluators.
For evaluating production traffic, you can sample live traces and run evaluators against them asynchronously. Since you usually lack reference outputs on production data, you might rely more on expensive, reference-free evaluators like LLM-as-judge. Additionally, track confidence intervals for production metrics. If the lower bound crosses your threshold, investigate further.
These two systems are complementary: when production monitoring reveals new failure patterns through error analysis and evals, add representative examples to your CI dataset. This mitigates regressions on new issues.
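For the production confidence intervals mentioned above, a minimal sketch using a Wilson interval on an evaluator’s sampled pass rate (the counts are illustrative):

```python
from math import sqrt

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

low, high = wilson_interval(84, 100)
print(f"pass rate 84% (95% CI {low:.0%}-{high:.0%})")  # investigate if `low` crosses your threshold
```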
Q: What’s the difference between guardrails & evaluators?
Guardrails are inline safety checks that sit directly in the request/response path. They validate inputs or outputs before anything reaches a user, so they typically are:
- Fast and deterministic – typically a few milliseconds of latency budget.
- Simple and explainable – regexes, keyword block-lists, schema or type validators, lightweight classifiers.
- Targeted at clear-cut, high-impact failures – PII leaks, profanity, disallowed instructions, SQL injection, malformed JSON, invalid code syntax, etc.
If a guardrail triggers, the system can redact, refuse, or regenerate the response. Because these checks are user-visible when they fire, false positives are treated as production bugs; teams version guardrail rules, log every trigger, and monitor rates to keep them conservative.
On the other hand, evaluators typically run after a response is produced. Evaluators measure qualities that simple rules cannot, such as factual correctness, completeness, etc. Their verdicts feed dashboards, regression tests, and model-improvement loops, but they do not block the original answer.
Evaluators are usually run asynchronously or in batch to afford heavier computation such as an LLM-as-a-Judge. Inline use of an LLM-as-Judge is possible only when the latency budget and reliability targets allow it. Slow LLM judges might be feasible in a cascade that runs on the minority of borderline cases.
Apply guardrails for immediate protection against objective failures requiring intervention. Use evaluators for monitoring and improving subjective or nuanced criteria. Together, they create layered protection.
Word of caution: do not blindly use off-the-shelf LLM guardrails. Always look at the prompt.
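To make the contrast concrete, here is a minimal sketch of an inline guardrail; the PII pattern is illustrative only, and real guardrails need tested, versioned rules and trigger logging:

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US SSN-style PII, illustrative only

def apply_guardrail(draft_response: str) -> str:
    """Runs synchronously, before the response reaches the user."""
    if SSN_PATTERN.search(draft_response):
        # When a guardrail fires you can redact, refuse, or regenerate; here we redact.
        return SSN_PATTERN.sub("[REDACTED]", draft_response)
    return draft_response
```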
Q: Can my evaluators also be used to automatically fix or correct outputs in production?
Yes, but only a specific subset of them. This is the distinction between an evaluator and a guardrail that we previously discussed. As a reminder:
- Evaluators typically run asynchronously after a response has been generated. They measure quality but don’t interfere with the user’s immediate experience.
- Guardrails run synchronously in the critical path of the request, before the output is shown to the user. Their job is to prevent high-impact failures in real-time.
There are two important criteria for deciding whether to use an evaluator as a guardrail:
Latency & Cost: Can the evaluator run fast enough and cheaply enough in the critical request path without degrading user experience?
Error Rate Trade-offs: What’s the cost-benefit balance between false positives (blocking good outputs and frustrating users) versus false negatives (letting bad outputs reach users and causing harm)? In high-stakes domains like medical advice, false negatives may be more costly than false positives. In creative applications, false positives that block legitimate creativity may be more harmful than occasional quality issues.
Most guardrails are designed to be fast (to avoid harming user experience) and have a very low false positive rate (to avoid blocking valid responses). For this reason, you would almost never use a slow or non-deterministic LLM-as-Judge as a synchronous guardrail. However, these tradeoffs might be different for your use case.
Q: How much time should I spend on model selection?
Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”
Domain-Specific Applications
Q: Is RAG dead?
Question: Should I avoid using RAG for my AI application after reading that “RAG is dead” for coding agents?
Many developers are confused about when and how to use RAG after reading articles claiming “RAG is dead.” Understanding what RAG actually means versus the narrow marketing definitions will help you make better architectural decisions for your AI applications.
The viral article claiming RAG is dead specifically argues against using naive vector database retrieval for autonomous coding agents, not RAG as a whole. This is a crucial distinction that many developers miss due to misleading marketing.
RAG simply means Retrieval-Augmented Generation - using retrieval to provide relevant context that improves your model’s output. The core principle remains essential: your LLM needs the right context to generate accurate answers. The question isn’t whether to use retrieval, but how to retrieve effectively.
For coding applications, naive vector similarity search often fails because code relationships are complex and contextual. Instead of abandoning retrieval entirely, modern coding assistants like Claude Code still use retrieval; they just employ agentic search instead of relying solely on vector databases, similar to how human developers work.
You have multiple retrieval strategies available, ranging from simple keyword matching to embedding similarity to LLM-powered relevance filtering. The optimal approach depends on your specific use case, data characteristics, and performance requirements. Many production systems combine multiple strategies or use multi-hop retrieval guided by LLM agents.
Unfortunately, “RAG” has become a buzzword with no shared definition. Some people use it to mean any retrieval system, others restrict it to vector databases. Focus on the ultimate goal: getting your LLM the context it needs to succeed. Whether that’s through vector search, agentic exploration, or hybrid approaches is a product and engineering decision.
Rather than following categorical advice to avoid or embrace RAG, experiment with different retrieval approaches and measure what works best for your application. For more info on RAG evaluation and optimization, see this series of posts.
Q: How should I approach evaluating my RAG system?
RAG systems have two distinct components that require different evaluation approaches: retrieval and generation.
The retrieval component is a search problem. Evaluate it using traditional information retrieval (IR) metrics. Common examples include Recall@k (of all relevant documents, how many did you retrieve in the top k?), Precision@k (of the k documents retrieved, how many were relevant?), or MRR (how high up was the first relevant document?). The specific metrics you choose depend on your use case. These metrics are pure search metrics that measure whether you’re finding the right documents (more on this below).
To evaluate retrieval, create a dataset of queries paired with their relevant documents. Generate this synthetically by taking documents from your corpus, extracting key facts, then generating questions those facts would answer. This reverse process gives you query-document pairs for measuring retrieval performance without manual annotation.
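A minimal sketch of the retrieval metrics mentioned above, given the ranked document IDs your retriever returns for a query and the known relevant IDs from your synthetic dataset:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Average each metric over all query-document pairs in your synthetic eval set.
```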
For the generation component—how well the LLM uses retrieved context, whether it hallucinates, whether it answers the question—use the same evaluation procedures covered throughout this course: error analysis to identify failure modes, collecting human labels, building LLM-as-judge evaluators, and validating those judges against human annotations.
Jason Liu’s “There Are Only 6 RAG Evals” provides a framework that maps well to this separation. His Tier 1 covers traditional IR metrics for retrieval. Tiers 2 and 3 evaluate relationships between Question, Context, and Answer—like whether the context is relevant (C|Q), whether the answer is faithful to context (A|C), and whether the answer addresses the question (A|Q).
In addition to Jason’s six evals, error analysis on your specific data may reveal domain-specific failure modes that warrant their own metrics. For example, a medical RAG system might consistently fail to distinguish between drug dosages for adults versus children, or a legal RAG might confuse jurisdictional boundaries. These patterns emerge only through systematic review of actual failures. Once identified, you can create targeted evaluators for these specific issues beyond the general framework.
Finally, when implementing Jason’s Tier 2 and 3 metrics, don’t just use prompts off the shelf. The standard LLM-as-judge process requires several steps: error analysis, prompt iteration, creating labeled examples, and measuring your judge’s accuracy against human labels. Once you know your judge’s True Positive and True Negative rates, you can correct its estimates to determine the actual failure rate in your system. Skip this validation and your judges may not reflect your actual quality criteria.
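That correction can be done with a simple estimator once the judge’s TPR and TNR are measured (a minimal sketch):

```python
def corrected_pass_rate(observed_pass_rate: float, tpr: float, tnr: float) -> float:
    # observed = true*TPR + (1 - true)*(1 - TNR), solved for the true pass rate.
    estimate = (observed_pass_rate + tnr - 1) / (tpr + tnr - 1)
    return min(1.0, max(0.0, estimate))  # clamp to [0, 1]

print(corrected_pass_rate(0.90, tpr=0.95, tnr=0.80))  # ≈ 0.93
```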
In summary, debug retrieval first using IR metrics, then tackle generation quality using properly validated LLM judges.
Q: How do I choose the right chunk size for my document processing tasks?
Unlike RAG, where chunks are optimized for retrieval, document processing assumes the model will see every chunk. The goal is to split text so the model can reason effectively without being overwhelmed. Even if a document fits within the context window, it might be better to break it up. Long inputs can degrade performance due to attention bottlenecks, especially in the middle of the context. Two task types require different strategies:
1. Fixed-Output Tasks → Large Chunks
These are tasks where the output length doesn’t grow with input: extracting a number, answering a specific question, classifying a section. For example:
- “What’s the penalty clause in this contract?”
- “What was the CEO’s salary in 2023?”
Use the largest chunk (with caveats) that likely contains the answer. This reduces the number of queries and avoids context fragmentation. However, avoid adding irrelevant text. Models are sensitive to distraction, especially with large inputs. The middle parts of a long input might be under-attended. Furthermore, if cost and latency are a bottleneck, you should consider preprocessing or filtering the document (via keyword search or a lightweight retriever) to isolate relevant sections before feeding a huge chunk.
2. Expansive-Output Tasks → Smaller Chunks
These include summarization, exhaustive extraction, or any task where output grows with input. For example:
- “Summarize each section”
- “List all customer complaints”
In these cases, smaller chunks help preserve reasoning quality and output completeness. The standard approach is to process each chunk independently, then aggregate results (e.g., map-reduce). When sizing your chunks, try to respect content boundaries like paragraphs, sections, or chapters. Chunking also helps mitigate output limits. By breaking the task into pieces, each piece’s output can stay within limits.
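A minimal sketch of the map-reduce pattern for expansive-output tasks; `call_llm` is a placeholder for your LLM client, and the prompts are illustrative:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client of choice."""
    raise NotImplementedError

def summarize_document(sections: list[str]) -> str:
    # Map: each section-sized chunk is summarized independently for local focus.
    partials = [call_llm(f"Summarize this section:\n\n{s}") for s in sections]
    # Reduce: aggregate the per-section summaries into one final output.
    return call_llm(
        "Combine these section summaries into a single coherent summary:\n\n"
        + "\n\n".join(partials)
    )
```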
General Guidance
It’s important to recognize why chunk size affects results. A larger chunk means the model has to reason over more information in one go – essentially, a heavier cognitive load. LLMs have limited capacity to retain and correlate details across a long text. If too much is packed in, the model might prioritize certain parts (commonly the beginning or end) and overlook or “forget” details in the middle. This can lead to overly coarse summaries or missed facts. In contrast, a smaller chunk bounds the problem: the model can pay full attention to that section. You are trading off global context for local focus.
No rule of thumb can perfectly determine the best chunk size for your use case – you should validate with experiments. The optimal chunk size can vary by domain and model. I treat chunk size as a hyperparameter to tune.
Q: How do I debug multi-turn conversation traces?
Start simple. Check if the whole conversation met the user’s goal with a pass/fail judgment. Look at the entire trace and focus on the first upstream failure. Read the user-visible parts first to understand if something went wrong. Only then dig into the technical details like tool calls and intermediate steps.
Multi-agent trace logging
For multi-agent flows, assign a session or trace ID to each user request and log every message with its source (which agent or tool), trace ID, and position in the sequence. This lets you reconstruct the full path from initial query to final result across all agents.
Annotation strategy
Annotate only the first failure in the trace initially—don’t worry about downstream failures since these often cascade from the first issue. Fixing upstream failures often resolves dependent downstream failures automatically. As you gain experience, you can annotate independent failure modes within the same trace to speed up overall error analysis.
Simplify when possible
When you find a failure, reproduce it with the simplest possible test case. Here’s an example: suppose a shopping bot gives the wrong return policy on turn 4 of a conversation. Before diving into the full multi-turn complexity, simplify it to a single turn: “What is the return window for product X1000?” If it still fails, you’ve proven the error isn’t about conversation context - it’s likely a basic retrieval or knowledge issue you can debug more easily.
Test case generation
You have two main approaches. First, simulate users with another LLM to create realistic multi-turn conversations. Second, use “N-1 testing” where you provide the first N-1 turns of a real conversation and test what happens next. The N-1 approach often works better since it uses actual conversation prefixes rather than fully synthetic interactions, but is less flexible.
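A minimal sketch of N-1 testing, assuming conversations are stored as lists of message dicts and that `generate_reply` and `passes` are stand-ins for your own pipeline and evaluator:

```python
def n_minus_1_test(conversation: list[dict], generate_reply, passes) -> bool:
    """Replay all but the last assistant turn, then judge the regenerated reply."""
    prefix = conversation[:-1]           # real conversation up to turn N-1
    candidate = generate_reply(prefix)   # your pipeline produces the next turn
    return passes(candidate, prefix)     # pass/fail check (assertion or aligned judge)
```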
The key is balancing thoroughness with efficiency. Not every multi-turn failure requires multi-turn analysis.
Q: How do I evaluate sessions with human handoffs?
Capture the complete user journey in your traces, including human handoffs. The trace continues until the user’s need is resolved or the session ends, not when AI hands off to a human. Log the handoff decision, why it occurred, context transferred, wait time, human actions, final resolution, and whether the human had sufficient context. Many failures occur at handoff boundaries where AI hands off too early, too late, or without proper context.
Evaluate handoffs as potential failure modes during error analysis. Ask: Was the handoff necessary? Did the AI provide adequate context? Track both handoff quality and handoff rate. Sometimes the best improvement reduces handoffs entirely rather than improving handoff execution.
Q: How do I evaluate complex multi-step workflows?
Log the entire workflow from initial trigger to final business outcome. Include LLM calls, tool usage, human approvals, and database writes in your traces. You will need this visibility to properly diagnose failures.
Use both outcome and process metrics. Outcome metrics verify the final result meets requirements: Was the business case complete? Accurate? Properly formatted? Process metrics evaluate efficiency: step count, time taken, resource usage. Process failures are often easier to debug since they’re more deterministic, so tackle them first.
Segment your error analysis by workflow stages. Early stage failures (understanding user input) differ from middle stage failures (data processing) and late stage failures (formatting output). Early stage improvements have more impact since errors cascade in LLM chains.
Use transition failure matrices to analyze where workflows break. Create a matrix showing the last successful state versus where the first failure occurred. This reveals failure hotspots and guides where to invest debugging effort.
Q: How do I evaluate agentic workflows?
We recommend evaluating agentic workflows in two phases:
1. End-to-end task success. Treat the agent as a black box and ask “did we meet the user’s goal?”. Define a precise success rule per task (exact answer, correct side-effect, etc.) and measure with human or aligned LLM judges. Take note of the first upstream failure when conducting error analysis.
Once error analysis reveals which workflows fail most often, move to step-level diagnostics to understand why they’re failing.
2. Step-level diagnostics. Assuming that you have sufficiently instrumented your system with details of tool calls and responses, you can score individual components such as:
- Tool choice: was the selected tool appropriate?
- Parameter extraction: were inputs complete and well-formed?
- Error handling: did the agent recover from empty results or API failures?
- Context retention: did it preserve earlier constraints?
- Efficiency: how many steps, seconds, and tokens were spent?
- Goal checkpoints: for long workflows, verify key milestones.
Example: “Find Berkeley homes under $1M and schedule viewings” breaks into: parameters extracted correctly, relevant listings retrieved, availability checked, and calendar invites sent. Each checkpoint can pass or fail independently, making debugging tractable.
Use transition failure matrices to understand error patterns. Create a matrix where rows represent the last successful state and columns represent where the first failure occurred. This is a great way to understand where the most failures occur.
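A minimal sketch of building such a matrix with pandas, assuming each failing trace has been annotated with its last successful state and first failing state (the toy states mirror the text-to-SQL example below):

```python
import pandas as pd

failures = pd.DataFrame({
    "last_success":  ["GenSQL", "GenSQL", "DecideTool", "GenSQL", "PlanCal"],
    "first_failure": ["ExecSQL", "ExecSQL", "PlanCal", "ExecSQL", "SendInvite"],
})

matrix = pd.crosstab(failures["last_success"], failures["first_failure"])
print(matrix)  # rows: last successful state; columns: where the first failure occurred
```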
Transition matrices transform overwhelming agent complexity into actionable insights. Instead of drowning in individual trace reviews, you can immediately see that GenSQL → ExecSQL transitions cause 12 failures while DecideTool → PlanCal causes only 2. This data-driven approach guides where to invest debugging effort. Here is another example from Bryan Bischof, also a text-to-SQL agent:
In this example, Bryan shows variation in transition matrices across experiments. How you organize your transition matrix depends on the specifics of your application. For example, Bryan’s text-to-SQL agent has an inherent sequential workflow which he exploits for further analytical insight. You can watch his full talk for more details.
Creating Test Cases for Agent Failures
Creating test cases for agent failures follows the same principles as our previous FAQ on debugging multi-turn conversation traces (i.e. try to reproduce the error in the simplest way possible, only use multi-turn tests when the failure actually requires conversation context, etc.).
👉 If you want to learn more about AI Evals, check out our AI Evals course. Here is a 35% discount code for readers. 👈
Footnotes
Eleanor Berger, our wonderful TA.↩︎
Paul Graham, “Writes and Write-Nots”↩︎
Shreya Shankar, et al., “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences”↩︎