Q: What are LLM Evals?

Categories: LLMs, evals, faq, faq-individual
Published: July 28, 2025

If you are completely new to product-specific LLM evals (not foundation model benchmarks), see these posts: part 1, part 2 and part 3. Otherwise, keep reading.

Your AI Product Needs Evals (Evaluation Systems)

Contents:

  1. Motivation
  2. Iterating Quickly == Success
  3. Case Study: Lucy, A Real Estate AI Assistant
  4. The Types Of Evaluation
    1. Level 1: Unit Tests
    2. Level 2: Human & Model Eval
    3. Level 3: A/B Testing
    4. Evaluating RAG
  5. Eval Systems Unlock Superpowers For Free
    1. Fine-Tuning
    2. Data Synthesis & Curation
    3. Debugging

Creating a LLM-as-a-Judge That Drives Business Results

Contents:

  1. The Problem: AI Teams Are Drowning in Data
  2. Step 1: Find The Principal Domain Expert
  3. Step 2: Create a Dataset
  4. Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques
  5. Step 4: Fix Errors
  6. Step 5: Build Your LLM as A Judge, Iteratively
  7. Step 6: Perform Error Analysis
  8. Step 7: Create More Specialized LLM Judges (if needed)
  9. Recap of Critique Shadowing
  10. Resources

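As a rough illustration of Step 5 ("Build Your LLM as A Judge, Iteratively"), the sketch below asks a judge model for a binary pass/fail verdict plus a written critique. The prompt wording, the `gpt-4o` model choice, and the JSON schema are assumptions made for illustration, not the post's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model name below is illustrative

def judge_reply(request: str, reply: str) -> dict:
    """Ask an LLM judge for a pass/fail verdict plus a written critique."""
    prompt = (
        "You are the principal domain expert reviewing an AI assistant's reply.\n"
        "Judge whether the reply correctly and completely answers the user's request.\n"
        'Answer with JSON only: {"critique": "<one short paragraph>", "verdict": "pass" or "fail"}\n\n'
        f"User request:\n{request}\n\n"
        f"Assistant reply:\n{reply}\n"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; production code would add parsing guards.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    verdict = judge_reply(
        request="Do you have any pet-friendly 2-bedroom apartments?",
        reply="Yes! We have two pet-friendly 2-bedroom units; want me to send details?",
    )
    print(verdict["verdict"], "-", verdict["critique"])
```

The point of the critique field is alignment: you compare the judge's critiques against the domain expert's until the two agree, rather than trusting a bare pass/fail score.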
A Field Guide to Rapidly Improving AI Products

Contents:

  1. How error analysis consistently reveals the highest-ROI improvements
  2. Why a simple data viewer is your most important AI investment
  3. How to empower domain experts (not just engineers) to improve your AI
  4. Why synthetic data is more effective than you think
  5. How to maintain trust in your evaluation system
  6. Why your AI roadmap should count experiments, not features



This article is part of our AI Evals FAQ, a collection of common questions (and answers) about LLM evaluation.