“It’s Hard to Eval” Is a Product Smell

LLMs

evals

Your product should make it easy to verify AI outputs.

Author

Hamel Husain

Published

June 29, 2026

For the past 3 years, AI evals have been my professional focus.¹ The most common objection I hear to evals is “our product is hard to eval”.

This objection is a product smell. Artifacts that are hard for you to verify are often hard for users too. In the worst case, users have to redo the work from scratch to verify the output. More importantly, designing your product for ease of verification should come before building evals.

In this post, I’ll walk through three products I advised on that faced this issue. I’ll also show before and after sketches to demonstrate design principles. After these examples, I’ll discuss how to apply this general pattern to your product.

Example 1: the AI data agent

Almost every company I’ve worked with builds an internal AI data agent. You ask it a business question, like what was net revenue for Product A last quarter, and it finds relevant data sources, runs the queries, and provides an answer. The goal of this agent is to reduce dependency on data analysts.

A common mistake when building AI data agents is to make the answer the only output, as illustrated below.

Data Agent

What was net revenue for Product A last quarter?

Net revenue for Product A last quarter was $4.21M.

Ask anything about your business…➤

Since the only output is the answer, there is nothing here to check.

In the sketch above, the user has no way to verify the answer beyond redoing work.² A better design is to provide the user with checkable artifacts, informed by how a domain expert might validate the output. Here are techniques I use to validate metrics as a data scientist:

Compare the quantity and any intermediate calculations against a trusted source, like a vetted dashboard or report, or a similar analysis a colleague has already vetted.³
Confirm the metric definition precisely. A number like net revenue can include or exclude things like returns and discounts.
Sanity-check a related quantity. If I can’t verify the number directly, I pull a related number that should move with it, like units sold or unique customers, and check if the combination is plausible.
Look at what is beneath the aggregate. A total can hide problems, so I break it down by dimensions like region or time period and sanity-check the distribution.
Read the query. For an important number I look at the SQL to confirm it does what I think, and I tweak it and rerun to test my assumptions.
Note anything I could not verify. If a step has no trusted reference to check it against, I flag it instead of presenting it as settled.

Here’s what a better interface might look like. The two tabs below show the same answer at two levels of detail. The chat reply surfaces the details worth seeing up front, and the notebook holds the full analysis behind the answer. Use the tabs to switch between them.

Data Agent

What was net revenue for Product A last quarter?

Net revenue for Product A last quarter was $4.21M.ⓘ

Details

Adapted from “Quarterly revenue review”authored by Priya Nair · Jun 28, 2026 Open ↗

Assumptions

Metric definitiongross − returns − discounts matches governed definition

Intermediate calculations

Returns netted out−$0.7M no trusted sourceⓘ

Unique customers12,480 183 unmatchedⓘ, CRM ↔ Billing

Open as notebook

The agent surfaces the important details behind the answer. Select the Notebook tab above to see the full analysis generated by the agent.

This is the notebook the agent worked in while producing its answer. Scroll within the figure to see all the cells. The sidebar has a Contents tab for jumping between sections and an Assistant tab for asking follow-ups.

net_revenue_product_a.ipynb ▶ Run all ↑ Publish

◫ Adapted from Quarterly revenue review ↗, authored by Priya Nair · Jun 28, 2026

MarkdownBiH🔗☰

Net revenue — Product A, Q4 FY25

“What was net revenue for Product A last quarter?”

Short answer: $4.21M. This notebook shows how that number was built and what was checked against a trusted source.

1. Metric definition

Net revenue is gross − returns − discounts. Read the definition from the governed metrics layer so this matches what finance reports.

[1]

Python▶Code collapsed✦ Ask AI ⌘K

import yaml

defn = yaml.safe_load(open("metrics/net_revenue.yml"))
defn["expr"], defn["source"]

('gross - returns - discounts', 'finance.order_lines')

2. Net revenue for Product A

Pull net revenue for Product A for the quarter, straight from the order lines.

[2]

SQL▶Code collapsed✦ Ask AI ⌘K

SELECT SUM(gross - returns - discounts) AS net_revenue
FROM   finance.order_lines
WHERE  product = 'Product A'
  AND  fiscal_quarter = 'Q4-FY25';

net_revenue
4,210,442

That's the $4.21M the chat answer reported.

3. Sanity-check the distribution

Break Product A down by region, then plot it. Nothing should look out of place against last quarter's mix.

[3]

SQL▶Code collapsed✦ Ask AI ⌘K

SELECT region,
       SUM(gross - returns - discounts) AS net_revenue
FROM   finance.order_lines
WHERE  product = 'Product A'
  AND  fiscal_quarter = 'Q4-FY25'
GROUP BY region;

region	net_revenue
West	1,740,000
Central	1,160,000
East	890,000
Intl	420,000

[4]

Python▶Code collapsed✦ Ask AI ⌘K

m = by_region.set_index("region")["net_revenue"] / 1e6
m.plot.barh(title="Net revenue by region · Q4 FY25")

Net revenue by region · Q4 FY25

West1.74

Central1.16

East0.89

Intl0.42

00.51.01.52.0

net revenue ($M)

West and Central drive most of the revenue, with International a small tail.

4. What I could not verify

Two inputs have no trusted source to check against. Instead of treating them as settled, each is left below as a cell you can run and edit to dig in.

[5]

SQL▶Code collapsed✦ Ask AI ⌘K

-- the agent's -$0.7M returns figure is an estimate;
-- check the returns table for Q4 rows to back it
SELECT COUNT(*) AS n_rows, SUM(amount) AS returns
FROM   finance.returns
WHERE  product = 'Product A'
  AND  fiscal_quarter = 'Q4-FY25';

n_rows	returns
0	NULL

# Q4 returns haven't loaded yet, so the −$0.7M is an estimate

So that −$0.7M has no source to check against yet. The other open item is the customer join:

[6]

SQL▶Code collapsed✦ Ask AI ⌘K

-- 183 of 12,480 unique customers are in Billing, not CRM,
-- so some revenue can't be attributed. Pull them:
SELECT b.customer_id, b.amount
FROM   billing.invoices b
LEFT JOIN crm.customers c USING (customer_id)
WHERE  c.customer_id IS NULL
ORDER BY b.amount DESC
LIMIT 5;

customer_id	amount
BIL-44821	18,400
BIL-39105	12,950
BIL-50277	9,310
…	…

# 183 rows · $61,540 unattributed

That's $61,540 of revenue whose customer attribution is uncertain, surfaced so a person can resolve it before the total is trusted.

Contents ✦Assistant

New thread✕

What was net revenue for Product A last quarter?

✦

I pulled Product A's net revenue from the order lines using finance's governed definition, broke it out by region, and flagged what I couldn't verify. The cells are on the left.

✓ Generated the cells in this notebook

Ask a follow-up…

SQL [2]Auto ▾↑

The notebook reads top to bottom: the assumptions the agent made, the queries it ran, and an explicit list of what it could not verify.

There is a lot to unpack here. Here are notable changes:

The agent optionally performs retrieval from vetted analyses, and the interface shows which one was used along with who authored it.
There is progressive disclosure of details. The chat reply shows high value items like sources, assumptions, and issues. The user can optionally open an interactive notebook to see the full context.
The AI-generated notebook (see notebook tab above) is organized to promote verification: it opens with the assumptions the agent made, like the metric definition and where it came from, then shows the queries it ran and the numbers they returned. It breaks the total down so you can sanity-check the distribution, and it closes with a list of what it could not verify, each item left as a cell you can run.
The AI agent is also available in the notebook to help with follow-ups. Finally, the user can publish the notebook back to a knowledge base, where it can be retrieved by future analyses, creating a virtuous cycle.

This design sketch is far from perfect. The point is that the product should help the user verify the answer as a domain expert would. Compare it to the earlier approach, where the only output was the number.

Data agents like these are not science fiction. Hex⁴ is my favorite product in this genre; it integrates notebooks and chat better than anything I’ve seen. Here are screenshots from their landing page:

Notebook view which allows the user to see the intermediate steps and data.

But what does this have to do with evals? If you design your product for verification, annotation becomes less expensive and evals will have better signals to draw from. More importantly, you’ll provide your users with a better product.

Example 2: the PE curriculum builder

A founder I advised was building an AI tool that writes physical education lesson plans for K-12 teachers. A teacher enters their constraints, like the grade they teach, how long the class is, whether it meets indoors or outdoors, and what equipment they have. The tool then writes a lesson plan for those constraints. The goal is to save teachers the time they spend planning and give them a plan that fits their class. Here is a sketch of what the product looks like:

PE Planner

Class details

Grade Grade 4▾

Class length 45 minutes▾

Location Outdoors▾

Equipment 12 cones, 6 balls

Generate lesson plan

Warm-up · 8 min

Jog the perimeter, then dynamic stretches: arm circles, lunges, and high knees. Finish with a partner toss to warm up the hands.

Main activity · 30 min

Pair students and set cones 10 feet apart. Partners practice underhand throws, then overhand. Cue them to step with the opposite foot and follow through. Widen the gap as accuracy improves and rotate partners every 5 minutes.

Cool-down · 7 min

Static stretches and a quick recap of throwing cues: step, point, follow through.

Assessment · notes

Watch for a stepped throw with the opposite foot forward and eyes on the target. Note students who need a shorter throwing distance next class.

Equipment

12 cones, 6 foam balls, 3 station markers.

The teacher enters constraints and the tool writes a plan from scratch. The only output is the plan, so its difficult to verify.

The founder asked me how to eval the lesson plans. I turned the question around: what does a teacher care about?

The fastest way to trust a plan is to see that a teacher like them already uses it. Additionally, teachers value visibility into what others are doing so they can learn new approaches. Therefore, a better design might start from vetted lesson plans that are actively used in schools. When the tool generates a plan, it shows which vetted plan it started from, who uses that plan, and a diff of what it changed for this teacher’s constraints.

Next, the teacher can check a small set of changes against a plan they already trust, instead of judging a whole plan from scratch. Here’s a sketch of what a better interface might look like:

PE Planner

View original

DR Adapted from Dana Ruiz's plan

Grade 4 PE · Lincoln Elementary, Austin TX

Used at 14 schools Run 30+ times this year Aligned to SHAPE standards

Full plan

Warm-up · 8 min

1Jog 1 lap to the far cones and back, then dynamic stretches: arm circles, lunges, and high knees.

Finish with partner toss to warm up the hands, 10 throws each.

Main activity · 30 min

23 stations with cones as targets. Pairs throw underhand then overhand with 6 balls, rotating every 5 minutes.

Cue students to step with the opposite foot and follow through toward the target.

Widen the distance between partners as accuracy improves.

Cool-down · 7 min

Static stretches and a recap of throwing cues: step, point, follow through.

Assessment · notes

Watch for a stepped throw with the opposite foot forward and eyes on the target.

Note students who need a shorter throwing distance next class.

Equipment

12 cones, 6 foam balls, 3 station markers.

What changed · 2 editsAccept allReject all

1 Warm-up

−Jog 2 laps, then dynamic stretches (10 min)

+Jog 1 lap, then dynamic stretches (8 min)

Shortened to fit a 45-minute class

Accept ⌘⏎ Reject ⌘⌫

2 Main activity

−4 stations, 8 balls, hoops as targets

+3 stations, 6 balls, cones as targets

Matched to your equipment

Accept ⌘⏎ Reject ⌘⌫

The plan is anchored to a vetted plan another teacher uses. The changes are marked so the teacher can check a diff against a plan they trust.

In this version, most of the plan is inherited from a vetted plan. The teacher’s review is scoped to a few edits, each with a reason explaining why the change was made. This is a more efficient way to review a plan because it reduces the cognitive load of judging a whole plan from scratch.

Designing for this makes the product simpler to build. Instead of stuffing hundreds of examples into a prompt, the tool captures important dimensions, retrieves a close match, and adapts it. Automated evals now become tractable because there is less surface area to test. For example, you can verify that the plan retrieval picked a sensible anchor, and each edit honors the constraints.

Example 3: the workers-comp medical report

The last example comes from a workers’ compensation tool a founder asked me to help with. It reads a patient’s chart (intake forms, imaging reports, therapy notes, prior exams) and generates a long expert opinion report, often fifty pages or more. Here is a sketch of the product:

ClaimDraft –□✕

FileEditCaseReportViewHelp

Source records · 18

Intake questionnaire

MRI, lumbar spine

X-ray report

Physical therapy notes

Treating physician notes

Prior IME

Work restrictions

+ 11 more

Claimant J. DoeClaim #WC-20259417Date of injury Mar 3, 2025Examiner Dr. A. Patel

Independent Medical Evaluation

Permanent & Stationary Report

1. History of injury

The claimant is a 47-year-old warehouse associate who reports a lumbar spine injury on March 3, 2025 while lifting a carton estimated at sixty pounds. He describes immediate low back pain radiating into the right lower extremity, followed by numbness along the lateral calf.

2. Review of records

Records reviewed include the intake questionnaire, an MRI of the lumbar spine dated March 18, 2025, twelve physical therapy notes, and the treating physician's progress reports through August 2025.

3. Physical examination

On examination, lumbar flexion was limited to 40 degrees with pain. Straight leg raise was positive on the right at 50 degrees. Strength was 4 of 5 in the right extensor hallucis longus, with diminished sensation in the L5 distribution.

4. Diagnoses

Lumbar disc herniation at L5-S1 with associated right L5 radiculopathy, supported by the imaging and examination findings above.

5. Causation analysis

Within reasonable medical probability, the disc herniation is causally related to the industrial lifting event of March 3, 2025. The claimant had no documented history of lumbar treatment prior to that date.

Page 1 of 52Generated from 18 records

The only output is a fifty-page narrative. To trust it, the doctor has to re-read the whole chart and check every claim, which can take as long as writing the report from scratch.

The problem is the same as the other examples, but the stakes are higher. The only output is the report, and the doctor is the one accountable for it. To trust it they have to go back through the chart and confirm the facts and inferences themselves. That can take as long as writing the report from scratch, which defeats the point of the tool.

You might object that a fifty-page opinion is hard to verify. That is true, and the product should not pretend otherwise. Helping a doctor understand the evidence is arguably more valuable than the finished document. Therefore, I advised the founder to make the product work like a research assistant instead of a report generator.

For example, the product could read every record and pull out relevant facts, with a link back to the page so the doctor can check each one. Where two exams disagree, or the chart leaves a question open, the product should surface that. The doctor can then resolve any contradictions and fill in the gaps. Finally, the product can assemble the final report from what they have already checked. Here is a sketch of what this might look like:

ClaimDraft –□✕

FileEditCaseReviewReportHelp

Claimant J. DoeClaim #WC-20259417Date of injury Mar 3, 2025Examiner Dr. A. Patel

Records · 18

MRI, lumbar spine✓

Physical therapy notes✓

Treating physician notes✓

Prior IME○

Intake questionnaire○

X-ray report○

9 of 18 reviewed

1Contradiction

Straight-leg-raise is recorded as positive on the right by the treating physician (p. 31) and negative by the prior IME (p. 22).

ResolveOpen both pages

2Key fact

The intake form notes a prior lumbar strain in 2019, which bears on causation (p. 14).

IncludeDismiss

3Open question

No imaging appears in the file after March 18, 2025, so the current status of the herniation is unconfirmed.

Add a note

4Key fact

The MRI report describes pre-existing degenerative changes at L4-L5 (p. 9).

IncludeDismiss

6 of 24 findings still need reviewGenerate report draft

The product walks the doctor through the evidence first, surfacing each finding with a link back to the chart. Afterwards, the final report is built from the facts they reviewed.

The research assistant version of this product allows the doctor to build trust by verifying facts as they go. Similar to the other examples, this design is easier to build and evaluate. Now there are scoped units to grade, such as whether a contradiction is real or whether a citation supports a claim.

Generalizing the pattern

It is important to understand how users verify your product’s AI artifacts. Sometimes, this may require assembling supporting evidence. In other cases, it could mean refactoring the entire workflow so that the user is in the loop (like the workers’ comp example).

Below are questions that can guide your product’s design for verification:

What does the user actually need to check?
What trusted thing can they compare it against?
Are there signals or heuristics that experts use to aid in verification?
What smaller units can they accept, edit, or reject?

A common thread across these examples is provenance. The fastest way to make an output checkable is to show where each part came from, with links to see more detail. Additionally, you can use progressive disclosure so these sources don’t overwhelm the user.

What needs verifying also changes as the user’s trust grows. Early on, the data agent should make provenance obvious, like where a metric definition came from. Once the user trusts the agent gets it right, that detail can collapse by default in the card. Good design meets users where they are instead of showing everything.

These principles hold even when a product seems easy to eval. Coding is a good example: it is one of the most verifiable kinds of work, with tests, types, and diffs. Even so, some coding agents go the extra mile to make their work checkable. Cursor and Devin both record a short video of the UI changes they make, so you can confirm the work is right without reproducing it yourself.⁵

None of this is new

Evals thinking is aligned with good product design. Gathering supporting data and breaking down workflows into smaller units makes automated grading easier. However, I don’t want to pretend like any wisdom here is new.

All of these ideas stem from well-established design principles. For example, watching an expert work to learn what they check before you build is called needfinding.⁶ In research-heavy work like the medical case, there is a design goal called sensemaking, which is the work of building a structured understanding of a body of evidence you can reason over.⁷ There are many other concepts, but I think you get the idea.

Even though these ideas are well established, a reminder is due in the age of AI. Before AI, verification often happened incidentally during the process of creating work product. With AI, verification is the bottleneck. It is time to think about it more explicitly.

Video Walkthrough

I recorded a short video walking through these examples. If you’d rather watch than read, here it is:

Thanks to Shreya Shankar and Isaac Flath for feedback on this post.

Footnotes

The AI evals topic hub collects the rest of my evals work. I also co-teach the AI Evals for Engineers and PMs course.↩︎
Lenny Rachitsky tweeted about this recently: most of one data science team’s work is now reviewing half-baked AI analysis from PMs and engineers, and half of it is wrong.↩︎
When I worked at Airbnb we had an internal tool called the Knowledge Repo, a place where data scientists published notebooks of their deep dives on analytics, modeling, and so on (they wrote about it here). It was one of the fastest ways to get context on a new project, since you could read what someone had already worked out. I don’t know if it is still in use, but the paradigm is a good one.↩︎
Bryan Bischof led the creation of the AI at Hex. Bryan is a data scientist himself, the kind of domain expert the product serves, and I think that is part of why it is designed well.↩︎
See a demo of Cursor recording its UI changes, and Devin’s documentation on testing and recordings.↩︎
Dev Patnaik and Robert Becker, “Needfinding: The Why and How of Uncovering People’s Needs”, Design Management Journal.↩︎
Daniel M. Russell, Mark J. Stefik, Peter Pirolli, and Stuart K. Card, “The Cost Structure of Sensemaking”.↩︎