Evals Skills for Coding Agents

Today, I’m publishing evals-skills, a set of skills for AI product evals1. They distill what I’ve learned implementing evals at 50+ companies and teaching them to 4,000+ students in our course.
Why Skills for Evals
Coding agents now instrument applications, run experiments, analyze data, and build interfaces. I’ve been pointing them at evals.
OpenAI’s Harness Engineering article makes the case well: they built a product entirely with Codex agents (~1 million lines of code, 1,500 PRs, three engineers, five months) and found that improving the infrastructure around the agent yielded better returns than improving the model. Their agents queried distributed traces to verify their own work against runtime evidence. Documentation tells the agent what to do. Telemetry tells it whether it worked. Evals apply the same principle to AI output quality.
All major eval vendors now ship an MCP server2. The tedious parts (instrumenting your app, orchestrating experiments, building annotation tools) now belong to coding agents.
But an agent with an eval platform still needs to know what to do with it. Say a support bot tells a customer “your plan includes free returns” when it doesn’t. Another says “I’ve canceled your order” when nobody asked. Both are hallucinations, but one gets a fact wrong and the other makes up a user action. If you lump them together in a generic “hallucination score,” real problems will likely go undetected.
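To make the point concrete, here is a minimal sketch of what keeping failure modes separate looks like in practice. The mode names (`fabricated_fact`, `fabricated_action`) and the label format are illustrative assumptions, not part of any vendor API; the labels would come from separate binary judges or human review:

```python
# Illustrative sketch: report one failure rate per mode instead of a
# single lumped "hallucination score". Mode names are hypothetical.
from collections import Counter

def failure_mode_rates(labels):
    """labels: one dict per trace, mapping failure mode -> bool (True = failed).
    Returns the failure rate for each mode seen in the data."""
    counts = Counter()
    for row in labels:
        for mode, failed in row.items():
            if failed:
                counts[mode] += 1
    n = len(labels)
    return {mode: count / n for mode, count in counts.items()}

labels = [
    {"fabricated_fact": True,  "fabricated_action": False},
    {"fabricated_fact": False, "fabricated_action": True},
    {"fabricated_fact": False, "fabricated_action": False},
    {"fabricated_fact": True,  "fabricated_action": False},
]
print(failure_mode_rates(labels))
# -> {'fabricated_fact': 0.5, 'fabricated_action': 0.25}
```

A lumped score over the same four traces would say “75% hallucination-free” and hide that fact errors are twice as common as fabricated actions.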
These skills fill in the gaps. They complement the vendor MCP servers: those give your agent access to traces and experiments, these teach it what to do with them.
The Skills
If you’re new to evals or inheriting an existing eval pipeline, start with eval-audit. It inspects your current setup (or lack of one), runs diagnostic checks across six areas (error analysis, evaluator design, judge validation, human review, labeled data, pipeline hygiene), and produces a prioritized list of problems with next steps. Install the skills and give your agent this prompt:
Install the eval skills plugin from https://github.com/hamelsmu/evals-skills, then run /evals-skills:eval-audit on my eval pipeline. Investigate each diagnostic area using a separate subagent in parallel, then synthesize the findings into a single report. Use other skills in the plugin as recommended by the audit.
If you’re experienced with evals, you can skip the audit and pick the skill you need:
| Skill | What it does |
|---|---|
| error-analysis | Read traces, categorize failures, build a vocabulary of what’s broken |
| generate-synthetic-data | Create diverse test inputs when real data is sparse |
| write-judge-prompt | Design binary Pass/Fail LLM-as-Judge evaluators |
| validate-evaluator | Calibrate judges against human labels using TPR/TNR and bias correction |
| evaluate-rag | Evaluate retrieval and generation quality separately |
| build-review-interface | Generate annotation interfaces for human trace review |
These skills are a starting point. They only cover parts of evals that generalize across projects. Skills grounded in your stack, your domain, and your data will outperform them. Start here, then write your own.
👉 The repo is here: github.com/hamelsmu/evals-skills 👈
Footnotes
Not foundation model benchmarks like MMLU or HELM that measure general LLM capabilities. Product evals measure whether your pipeline works on your task with your data. If you aren’t familiar with product-specific AI evals, check out my AI Evals FAQ.↩︎
Braintrust, LangSmith, Phoenix, Truesight, and others.↩︎