<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Hamel&#39;s Blog - Hamel Husain</title>
<link>https://hamel.dev/</link>
<atom:link href="https://hamel.dev/index.xml" rel="self" type="application/rss+xml"/>
<description>Notes on applied AI engineering, machine learning, and data science.</description>
<image>
<url>https://hamel.dev/quarto.png</url>
<title>Hamel&#39;s Blog - Hamel Husain</title>
<link>https://hamel.dev/</link>
<height>81</height>
<width>144</width>
</image>
<generator>quarto-1.8.25</generator>
<lastBuildDate>Mon, 02 Mar 2026 08:00:00 GMT</lastBuildDate>
<item>
  <title>Evals Skills for Coding Agents</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/evals-skills/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p><img src="https://hamel.dev/blog/posts/evals-skills/cover-original.png" class="img-fluid"></p>
<p>Today, I’m publishing <a href="https://github.com/hamelsmu/evals-skills">evals-skills</a>, a set of skills for AI product evals<sup>1</sup>. They guard against common mistakes I’ve seen helping 50+ companies and teaching 4,000+ students in our <a href="https://maven.com/parlance-labs/evals">course</a>.</p>
<section id="why-skills-for-evals" class="level2">
<h2 class="anchored" data-anchor-id="why-skills-for-evals">Why Skills for Evals</h2>
<p>Coding agents now instrument applications, run experiments, analyze data, and build interfaces. I’ve been pointing them at evals.</p>
<p>OpenAI’s Harness Engineering <a href="https://openai.com/index/harness-engineering/">article</a> makes the case well. They built a product entirely with Codex agents — three engineers, five months, ~1 million lines of code — and found that <strong>improving the infrastructure around the agent</strong> mattered more than improving the model. The agents queried traces to verify their own work. Documentation tells the agent what to do. Telemetry tells it whether it worked. Evals tell it whether the output is good.</p>
<p>All major eval vendors now ship an MCP server<sup>2</sup>. The tedious parts (instrumenting your app, orchestrating experiments, and building annotation tools) now fall to coding agents.</p>
<p>But an agent with an eval platform still needs to know what to do with it. Say a support bot tells a customer “your plan includes free returns” when it doesn’t. Another says “I’ve canceled your order” when nobody asked. Both are hallucinations, but one gets a fact wrong and the other makes up a user action. If you lump them together in a generic “hallucination score,” you’ll miss the distinction that tells you what to fix.</p>
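<p>One way to encode that distinction, sketched here with hypothetical prompt wording (not taken from the evals-skills repo), is to give each failure mode its own binary judge rather than one blended score:</p>

```python
# Hypothetical sketch: two separate binary Pass/Fail judges, one per
# failure mode, instead of a single generic "hallucination score".
# Prompt wording is illustrative only.

FACTUAL_JUDGE = """Does the response state a policy or fact that the
provided context does not support? Answer Pass (fully supported) or Fail.

Context: {context}
Response: {response}"""

ACTION_JUDGE = """Does the response claim to have performed an action
(refund, cancellation, account change) the user never requested?
Answer Pass or Fail.

Response: {response}"""

# Each judge gets its own labeled examples and its own pass rate.
prompt = FACTUAL_JUDGE.format(
    context="Plan: Basic. Returns: customer pays shipping.",
    response="Your plan includes free returns.",
)
```

<p>Tracking the two pass rates separately tells you which failure mode to fix first; a single blended score hides that.</p>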
<p>These skills fill the gaps. They complement the vendor MCP servers: those give your agent access to traces and experiments, these teach it what to do with them.</p>
</section>
<section id="the-skills" class="level2">
<h2 class="anchored" data-anchor-id="the-skills">The Skills</h2>
<p>If you’re new to evals or inheriting an existing eval pipeline, start with <strong>eval-audit</strong>. It inspects your current setup (or lack of one), runs diagnostic checks across six areas, and produces a prioritized list of problems with next steps. Install the skills or give your agent this prompt:</p>
<blockquote class="blockquote">
<p>Install the eval skills plugin from https://github.com/hamelsmu/evals-skills, then run /evals-skills:eval-audit on my eval pipeline. Investigate each diagnostic area using a separate subagent in parallel, then synthesize the findings into a single report. Use other skills in the plugin as recommended by the audit.</p>
</blockquote>
<p>If you’re experienced with evals, you can skip the audit and pick the skill you need:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 35%">
<col style="width: 65%">
</colgroup>
<thead>
<tr class="header">
<th>Skill</th>
<th>What it does</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>error-analysis</td>
<td>Read traces, categorize failures, build a vocabulary of what’s broken</td>
</tr>
<tr class="even">
<td>generate-synthetic-data</td>
<td>Create diverse test inputs when real data is sparse</td>
</tr>
<tr class="odd">
<td>write-judge-prompt</td>
<td>Design binary Pass/Fail LLM-as-Judge evaluators</td>
</tr>
<tr class="even">
<td>validate-evaluator</td>
<td>Calibrate judges against human labels using TPR/TNR and bias correction</td>
</tr>
<tr class="odd">
<td>evaluate-rag</td>
<td>Evaluate retrieval and generation quality separately</td>
</tr>
<tr class="even">
<td>build-review-interface</td>
<td>Generate annotation interfaces for human trace review</td>
</tr>
</tbody>
</table>
<p><br></p>
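<p>As an illustration of the kind of calibration math a skill like validate-evaluator involves, here is the standard sensitivity/specificity correction (the Rogan-Gladen estimator). This is an assumed sketch, not the skill's actual implementation:</p>

```python
# Sketch of judge calibration against human labels (assumed math, not
# the exact evals-skills implementation). 1 = Fail, 0 = Pass.
human = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]   # human verdicts (ground truth)
judge = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]   # LLM judge's verdicts

tp = sum(h and j for h, j in zip(human, judge))
tn = sum((not h) and (not j) for h, j in zip(human, judge))
tpr = tp / sum(human)                     # sensitivity: true Fails caught
tnr = tn / (len(human) - sum(human))      # specificity: true Passes kept

# Observed failure rate on unlabeled traces, corrected for judge bias:
# theta = (p_obs + TNR - 1) / (TPR + TNR - 1)
p_obs = 0.30
theta = (p_obs + tnr - 1) / (tpr + tnr - 1)
print(round(tpr, 2), round(tnr, 2), round(theta, 3))  # → 0.8 0.8 0.167
```

<p>With a judge that catches 80% of true failures and passes 80% of true successes, an observed 30% failure rate corresponds to a corrected estimate of about 17%.</p>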
<p>These skills are a starting point and only encode common mistakes that generalize across projects. Skills grounded in your stack, your domain, and your data will outperform them. Start here, then write your own.</p>
<p>👉 The repo is here: <a href="https://github.com/hamelsmu/evals-skills">github.com/hamelsmu/evals-skills</a> 👈</p>
<p>If these skills help you, I’d love to hear from you! You can find me on <a href="https://x.com/hamelhusain">X</a> or email me through my <a href="https://ai.hamel.dev">newsletter</a>.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Not foundation model benchmarks like MMLU or HELM that measure general LLM capabilities. Product evals measure whether <em>your</em> pipeline works on <em>your</em> task with <em>your</em> data. If you aren’t familiar with product-specific AI evals, check out my <a href="../../../blog/posts/evals-faq/index.html">AI Evals FAQ</a>.↩︎</p></li>
<li id="fn2"><p><a href="https://www.braintrust.dev/docs/reference/mcp">Braintrust</a>, <a href="https://github.com/langchain-ai/langsmith-mcp-server">LangSmith</a>, <a href="https://github.com/Arize-ai/phoenix/tree/main/js/packages/phoenix-mcp">Phoenix</a>, <a href="https://truesight.goodeyelabs.com/docs/mcp-integration">Truesight</a>, and others.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>Evals</category>
  <guid>https://hamel.dev/blog/posts/evals-skills/</guid>
  <pubDate>Mon, 02 Mar 2026 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/evals-skills/cover-gradient.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Why I Stopped Using nbdev</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/ai-stack/</link>
  <description><![CDATA[ 





<p>Programmers love to proclaim they’ve found the best tool. Paul Graham called Lisp his “<a href="https://paulgraham.com/avg.html">secret weapon</a>.” DHH described Ruby as “<a href="https://rubyonrails.org/doctrine">a magical glove that just fit my brain perfectly</a>.” Pieter Levels ships million-dollar products with <a href="https://lexfridman.com/pieter-levels-transcript/">vanilla PHP and jQuery</a>.</p>
<p>These declarations aren’t about the languages themselves. They’re about developers finding tools that fit how they think. When the environment clicks, you move fast.</p>
<p>I had that experience with <a href="https://www.fast.ai/posts/2022-07-28-nbdev2.html">nbdev</a>, a development environment for literate programming that I helped build and maintain<sup>1</sup>. I created hundreds of projects with it and was one of its <a href="https://www.youtube.com/watch?v=rX1yGxJijsI">biggest</a> <a href="https://github.blog/developer-skills/programming-languages-and-frameworks/nbdev-a-literate-programming-environment-that-democratizes-software-engineering-best-practices/">proponents</a>.</p>
<p>Today, I no longer use it. AI coding tools changed the trade-offs.</p>
<section id="fighting-the-ai" class="level2">
<h2 class="anchored" data-anchor-id="fighting-the-ai">Fighting the AI</h2>
<p>The beauty of nbdev is its workflow. You write code, documentation, and tests in one source of truth: Jupyter notebooks. Afterwards, these notebooks are transpiled into a Python library and documentation website.</p>
<p>This workflow is idiosyncratic. AI coding tools, trained on vast amounts of conventional source code, get confused. They struggle to differentiate between editing the notebook and editing the final source code. It feels like fighting the AI instead of working with it.</p>
<p>I write software to solve problems, not to write code. I want to work in an environment where AI has the highest chance of success. With nbdev, I was swimming upstream.</p>
<p>Some argue that AI tools encourage lazy thinking: that without guardrails, developers skip the hard work of breaking problems into steps. But thinking step-by-step is a human skill. Notebooks don’t force you to write clean code. AI tools don’t force you to think carefully. Discipline comes from the developer, not the environment.</p>
</section>
<section id="tools-dont-matter-as-much-as-i-thought" class="level2">
<h2 class="anchored" data-anchor-id="tools-dont-matter-as-much-as-i-thought">Tools Don’t Matter As Much As I Thought</h2>
<p>A central promise of literate programming is better documentation. By keeping code and docs in one place, you reduce the chance they become stale.</p>
<p>Strangely, many nbdev projects lacked sufficient documentation for my taste. Sometimes, this helped me learn a codebase by <a href="https://fastpages.fast.ai/fastcore/">contributing</a> to the docs. Other times, it was frustrating. This reinforced my belief that good documentation comes from effort, not tooling.</p>
<p>This workflow is also less compelling now. AI can read a codebase without documentation and give you an overview on the fly. It can help maintain documentation that is separate from the code, handling the tedious parts. Keeping code and docs together isn’t the selling point it used to be.</p>
</section>
<section id="collaboration-and-adoption" class="level2">
<h2 class="anchored" data-anchor-id="collaboration-and-adoption">Collaboration and Adoption</h2>
<p>nbdev asks developers to adopt a different system. It does not meet them where they are. Cursor won because it felt familiar and let developers change their habits slowly, rather than demanding a new workflow on day one.</p>
<p>I didn’t worry about collaboration as much before. But collaborating with AI is table stakes. The same impediments that get in the way of collaborating with humans tend to get in the way of collaborating with AI.</p>
<p>Today, developers cover an increasingly broad scope. Backend people do frontend. PMs create prototypes. Everyone is more polyglot. Idiosyncratic frameworks isolate you from your team to a greater extent than ever before. Idiosyncratic tooling once had a hidden upside: it filtered for a certain kind of contributor. Now I believe it’s more of a liability.</p>
</section>
<section id="what-im-using-now" class="level2">
<h2 class="anchored" data-anchor-id="what-im-using-now">What I’m Using Now</h2>
<p>Because I have invested thousands of hours into nbdev, it’s difficult to admit there are better tools for the outcomes I want. But I must check my ego at the door<sup>2</sup>.</p>
<p>My daily drivers now include Amp, Cursor, and Claude Code. I still enjoy notebooks, but only for data analysis, machine learning, or other exploratory workflows where the iterative, visual nature of notebooks shines.</p>
<p>More importantly, AI has nudged me out of my previous “Python for everything” mindset. I now use different languages for different tasks. For web development, I prefer the Next.js stack. Using a notebook for web development (even with specialized tooling) adds unnecessary complexity for me.</p>
<p>This isn’t arbitrary preference. AI performs best on code with abundant training data for specific domains. <a href="https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/">TypeScript recently overtook both Python and JavaScript on GitHub</a>, driven partly by the fact that typed languages make AI-generated code more reliable in production<sup>3</sup>. Even Jane Street, famous for its OCaml-heavy infrastructure, <a href="https://www.efinancialcareers.com/news/python-ocaml-jane-street-ai">now uses Python</a> for machine learning and data work.</p>
</section>
<section id="a-place-for-joy" class="level2">
<h2 class="anchored" data-anchor-id="a-place-for-joy">A Place for Joy</h2>
<p>None of this means idiosyncratic tools are worthless. Lisp, Haskell, and APL each teach you something different about computing. Joy is a valid reason to choose a language, even with less AI support. It’s just not my focus right now. My joy resides in solving problems, and I want tools that maximize my leverage. For that, conventional wins.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I <a href="https://github.blog/developer-skills/programming-languages-and-frameworks/nbdev-a-literate-programming-environment-that-democratizes-software-engineering-best-practices/">joined the nbdev project in 2020</a> while at GitHub. That same year I built <a href="https://www.youtube.com/watch?v=cduXZwZaBbM">fastpages</a>, a notebook blogging system that informed nbdev’s documentation approach. Along the way I contributed to tools like <a href="https://github.blog/developer-skills/programming-languages-and-frameworks/learn-about-ghapi-a-new-third-party-python-client-for-the-github-api/">ghapi</a> and <a href="https://fastpages.fast.ai/fastcore/">fastcore</a>. In 2022, I helped lead a <a href="https://www.fast.ai/posts/2022-07-28-nbdev2.html">complete rewrite</a> of nbdev. I discussed the philosophy behind this work on the <a href="https://vanishinggradients.fireside.fm/9">Vanishing Gradients Podcast</a> and at <a href="https://www.youtube.com/watch?v=rX1yGxJijsI">Data Council</a>. I was briefly interested in commercializing nbdev, but <a href="../../../blog/posts/nbdev/">decided not to pursue it</a>.↩︎</p></li>
<li id="fn2"><p>Other top nbdev maintainers and power users like <a href="https://github.com/sgugger">Sylvain Gugger</a>, <a href="https://github.com/seeM">Wasim Lorgat</a>, <a href="https://github.com/isaac-flath">Isaac Flath</a>, <a href="https://github.com/muellerzr">Zach Mueller</a>, and <a href="https://ohmeow.com/">Wayde Gilliam</a> have made similar moves.↩︎</p></li>
<li id="fn3"><p>Research shows 94% of LLM compilation errors in TypeScript come from type violations, suggesting type systems can guide better code generation. See <a href="https://arxiv.org/abs/2504.09246">Mündler et al., 2025</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>AI</category>
  <category>nbdev</category>
  <category>Tools</category>
  <guid>https://hamel.dev/blog/posts/ai-stack/</guid>
  <pubDate>Sun, 18 Jan 2026 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/ai-stack/nbdev2-cover.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>LLM Evals: Everything You Need to Know</title>
  <dc:creator>Hamel Husain</dc:creator>
  <dc:creator>Shreya Shankar</dc:creator>
  <link>https://hamel.dev/blog/posts/evals-faq/</link>
  <description><![CDATA[ 

<script async="" data-uid="df8af2b6ed" src="https://hamel.kit.com/df8af2b6ed/index.js"></script>




<p>This document curates the most common questions Shreya and I received while <a href="https://bit.ly/evals-ai" target="_blank">teaching</a> 700+ engineers &amp; PMs AI Evals. <em>Warning: These are sharp opinions about what works in most cases. They are not universal truths. Use your judgment.</em></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands-on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>
<section id="listen-to-the-audio-version-of-this-faq" class="level1">
<h1>Listen to the audio version of this FAQ</h1>
<p>If you prefer to listen to the audio version (narrated by AI), you can play it <a href="https://soundcloud.com/hamel-husain/llm-evals-faq">here</a>.</p>
<iframe width="100%" height="166" scrolling="no" frameborder="no" allow="autoplay" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/2138083206&amp;color=%23447099&amp;auto_play=false&amp;hide_related=true&amp;show_comments=false&amp;show_user=true&amp;show_reposts=false&amp;show_teaser=true">
</iframe>
<div style="font-size: 10px; color: #cccccc;line-break: anywhere;word-break: normal;overflow: hidden;white-space: nowrap;text-overflow: ellipsis; font-family: Interstate,Lucida Grande,Lucida Sans Unicode,Lucida Sans,Garuda,Verdana,Tahoma,sans-serif;font-weight: 100;">
<a href="https://soundcloud.com/hamel-husain" title="Hamel Husain" target="_blank" style="color: #cccccc; text-decoration: none;">Hamel Husain</a> · <a href="https://soundcloud.com/hamel-husain/llm-evals-faq" title="LLM Evals FAQ" target="_blank" style="color: #cccccc; text-decoration: none;">LLM Evals FAQ</a>
</div>
</section>
<section id="getting-started-fundamentals" class="level1">
<h1>Getting Started &amp; Fundamentals</h1>
<section id="q-what-are-llm-evals" class="level2">
<h2 class="anchored" data-anchor-id="q-what-are-llm-evals">Q: What are LLM Evals?</h2>
<p>If you are completely new to product-specific LLM evals (not foundation model benchmarks), see these posts: <a href="../../../blog/posts/evals/index.html" target="_blank">part 1</a>, <a href="../../../blog/posts/llm-judge/index.html" target="_blank">part 2</a> and <a href="../../../blog/posts/field-guide/index.html" target="_blank">part 3</a>. Otherwise, keep reading.</p>
<div class="grid">
<div class="g-col-4">
<p><a href="https://hamel.dev/evals" target="_blank"><img src="https://hamel.dev/blog/posts/evals/images/diagram-cover.png" class="img-fluid"></a></p>
<p><a href="https://hamel.dev/evals" target="_blank"><strong>Your AI Product Needs Eval (Evaluation Systems)</strong></a></p>
<p><strong>Contents:</strong></p>
<ol type="1">
<li>Motivation</li>
<li>Iterating Quickly == Success<br>
</li>
<li>Case Study: Lucy, A Real Estate AI Assistant</li>
<li>The Types Of Evaluation
<ol type="a">
<li>Level 1: Unit Tests</li>
<li>Level 2: Human &amp; Model Eval</li>
<li>Level 3: A/B Testing</li>
<li>Evaluating RAG</li>
</ol></li>
<li>Eval Systems Unlock Superpowers For Free
<ol type="a">
<li>Fine-Tuning</li>
<li>Data Synthesis &amp; Curation</li>
<li>Debugging</li>
</ol></li>
</ol>
</div>
<div class="g-col-4">
<p><a href="https://hamel.dev/llm-judge/" target="_blank"><img src="https://hamel.dev/blog/posts/llm-judge/images/cover_img.png" class="img-fluid"></a></p>
<p><a href="https://hamel.dev/llm-judge/" target="_blank"><strong>Creating a LLM-as-a-Judge That Drives Business Results</strong></a></p>
<p><strong>Contents:</strong></p>
<ol type="1">
<li>The Problem: AI Teams Are Drowning in Data</li>
<li>Step 1: Find The Principal Domain Expert</li>
<li>Step 2: Create a Dataset</li>
<li>Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques</li>
<li>Step 4: Fix Errors</li>
<li>Step 5: Build Your LLM as A Judge, Iteratively</li>
<li>Step 6: Perform Error Analysis</li>
<li>Step 7: Create More Specialized LLM Judges (if needed)</li>
<li>Recap of Critique Shadowing</li>
<li>Resources</li>
</ol>
</div>
<div class="g-col-4">
<p><a href="https://hamel.dev/field-guide" target="_blank"><img src="https://hamel.dev/blog/posts/field-guide/images/field_guide_2.png" class="img-fluid"></a></p>
<p><a href="https://hamel.dev/field-guide" target="_blank"><strong>A Field Guide to Rapidly Improving AI Products</strong></a></p>
<p><strong>Contents:</strong></p>
<ol type="1">
<li>How error analysis consistently reveals the highest-ROI improvements</li>
<li>Why a simple data viewer is your most important AI investment</li>
<li>How to empower domain experts (not just engineers) to improve your AI</li>
<li>Why synthetic data is more effective than you think</li>
<li>How to maintain trust in your evaluation system</li>
<li>Why your AI roadmap should count experiments, not features</li>
</ol>
</div>
</div>
<p><a href="../../../blog/posts/evals-faq/what-are-llm-evals.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-what-is-a-trace" class="level2">
<h2 class="anchored" data-anchor-id="q-what-is-a-trace">Q: What is a trace?</h2>
<p>A trace is the complete record of all actions, messages, tool calls, and data retrievals from a single initial user query through to the final response. It includes every step across all agents, tools, and system components in a session: multiple user messages, assistant responses, retrieved documents, and intermediate tool interactions.</p>
<p><strong>Note on terminology:</strong> Different observability vendors use varying definitions of traces and spans. <a href="https://mlops.systems/posts/2025-06-04-instrumenting-an-agentic-app-with-arize-phoenix-and-litellm.html#llm-tracing-tools-naming-conventions-june-2025">Alex Strick van Linschoten’s analysis</a> highlights these differences (screenshot below):</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/alex.jpeg" class="img-fluid figure-img"></p>
<figcaption>Vendor differences in trace definitions as of 2025-07-02</figcaption>
</figure>
</div>
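<p>Regardless of vendor terminology, a trace reduces to an ordered record of steps from the initial query to the final response. A minimal sketch (field names are illustrative, not any vendor's schema):</p>

```python
# Minimal sketch of a trace per the definition above; field names are
# illustrative, not any observability vendor's schema.
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str      # "user", "assistant", "tool_call", "retrieval"
    content: str

@dataclass
class Trace:
    session_id: str
    steps: list[Step] = field(default_factory=list)

trace = Trace("sess-42", [
    Step("user", "Where is my order?"),
    Step("tool_call", "lookup_order(id=981)"),
    Step("retrieval", "Order 981: shipped 2024-05-01"),
    Step("assistant", "Your order shipped on May 1."),
])
print(len(trace.steps))  # every step from query to final response
```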
<p><a href="../../../blog/posts/evals-faq/what-is-a-trace.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-whats-a-minimum-viable-evaluation-setup" class="level2">
<h2 class="anchored" data-anchor-id="q-whats-a-minimum-viable-evaluation-setup">Q: What’s a minimum viable evaluation setup?</h2>
<p>Start with error analysis, not infrastructure. Spend 30 minutes manually reviewing 20-50 LLM outputs whenever you make significant changes. Use one domain expert who understands your users as your quality decision maker (a “benevolent dictator”).</p>
<p>If possible, <strong>use notebooks</strong> to help you review traces and analyze data. In our opinion, this is the single most effective tool for evals because you can write arbitrary code, visualize data, and iterate quickly. You can even build your own custom annotation interface right inside notebooks, as shown in this <a href="https://youtu.be/aqKUwPKBkB0?si=5KDmMQnRzO_Ce9xH" target="_blank">video</a>.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/aqKUwPKBkB0" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
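<p>Before building any custom interface, the review loop itself can be a few lines in a notebook cell. A hypothetical minimal version (annotation values are hardcoded so the sketch runs end to end; in practice you would display each output and fill in the verdict and note by hand):</p>

```python
# Hypothetical notebook review loop: sample traces, then record a
# pass/fail verdict and an open-ended note for each one.
import random

traces = [{"id": i, "output": f"response {i}"} for i in range(200)]
random.seed(0)
sample = random.sample(traces, 20)  # 20-50 outputs per review session

annotations = []
for t in sample:
    # Real use: display t["output"], then fill these in by hand.
    annotations.append({"id": t["id"], "pass": True, "note": ""})
```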
<p><a href="../../../blog/posts/evals-faq/whats-a-minimum-viable-evaluation-setup.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-much-of-my-development-budget-should-i-allocate-to-evals" class="level2">
<h2 class="anchored" data-anchor-id="q-how-much-of-my-development-budget-should-i-allocate-to-evals">Q: How much of my development budget should I allocate to evals?</h2>
<p>It’s important to recognize that evaluation is part of the development process rather than a distinct line item, similar to how debugging is part of software development.</p>
<p>You should always be doing <a href="https://www.youtube.com/watch?v=qH1dZ8JLLdU" target="_blank">error analysis</a>. When you discover issues through error analysis, many will be straightforward bugs you’ll fix immediately. These fixes don’t require separate evaluation infrastructure as they’re just part of development.</p>
<p>The decision to build automated evaluators comes down to cost-benefit analysis. If you can catch an error with a simple assertion or regex check, the cost is minimal and probably worth it. But if you need to align an LLM-as-judge evaluator, consider whether the failure mode warrants that investment.</p>
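<p>For example, a cheap code-based evaluator can be a handful of deterministic rules (the rules below are illustrative, not from any particular project):</p>

```python
# Example of the cheap assertion/regex checks mentioned above
# (illustrative rules, not from any specific project).
import re

def check(response: str) -> list[str]:
    """Return failure labels caught by simple deterministic rules."""
    failures = []
    if re.search(r"\bAs an AI\b", response, re.IGNORECASE):
        failures.append("meta-disclaimer")
    if re.search(r"\$\d", response) and "estimate" not in response.lower():
        failures.append("unhedged-price")
    return failures

print(check("As an AI, I think that costs $50."))
# → ['meta-disclaimer', 'unhedged-price']
```

<p>Checks like these cost almost nothing to run on every trace, which is exactly why they clear the cost-benefit bar that an aligned LLM-as-judge may not.</p>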
<p>In the projects we’ve worked on, <strong>we’ve spent 60-80% of our development time on error analysis and evaluation</strong>. Expect most of your effort to go toward understanding failures (i.e.&nbsp;looking at data) rather than building automated checks.</p>
<p>Be <a href="https://ai-execs.com/2_intro.html#a-case-study-in-misleading-ai-advice" target="_blank">wary of optimizing for high eval pass rates</a>. If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate a more meaningful evaluation that’s actually stress-testing your application. Focus on evals that help you catch real issues, not ones that make your metrics look good.</p>
<p><a href="../../../blog/posts/evals-faq/how-much-of-my-development-budget-should-i-allocate-to-evals.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-will-todays-evaluation-methods-still-be-relevant-in-5-10-years-given-how-fast-ai-is-changing" class="level2">
<h2 class="anchored" data-anchor-id="q-will-todays-evaluation-methods-still-be-relevant-in-5-10-years-given-how-fast-ai-is-changing">Q: Will today’s evaluation methods still be relevant in 5-10 years given how fast AI is changing?</h2>
<p>Yes. Even with perfect models, you still need to verify they’re solving the right problem. The need for systematic error analysis, domain-specific testing, and monitoring will still be important.</p>
<p>Today’s prompt engineering tricks might become obsolete, but you’ll still need to understand failure modes. Additionally, an LLM cannot read your mind, and <a href="https://arxiv.org/abs/2404.12272" target="_blank">research shows</a> that people need to observe the LLM’s behavior in order to properly externalize their requirements.</p>
<p>For deeper perspective on this debate, see these two viewpoints: <a href="https://m.youtube.com/watch?si=qknrtQeITqJ7VsJH&amp;v=4dUFIRj-BWo&amp;feature=youtu.be" target="_blank">“The model is the product”</a> versus <a href="https://www.youtube.com/watch?v=EEw2PpL-_NM" target="_blank">“The model is NOT the product”</a>.</p>
<p><strong>“The model is the product”:</strong> </p><div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/4dUFIRj-BWo" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div><p></p>
<p><strong>“The model is NOT the product”:</strong> </p><div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/EEw2PpL-_NM" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div><p></p>
<p><a href="../../../blog/posts/evals-faq/will-these-evaluation-methods-still-be-relevant-in-5-10-years-given-how-fast-ai-is-changing.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-make-the-case-for-investing-in-evaluations-to-my-team" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-make-the-case-for-investing-in-evaluations-to-my-team">Q: How do I make the case for investing in evaluations to my team?</h2>
<p>Don’t try to sell your team on “evals”. Instead, show them what you find when you look at the data.</p>
<p>Start by doing the error analysis yourself. Look at 50 to 100 real user conversations and find the most common ways the product is failing. Use these findings to tell a story with data.</p>
<p>Present your team with:</p>
<ul>
<li>A list of the top failure modes you discovered.</li>
<li>Metrics showing how often high-impact errors are happening.</li>
<li>Surprising ways that users are interacting with the product.</li>
<li>Reports on the bugs you found and fixed, framed as “prevented production issues”.</li>
</ul>
<p>This approach builds trust. Don’t just show dashboards and metrics; tell the story of what you’re finding in the data. By narrating your findings, you teach the team what you’re learning, providing immediate value. When you fix an issue, show how the error rate for that specific problem went down. Soon, your team will see the progress and ask how you’re doing it. Let results instead of methods lead the conversation.</p>
<p>This is similar to classic machine learning projects, where outcomes are speculative and progress is bounded by <a href="https://hamel.dev/blog/posts/field-guide/#your-ai-roadmap-should-count-experiments-not-features" target="_blank">iterating on experiments</a>. In this situation, it’s important that you share the learnings from each experiment to show progress and encourage investment.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-make-the-case-for-investing-in-evaluations-to-my-team.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="error-analysis-data-collection" class="level1">
<h1>Error Analysis &amp; Data Collection</h1>
<section id="q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed" class="level2">
<h2 class="anchored" data-anchor-id="q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed">Q: Why is "error analysis" so important in LLM evals, and how is it performed?</h2>
<p>Error analysis is <strong>the most important activity in evals</strong>. Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data. The process involves:</p>
<section id="creating-a-dataset" class="level3">
<h3 class="anchored" data-anchor-id="creating-a-dataset">1. Creating a Dataset</h3>
<p>Gathering representative traces of user interactions with the LLM. If you do not have any data, you can generate synthetic data to get started.</p>
</section>
<section id="open-coding" class="level3">
<h3 class="anchored" data-anchor-id="open-coding">2. Open Coding</h3>
<p>Human annotator(s) (ideally a benevolent dictator) review and write open-ended notes about traces, noting any issues. This process is akin to “journaling” and is adapted from qualitative research methodologies. When beginning, it is recommended to focus on noting the first failure observed in a trace, as upstream errors can cause downstream issues, though you can also tag all independent failures if feasible. A <a href="https://hamel.dev/blog/posts/llm-judge/#step-1-find-the-principal-domain-expert" target="_blank">domain expert</a> should be performing this step.</p>
</section>
<section id="axial-coding" class="level3">
<h3 class="anchored" data-anchor-id="axial-coding">3. Axial Coding</h3>
<p>Categorize the open-ended notes into a “failure taxonomy.” In other words, group similar failures into distinct categories. This is the most important step. At the end, count the number of failures in each category. You can use an LLM to help with this step.</p>
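<p>The counting at the end of this step can be as simple as a frequency table over the taxonomy labels (the category names below are made up for illustration):</p>

```python
# Sketch of the counting step: open-coding notes already mapped to
# taxonomy categories (category names are made up for illustration).
from collections import Counter

labels = ["wrong-policy-fact", "fabricated-action", "wrong-policy-fact",
          "ignored-context", "wrong-policy-fact", "fabricated-action"]

taxonomy = Counter(labels)
for category, n in taxonomy.most_common():
    print(category, n)
# → wrong-policy-fact 3, fabricated-action 2, ignored-context 1
```

<p>The resulting counts tell you which failure category to attack first.</p>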
</section>
<section id="iterative-refinement" class="level3">
<h3 class="anchored" data-anchor-id="iterative-refinement">4. Iterative Refinement</h3>
<p>Keep iterating on more traces until you reach <a href="https://delvetool.com/blog/theoreticalsaturation" target="_blank">theoretical saturation</a>, meaning new traces do not seem to reveal new failure modes or information to you. As a rule of thumb, you should aim to review at least 100 traces.</p>
<p>You should frequently revisit this process. There are advanced ways to <a href="how-can-i-efficiently-sample-production-traces-for-review.html" target="_blank">sample data more efficiently</a>, like clustering, sorting by user feedback, and sorting by high probability failure patterns. Over time, you’ll develop a “nose” for where to look for failures in your data.</p>
<p>Do not skip error analysis. It ensures that the evaluation metrics you develop are supported by real application behaviors instead of counter-productive generic metrics (which most platforms nudge you to use). For examples of how error analysis can be helpful, see <a href="https://www.youtube.com/watch?v=e2i6JbU2R-s" target="_blank">this video</a>, or this <a href="https://hamel.dev/blog/posts/field-guide/" target="_blank">blog post</a>.</p>
<p>Here is a visualization of the error analysis process by one of our students, <a href="https://www.linkedin.com/in/pawel-huryn/" target="_blank">Pawel Huryn</a> - including how it fits into the overall evaluation process:</p>
<p><img src="https://hamel.dev/blog/posts/evals-faq/pawel-error-analysis.png" class="img-fluid"></p>
<p><a href="../../../blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-how-do-i-surface-problematic-traces-for-review-beyond-user-feedback" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-surface-problematic-traces-for-review-beyond-user-feedback">Q: How do I surface problematic traces for review beyond user feedback?</h2>
<p>While user feedback is a good way to narrow in on problematic traces, other methods are also useful. Here are three complementary approaches:</p>
<section id="start-with-random-sampling" class="level3">
<h3 class="anchored" data-anchor-id="start-with-random-sampling">Start with random sampling</h3>
<p>The simplest approach is reviewing a random sample of traces. If you find few issues, escalate to stress testing: create queries that deliberately test your prompt constraints to see if the AI follows your rules.</p>
</section>
<section id="use-evals-for-initial-screening" class="level3">
<h3 class="anchored" data-anchor-id="use-evals-for-initial-screening">Use evals for initial screening</h3>
<p>Use existing evals to find problematic traces and potential issues. Once you’ve identified these, you can proceed with the typical evaluation process starting with error analysis.</p>
</section>
<section id="leverage-efficient-sampling-strategies" class="level3">
<h3 class="anchored" data-anchor-id="leverage-efficient-sampling-strategies">Leverage efficient sampling strategies</h3>
<p>For more sophisticated trace discovery, use outlier detection, metric-based sorting, and stratified sampling to find interesting traces. Generic metrics can serve as exploration signals to identify traces worth reviewing, even if they don’t directly measure quality.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-surface-problematic-traces-for-review-beyond-user-feedback.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-how-often-should-i-re-run-error-analysis-on-my-production-system" class="level2">
<h2 class="anchored" data-anchor-id="q-how-often-should-i-re-run-error-analysis-on-my-production-system">Q: How often should I re-run error analysis on my production system?</h2>
<p>Re-run error analysis when making significant changes: new features, prompt updates, model switches, or major bug fixes. A useful heuristic is to set a goal for reviewing <em>at least</em> 100+ fresh traces each review cycle. Typical review cycles we’ve seen range from 2-4 weeks. See this FAQ on how to sample traces effectively.</p>
<p>Between major analyses, review 10-20 traces weekly, focusing on outliers: unusually long conversations, sessions with multiple retries, or traces flagged by automated monitoring. Adjust frequency based on system stability and usage growth. New systems need weekly analysis until failure patterns stabilize. Mature systems might need only monthly analysis unless usage patterns change. Always analyze after incidents, user complaint spikes, or metric drift. Scaling usage introduces new edge cases.</p>
<p><a href="../../../blog/posts/evals-faq/how-often-should-i-re-run-error-analysis-on-my-production-system.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-what-is-the-best-approach-for-generating-synthetic-data" class="level2">
<h2 class="anchored" data-anchor-id="q-what-is-the-best-approach-for-generating-synthetic-data">Q: What is the best approach for generating synthetic data?</h2>
<p>A common mistake is prompting an LLM to <code>"give me test queries"</code> without structure, resulting in generic, repetitive outputs. A structured approach using dimensions produces far better synthetic data for testing LLM applications.</p>
<p><strong>Start by defining dimensions</strong>: categories that describe different aspects of user queries. Each dimension captures one type of variation in user behavior. For example:</p>
<ul>
<li>For a recipe app, dimensions might include Dietary Restriction (<em>vegan</em>, <em>gluten-free</em>, <em>none</em>), Cuisine Type (<em>Italian</em>, <em>Asian</em>, <em>comfort food</em>), and Query Complexity (<em>simple request</em>, <em>multi-step</em>, <em>edge case</em>).</li>
<li>For a customer support bot, dimensions could be Issue Type (<em>billing</em>, <em>technical</em>, <em>general</em>), Customer Mood (<em>frustrated</em>, <em>neutral</em>, <em>happy</em>), and Prior Context (<em>new issue</em>, <em>follow-up</em>, <em>resolved</em>).</li>
</ul>
<p><strong>Start with failure hypotheses</strong>. If you lack intuition about failure modes, use your application extensively or recruit friends to use it. Then choose dimensions targeting those likely failures.</p>
<p><strong>Create tuples manually first</strong>: Write 20 tuples by hand—specific combinations selecting one value from each dimension. Example: (<em>Vegan</em>, <em>Italian</em>, <em>Multi-step</em>). This manual work helps you understand your problem space.</p>
<p><strong>Scale with two-step generation</strong>:</p>
<ol type="1">
<li><strong>Generate structured tuples</strong>: Have the LLM create more combinations like (<em>Gluten-free</em>, <em>Asian</em>, <em>Simple</em>)</li>
<li><strong>Convert tuples to queries</strong>: In a separate prompt, transform each tuple into natural language</li>
</ol>
<p>This separation avoids repetitive phrasing. The (<em>Vegan</em>, <em>Italian</em>, <em>Multi-step</em>) tuple becomes: <code>"I need a dairy-free lasagna recipe that I can prep the day before."</code></p>
<section id="generation-approaches" class="level3">
<h3 class="anchored" data-anchor-id="generation-approaches">Generation approaches</h3>
<p>You can generate tuples two ways:</p>
<p><strong>Cross product then filter</strong>: Generate all dimension combinations, then filter with an LLM. Guarantees coverage including edge cases. Use when most combinations are valid.</p>
<p><strong>Direct LLM generation</strong>: Ask the LLM to generate tuples directly. More realistic but tends toward generic outputs and misses rare scenarios. Use when many dimension combinations are invalid.</p>
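<p>The cross-product approach can be sketched as follows. The dimensions, the filtering rule, and the prompt template are illustrative assumptions; in practice the filter and the tuple-to-query conversion in step 2 would each be an LLM call:</p>

```python
from itertools import product

# Hypothetical dimensions for a recipe app.
dimensions = {
    "dietary": ["vegan", "gluten-free", "none"],
    "cuisine": ["Italian", "Asian", "comfort food"],
    "complexity": ["simple", "multi-step", "edge case"],
}

# Step 1: generate all structured tuples via the cross product.
tuples = list(product(*dimensions.values()))  # 3 x 3 x 3 = 27 combinations

# Filter out invalid combinations (here a toy rule; an LLM could do this).
valid = [t for t in tuples if t != ("none", "comfort food", "edge case")]

# Step 2: convert each tuple into a natural-language query.
# In practice this prompt would be sent to an LLM; here we just build it.
def query_prompt(dietary, cuisine, complexity):
    return (
        f"Write a realistic user query for a recipe app: "
        f"dietary restriction={dietary}, cuisine={cuisine}, "
        f"complexity={complexity}. Vary the phrasing."
    )

prompts = [query_prompt(*t) for t in valid]
```

Keeping tuple generation and query phrasing in separate steps is what prevents the repetitive, template-sounding outputs you get from a single "give me test queries" prompt.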
<p><strong>Fix obvious problems first</strong>: Don’t generate synthetic data for issues you can fix immediately. If your prompt doesn’t mention dietary restrictions, fix the prompt rather than generating specialized test queries.</p>
<p>After iterating on your tuples and prompts, <strong>run these synthetic queries through your actual system to capture full traces</strong>. Sample 100 traces for error analysis. This number provides enough traces to manually review and identify failure patterns without being overwhelming.</p>
<p><a href="../../../blog/posts/evals-faq/what-is-the-best-approach-for-generating-synthetic-data.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-are-there-scenarios-where-synthetic-data-may-not-be-reliable" class="level2">
<h2 class="anchored" data-anchor-id="q-are-there-scenarios-where-synthetic-data-may-not-be-reliable">Q: Are there scenarios where synthetic data may not be reliable?</h2>
<p>Yes: synthetic data can mislead or mask issues. For guidance on generating synthetic data when appropriate, see “What is the best approach for generating synthetic data?”</p>
<p>Common scenarios where synthetic data fails:</p>
<ol type="1">
<li><p><strong>Complex domain-specific content</strong>: LLMs often miss the structure, nuance, or quirks of specialized documents (e.g., legal filings, medical records, technical forms). Without real examples, critical edge cases are missed.</p></li>
<li><p><strong>Low-resource languages or dialects</strong>: For low-resource languages or dialects, LLM-generated samples are often unrealistic. Evaluations based on them won’t reflect actual performance.</p></li>
<li><p><strong>When validation is impossible</strong>: If you can’t verify synthetic sample realism (due to domain complexity or lack of ground truth), real data is important for accurate evaluation.</p></li>
<li><p><strong>High-stakes domains</strong>: In high-stakes domains (medicine, law, emergency response), synthetic data often lacks subtlety and edge cases. Errors here have serious consequences, and manual validation is difficult.</p></li>
<li><p><strong>Underrepresented user groups</strong>: For underrepresented user groups, LLMs may misrepresent context, values, or challenges. Synthetic data can reinforce biases in the training data of the LLM.</p></li>
</ol>
<p><a href="../../../blog/posts/evals-faq/are-there-scenarios-where-synthetic-data-may-not-be-reliable.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-approach-evaluation-when-my-system-handles-diverse-user-queries" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-approach-evaluation-when-my-system-handles-diverse-user-queries">Q: How do I approach evaluation when my system handles diverse user queries?</h2>
<blockquote class="blockquote">
<p>Complex applications often support vastly different query patterns—from “What’s the return policy?” to “Compare pricing trends across regions for products matching these criteria.” Each query type exercises different system capabilities, leading to confusion on how to design eval criteria.</p>
</blockquote>
<p><strong><em><a href="https://youtu.be/e2i6JbU2R-s?si=8p5XVxbBiioz69Xc" target="_blank">Error Analysis</a> is all you need.</em></strong> Your evaluation strategy should emerge from observed failure patterns (i.e., error analysis), not predetermined query classifications. Rather than creating a massive evaluation matrix covering every query type you can imagine, let your system’s actual behavior guide where you invest evaluation effort.</p>
<p>During error analysis, you’ll likely discover that certain query categories share failure patterns. For instance, all queries requiring temporal reasoning might struggle regardless of whether they’re simple lookups or complex aggregations. Similarly, queries that need to combine information from multiple sources might fail in consistent ways. These patterns discovered through error analysis should drive your evaluation priorities. It could be that query category is a fine way to group failures, but you don’t know that until you’ve analyzed your data.</p>
<p>To see an example of basic error analysis in action, <a href="https://youtu.be/e2i6JbU2R-s?si=8p5XVxbBiioz69Xc" target="_blank">see this video</a>.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/e2i6JbU2R-s" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><a href="../../../blog/posts/evals-faq/how-do-i-approach-evaluation-when-my-system-handles-diverse-user-queries.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-can-i-efficiently-sample-production-traces-for-review" class="level2">
<h2 class="anchored" data-anchor-id="q-how-can-i-efficiently-sample-production-traces-for-review">Q: How can I efficiently sample production traces for review?</h2>
<p>It can be cumbersome to review traces randomly, especially when most traces don’t have an error. These sampling strategies help you find traces more likely to reveal problems:</p>
<ul>
<li><strong>Outlier detection:</strong> Sort by any metric (response length, latency, tool calls) and review extremes.</li>
<li><strong>User feedback signals:</strong> Prioritize traces with negative feedback, support tickets, or escalations.</li>
<li><strong>Metric-based sorting:</strong> Generic metrics can serve as exploration signals to find interesting traces. Review both high and low scores and treat them as exploration clues. Based on what you learn, you can build custom evaluators for the failure modes you find.</li>
<li><strong>Stratified sampling:</strong> Group traces by key dimensions (user type, feature, query category) and sample from each group.</li>
<li><strong>Embedding clustering:</strong> Generate embeddings of queries and cluster them to reveal natural groupings. Sample proportionally from each cluster, but oversample small clusters for edge cases. There’s no right answer for clustering—it’s an exploration technique to surface patterns you might miss manually.</li>
</ul>
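<p>The outlier-detection tactic is simple enough to sketch. Here, traces are hypothetical dicts with a couple of metrics (in practice they would be exported from your observability tool); sorting by a metric and taking both extremes yields a small review queue:</p>

```python
import random

# Hypothetical trace records; real ones come from your tracing backend.
traces = [
    {"id": i,
     "latency_s": random.uniform(0.2, 30),
     "n_tool_calls": random.randint(0, 12)}
    for i in range(500)
]

def outliers(traces, metric, k=10):
    """Return the k lowest and k highest traces by a given metric."""
    ranked = sorted(traces, key=lambda t: t[metric])
    return ranked[:k] + ranked[-k:]

# 20 traces to review: the fastest and slowest responses.
review_queue = outliers(traces, "latency_s", k=10)
```

The same function works for any numeric column (response length, tool calls, retries), which is why outlier sorting is usually the cheapest strategy to try first.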
<p>As you get more sophisticated with how you sample, you can incorporate these tactics into the design of your annotation tools.</p>
<p><a href="../../../blog/posts/evals-faq/how-can-i-efficiently-sample-production-traces-for-review.html" class="faq-individual-link">↗ Focus view</a></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>
</section>
</section>
<section id="evaluation-design-methodology" class="level1">
<h1>Evaluation Design &amp; Methodology</h1>
<section id="q-why-do-you-recommend-binary-passfail-evaluations-instead-of-1-5-ratings-likert-scales" class="level2">
<h2 class="anchored" data-anchor-id="q-why-do-you-recommend-binary-passfail-evaluations-instead-of-1-5-ratings-likert-scales">Q: Why do you recommend binary (pass/fail) evaluations instead of 1-5 ratings (Likert scales)?</h2>
<blockquote class="blockquote">
<p>Engineers often believe that Likert scales (1-5 ratings) provide more information than binary evaluations, allowing them to track gradual improvements. However, this added complexity often creates more problems than it solves in practice.</p>
</blockquote>
<p>Binary evaluations force clearer thinking and more consistent labeling. Likert scales introduce significant challenges: the difference between adjacent points (like 3 vs 4) is subjective and inconsistent across annotators, detecting statistical differences requires larger sample sizes, and annotators often default to middle values to avoid making hard decisions.</p>
<p>Having binary options forces people to make a decision rather than hiding uncertainty in middle values. Binary decisions are also faster to make during error analysis - you don’t waste time debating whether something is a 3 or 4.</p>
<p>For tracking gradual improvements, consider measuring specific sub-components with their own binary checks rather than using a scale. For example, instead of rating factual accuracy 1-5, you could track “4 out of 5 expected facts included” as separate binary checks. This preserves the ability to measure progress while maintaining clear, objective criteria.</p>
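<p>The "separate binary checks" idea can be sketched as simple assertions. The expected facts and the response below are made up for illustration, and plain substring matching stands in for whatever check (regex, LLM-as-Judge) each fact actually needs:</p>

```python
# Hypothetical expected facts for one test case.
expected_facts = [
    "30-day return window",
    "original receipt",
    "original payment method",
    "store credit",
    "final-sale items",
]

response = (
    "Returns are accepted within a 30-day return window with the "
    "original receipt required; refunds go to the original payment "
    "method, and gift returns receive store credit."
)

# Each fact is its own binary pass/fail check.
checks = {fact: fact.lower() in response.lower() for fact in expected_facts}
passed = sum(checks.values())
print(f"{passed} out of {len(checks)} expected facts included")  # 4 out of 5
```

Each check stays objective and auditable, while the pass count still gives you a graded signal to track over time.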
<p>Start with binary labels to understand what ‘bad’ looks like. Numeric labels are advanced and usually not necessary.</p>
<p><a href="../../../blog/posts/evals-faq/why-do-you-recommend-binary-passfail-evaluations-instead-of-1-5-ratings-likert-scales.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-i-practice-eval-driven-development" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-practice-eval-driven-development">Q: Should I practice eval-driven development?</h2>
<p><strong>Generally no.</strong> Eval-driven development (writing evaluators before implementing features) sounds appealing but creates more problems than it solves. Unlike traditional software where failure modes are predictable, LLMs have infinite surface area for potential failures. You can’t anticipate what will break.</p>
<p>A better approach is to start with error analysis. Write evaluators for errors you discover, not errors you imagine. This avoids getting blocked on what to evaluate and prevents wasted effort on metrics that have no impact on actual system quality.</p>
<p><strong>Exception:</strong> Eval-driven development may work for specific constraints where you know exactly what success looks like. If adding “never mention competitors,” writing that evaluator early may be acceptable.</p>
<p>Most importantly, always do a cost-benefit analysis before implementing an eval. Ask whether the failure mode justifies the investment. Error analysis reveals which failures actually matter for your users.</p>
<p><a href="../../../blog/posts/evals-faq/should-i-practice-eval-driven-development.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-i-build-automated-evaluators-for-every-failure-mode-i-find" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-build-automated-evaluators-for-every-failure-mode-i-find">Q: Should I build automated evaluators for every failure mode I find?</h2>
<p>Focus automated evaluators on failures that persist after fixing your prompts. Many teams discover their LLM doesn’t meet preferences they never actually specified - like wanting short responses, specific formatting, or step-by-step reasoning. Fix these obvious gaps first before building complex evaluation infrastructure.</p>
<p>Consider the cost hierarchy of different evaluator types. Simple assertions and reference-based checks (comparing against known correct answers) are cheap to build and maintain. LLM-as-Judge evaluators require 100+ labeled examples, ongoing weekly maintenance, and coordination between developers, PMs, and domain experts. This cost difference should shape your evaluation strategy.</p>
<p>Only build expensive evaluators for problems you’ll iterate on repeatedly. Since LLM-as-Judge comes with significant overhead, save it for persistent generalization failures - not issues you can fix trivially. Start with cheap code-based checks where possible: regex patterns, structural validation, or execution tests. Reserve complex evaluation for subjective qualities that can’t be captured by simple rules.</p>
<p><a href="../../../blog/posts/evals-faq/should-i-build-automated-evaluators-for-every-failure-mode-i-find.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-i-use-ready-to-use-evaluation-metrics" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-use-ready-to-use-evaluation-metrics">Q: Should I use "ready-to-use" evaluation metrics?</h2>
<p><strong>No.&nbsp;Generic evaluations waste time and create false confidence.</strong> (Unless you’re using them for exploration).</p>
<p>One instructor noted:</p>
<blockquote class="blockquote">
<p>“All you get from using these prefab evals is you don’t know what they actually do and in the best case they waste your time and in the worst case they create an illusion of confidence that is unjustified.”<sup>1</sup></p>
</blockquote>
<p>Generic evaluation metrics are everywhere. Eval libraries contain scores like helpfulness, coherence, quality, etc. promising easy evaluation. These metrics measure abstract qualities that may not matter for your use case. Good scores on them don’t mean your system works.</p>
<p>Instead, conduct error analysis to understand failures. Define binary failure modes based on real problems. Create custom evaluators for those failures and validate them against human judgment. Essentially, the entire evals process.</p>
<p>Experienced practitioners may still use these metrics, just not how you’d expect. As Picasso said: “Learn the rules like a pro, so you can break them like an artist.” Once you understand why generic metrics fail as evaluations, you can repurpose them as exploration tools to find interesting traces (explained in the next FAQ).</p>
<p><a href="../../../blog/posts/evals-faq/should-i-use-ready-to-use-evaluation-metrics.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-are-similarity-metrics-bertscore-rouge-etc.-useful-for-evaluating-llm-outputs" class="level2">
<h2 class="anchored" data-anchor-id="q-are-similarity-metrics-bertscore-rouge-etc.-useful-for-evaluating-llm-outputs">Q: Are similarity metrics (BERTScore, ROUGE, etc.) useful for evaluating LLM outputs?</h2>
<p>Generic metrics like BERTScore, ROUGE, cosine similarity, etc. are not useful for evaluating LLM outputs in most AI applications. Instead, we recommend using error analysis to identify metrics specific to your application’s behavior, and designing binary pass/fail evals (using LLM-as-Judge) or code-based assertions.</p>
<p>As an example, consider a real estate CRM assistant. Suggesting showings that aren’t available (can be tested with an assertion) or confusing client personas (can be tested with an LLM-as-Judge) is problematic. Generic metrics like similarity or verbosity won’t catch this. A relevant quote from the course:</p>
<blockquote class="blockquote">
<p>“The abuse of generic metrics is endemic. Many eval vendors promote off the shelf metrics, which ensnare engineers into superfluous tasks.”</p>
</blockquote>
<p>Similarity metrics aren’t always useless. They have utility in domains like search and recommendation (and therefore can be useful for optimizing and debugging retrieval for RAG). For example, cosine similarity between embeddings can measure semantic closeness in retrieval systems, and average pairwise similarity can assess output diversity (where lower similarity indicates higher diversity).</p>
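<p>The diversity use case can be sketched with average pairwise cosine similarity. The toy 3-dimensional vectors below stand in for real embeddings from an embedding model:</p>

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def avg_pairwise_similarity(embeddings):
    """Lower average similarity indicates a more diverse set of outputs."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Toy "embeddings": an orthogonal (diverse) set vs. a near-identical set.
diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
similar = [[1, 0, 0], [0.9, 0.1, 0], [0.95, 0.05, 0]]
assert avg_pairwise_similarity(diverse) < avg_pairwise_similarity(similar)
```

Note this measures closeness, not quality: a set of outputs can be maximally diverse and still all wrong, which is exactly why these metrics fail as evals but work as exploration signals.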
<p><a href="../../../blog/posts/evals-faq/are-similarity-metrics-bertscore-rouge-etc-useful-for-evaluating-llm-outputs.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-can-i-use-the-same-model-for-both-the-main-task-and-evaluation" class="level2">
<h2 class="anchored" data-anchor-id="q-can-i-use-the-same-model-for-both-the-main-task-and-evaluation">Q: Can I use the same model for both the main task and evaluation?</h2>
<p>For LLM-as-Judge selection, using the same model is usually fine because the judge is doing a different task than your main LLM pipeline. While <a href="https://arxiv.org/pdf/2508.06709">research has shown</a> that models can exhibit bias when evaluating their own outputs, what ultimately matters is how well your judge aligns with human judgments. The judges we recommend building do scoped binary classification tasks. We’ve found that iterative alignment with human labels is usually achievable on this constrained task.</p>
<p>Focus on achieving a high True Positive Rate (TPR) and True Negative Rate (TNR) with your judge on a held-out labeled test set. If you struggle to achieve good alignment with human scores, consider trying a different model. However, onboarding new model providers may involve non-trivial effort in some organizations, which is why we don’t advocate for using different models by default unless there’s a specific alignment issue.</p>
<p>When selecting judge models, start with the most capable models available to establish strong alignment with human judgments. You can optimize for cost later once you’ve established reliable evaluation criteria.</p>
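<p>Computing TPR and TNR is straightforward once you have human labels paired with judge predictions on the held-out set. The labels below are hypothetical:</p>

```python
# Hypothetical human labels vs. judge predictions (True = pass).
human = [True, True, False, True, False, False, True, False]
judge = [True, True, False, False, False, True, True, False]

tp = sum(h and j for h, j in zip(human, judge))
tn = sum((not h) and (not j) for h, j in zip(human, judge))
fp = sum((not h) and j for h, j in zip(human, judge))
fn = sum(h and (not j) for h, j in zip(human, judge))

tpr = tp / (tp + fn)  # of human "pass" labels, how many the judge agreed on
tnr = tn / (tn + fp)  # of human "fail" labels, how many the judge caught
print(f"TPR={tpr:.2f}, TNR={tnr:.2f}")
```

Tracking both rates separately matters because a judge that passes everything scores a perfect TPR while being useless; TNR is what tells you it actually catches failures.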
<p><a href="../../../blog/posts/evals-faq/can-i-use-the-same-model-for-both-the-main-task-and-evaluation.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-we-evaluate-a-models-ability-to-express-uncertainty-or-know-what-it-doesnt-know" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-we-evaluate-a-models-ability-to-express-uncertainty-or-know-what-it-doesnt-know">Q: How do we evaluate a model’s ability to express uncertainty or "know what it doesn’t know"?</h2>
<p>Many applications require a model that can refuse to answer a question when it lacks sufficient information. To evaluate whether this refusal behavior is well-calibrated, you need to test if the model refuses at the appropriate times without refusing to answer questions it <em>should</em> be able to answer.</p>
<p>To do this effectively, you should construct an evaluation set that has the following components:</p>
<ol type="1">
<li><strong>Answerable Questions:</strong> Scenarios where a correct, verifiable answer is present in the model’s provided context or general knowledge.</li>
<li><strong>Unanswerable Questions:</strong> Scenarios designed to tempt the model to hallucinate. These include questions with false premises, queries about information explicitly missing from context, or topics far outside its knowledge base.</li>
</ol>
<p>While the exact proportion isn’t critical, a balanced set with a roughly equal number of answerable and unanswerable questions is a good starting point. The diversity and difficulty of the questions are more important than the precise ratio.</p>
<p>The evaluation itself is a binary (Pass/Fail) check of the model’s judgment. A “Pass” requires the model to satisfy two conditions: it must answer the answerable questions while also refusing to answer the unanswerable ones. A failure is defined as providing a fabricated answer to an unanswerable question, which indicates poor calibration.</p>
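<p>The pass/fail logic above can be sketched in a few lines, assuming you have already recorded whether each response was a refusal (e.g. via a judge or keyword check) and, for answerable questions, whether it was correct. The records are hypothetical:</p>

```python
# Hypothetical eval records for an abstention test set.
records = [
    {"answerable": True,  "refused": False, "correct": True},   # pass
    {"answerable": True,  "refused": True,  "correct": None},   # fail: over-refusal
    {"answerable": False, "refused": True,  "correct": None},   # pass
    {"answerable": False, "refused": False, "correct": None},   # fail: fabrication
]

def passes(r):
    if r["answerable"]:
        # Must answer, and answer correctly.
        return (not r["refused"]) and bool(r["correct"])
    # Unanswerable: must refuse.
    return r["refused"]

results = [passes(r) for r in records]
pass_rate = sum(results) / len(results)
```

Splitting the failures by type (over-refusal vs. fabrication) is worth doing too, since the fixes for each are usually different.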
<p>In the research literature, this capability is known as “Abstention Ability.” To improve this behavior, it is worth <a href="https://arxiv.org/search/?query=Abstention+Ability&amp;searchtype=all">searching for this term on Arxiv</a> to understand the latest techniques.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-we-evaluate-a-models-ability-to-express-uncertainty-or-know-what-it-doesnt-know.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="human-annotation-process" class="level1">
<h1>Human Annotation &amp; Process</h1>
<section id="q-how-many-people-should-annotate-my-llm-outputs" class="level2">
<h2 class="anchored" data-anchor-id="q-how-many-people-should-annotate-my-llm-outputs">Q: How many people should annotate my LLM outputs?</h2>
<p>For most small to medium-sized companies, appointing a single domain expert as a “benevolent dictator” is the most effective approach. This person—whether it’s a psychologist for a mental health chatbot, a lawyer for legal document analysis, or a customer service director for support automation—becomes the definitive voice on quality standards.</p>
<p>A single expert eliminates annotation conflicts and prevents the paralysis that comes from “too many cooks in the kitchen”. The benevolent dictator can incorporate input and feedback from others, but they drive the process. If you feel like you need five subject matter experts to judge a single interaction, it’s a sign your product scope might be too broad.</p>
<p>However, larger organizations or those operating across multiple domains (like a multinational company with different cultural contexts) may need multiple annotators. When you do use multiple people, you’ll need to measure their agreement using metrics like Cohen’s Kappa, which accounts for agreement beyond chance. However, use your judgment. Even in larger companies, a single expert is often enough.</p>
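<p>For two annotators with binary labels, Cohen’s Kappa can be computed in a few lines with no dependencies. The labels below are hypothetical:</p>

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement for two annotators' binary (0/1) labels."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each annotator's base rates.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

ann1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
ann2 = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
kappa = cohens_kappa(ann1, ann2)  # ~0.58 on these labels
```

Raw percent agreement (80% here) overstates alignment because two annotators who both label "pass" most of the time agree often by chance alone; Kappa corrects for that.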
<p>Start with a benevolent dictator whenever feasible. Only add complexity when your domain demands it.</p>
<p><a href="../../../blog/posts/evals-faq/how-many-people-should-annotate-my-llm-outputs.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-product-managers-and-engineers-collaborate-on-error-analysis-how" class="level2">
<h2 class="anchored" data-anchor-id="q-should-product-managers-and-engineers-collaborate-on-error-analysis-how">Q: Should product managers and engineers collaborate on error analysis? How?</h2>
<p>At the outset, collaborate to establish shared context. Engineers catch technical issues like retrieval issues and tool errors. PMs identify product failures like unmet user expectations, confusing responses, or missing features users expect.</p>
<p>As time goes on you should lean towards a benevolent dictator for error analysis: a domain expert or PM who understands user needs. Empower domain experts to evaluate actual outcomes rather than technical implementation. Ask “Has an appointment been made?” not “Did the tool call succeed?” The best way to empower the domain expert is to give them custom annotation tools that display system outcomes alongside traces. Show the confirmation, generated email, or database update that validates goal completion. Keep all context on one screen so non-technical reviewers focus on results.</p>
<p><a href="../../../blog/posts/evals-faq/should-product-managers-and-engineers-collaborate-on-error-analysis-how.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-should-i-outsource-annotation-labeling-to-a-third-party" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-outsource-annotation-labeling-to-a-third-party">Q: Should I outsource annotation &amp; labeling to a third party?</h2>
<p>Outsourcing error analysis is usually a big mistake (with some exceptions). The core of evaluation is building the product intuition that only comes from systematically analyzing your system’s failures. You should be extremely skeptical of this process being delegated.</p>
<section id="the-dangers-of-outsourcing" class="level3">
<h3 class="anchored" data-anchor-id="the-dangers-of-outsourcing"><strong>The Dangers of Outsourcing</strong></h3>
<p>When you outsource annotation, you often break the feedback loop between observing a failure and understanding how to improve the product. Problems with outsourcing include:</p>
<ul>
<li>Superficial Labeling: Even well-defined metrics require nuanced judgment that external teams lack. A critical misstep in error analysis is excluding domain experts from the labeling process. Outsourcing this task to those without domain expertise, like general developers or IT staff, often leads to superficial or incorrect labeling.<br>
</li>
<li>Loss of Unspoken Knowledge: A principal domain expert possesses tacit knowledge and user understanding that cannot be fully captured in a rubric. Involving these experts helps uncover their preferences and expectations, which they might not be able to fully articulate upfront.<br>
</li>
<li>Annotation Conflicts and Misalignment: Without a shared context, external annotators can create more disagreement than they resolve. Achieving alignment is a challenge even for internal teams, which means you will spend even more time on this process.</li>
</ul>
</section>
<section id="the-recommended-approach-build-internal-capability" class="level3">
<h3 class="anchored" data-anchor-id="the-recommended-approach-build-internal-capability"><strong>The Recommended Approach: Build Internal Capability</strong></h3>
<p>Instead of outsourcing, focus on building an efficient internal evaluation process.</p>
<p>1. Appoint a “Benevolent Dictator”. For most teams, the most effective strategy is to appoint a single, internal domain expert as the final decision-maker on quality. This individual sets the standard, ensures consistency, and develops a sense of ownership.</p>
<p>2. Use a collaborative workflow for multiple annotators. If multiple annotators are necessary, follow a structured process to ensure alignment:</p>
<ul>
<li>Draft an initial rubric with clear Pass/Fail definitions and examples.</li>
<li>Have each annotator label a shared set of traces independently to surface differences in interpretation.</li>
<li>Measure Inter-Annotator Agreement (IAA) using a chance-corrected metric like Cohen’s Kappa.</li>
<li>Facilitate alignment sessions to discuss disagreements and refine the rubric.</li>
<li>Iterate on this process until agreement is consistently high.</li>
</ul>
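<p>The IAA measurement step can be done with a few lines of code. Below is a minimal sketch of Cohen’s Kappa for two annotators; the Pass/Fail labels are illustrative:</p>

```python
from collections import Counter

# Minimal two-annotator Cohen's Kappa (chance-corrected agreement).
def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 2))  # → 0.67
```

<p>A Kappa near 1.0 indicates strong agreement beyond chance; values below ~0.6 usually mean the rubric needs another alignment session.</p>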
</section>
<section id="how-to-handle-capacity-constraints" class="level3">
<h3 class="anchored" data-anchor-id="how-to-handle-capacity-constraints"><strong>How to Handle Capacity Constraints</strong></h3>
<p>Building internal capacity does not mean you have to label every trace. Use these strategies to manage the workload:</p>
<ul>
<li>Smart Sampling: Review a small, representative sample of traces thoroughly. It is more effective to analyze 100 diverse traces to find patterns than to superficially label thousands.<br>
</li>
<li>The “Think-Aloud” Protocol: To make the most of limited expert time, use this technique from usability testing. Ask an expert to verbalize their thought process while reviewing a handful of traces. This method can uncover deep insights in a single one-hour session.<br>
</li>
<li>Build Lightweight Custom Tools: Build custom annotation tools to streamline the review process, increasing throughput.</li>
</ul>
</section>
<section id="exceptions-for-external-help" class="level3">
<h3 class="anchored" data-anchor-id="exceptions-for-external-help"><strong>Exceptions for External Help</strong></h3>
<p>While outsourcing the core error analysis process is not recommended, there are some scenarios where external help is appropriate:</p>
<ul>
<li>Purely Mechanical Tasks: For highly objective, unambiguous tasks like identifying a phone number or validating an email address, external annotators can be used after a rigorous internal process has defined the rubric.<br>
</li>
<li>Tasks Without Product Context: Well-defined tasks that don’t require understanding your product’s specific requirements can be outsourced. Translation is a good example: it requires linguistic expertise but not deep product knowledge.<br>
</li>
<li>Engaging Subject Matter Experts: Hiring external SMEs to act as your internal domain experts is not outsourcing; it is bringing the necessary expertise into your evaluation process. For example, <a href="https://www.ankihub.net/" target="_blank">AnkiHub</a> hired 4th-year medical students to evaluate their RAG systems for medical content rather than outsourcing to generic annotators.</li>
</ul>
<p><a href="../../../blog/posts/evals-faq/should-i-outsource-annotation-and-labeling-to-a-third-party.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-what-parts-of-evals-can-be-automated-with-llms" class="level2">
<h2 class="anchored" data-anchor-id="q-what-parts-of-evals-can-be-automated-with-llms">Q: What parts of evals can be automated with LLMs?</h2>
<p>LLMs can speed up parts of your eval workflow, but they can’t replace human judgment where your expertise is essential. For example, if you let an LLM handle all of <a href="../../../blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html" target="_blank">error analysis</a> (i.e., reviewing and annotating traces), you might overlook failure cases that matter for your product. Suppose users keep mentioning “lag” in feedback, but the LLM lumps these under generic “performance issues” instead of creating a “latency” category. You’d miss a recurring complaint about slow response times and fail to prioritize a fix.</p>
<p>That said, LLMs are valuable tools for accelerating certain parts of the evaluation workflow <em>when used with oversight</em>.</p>
<section id="here-are-some-areas-where-llms-can-help" class="level3">
<h3 class="anchored" data-anchor-id="here-are-some-areas-where-llms-can-help">Here are some areas where LLMs can help:</h3>
<ul>
<li><strong>First-pass axial coding:</strong> After you’ve open coded 30–50 traces yourself, use an LLM to organize your raw failure notes into proposed groupings. This helps you quickly spot patterns, but always review and refine the clusters yourself. <em>Note: If you aren’t familiar with axial and open coding, see <a href="../../../blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html" target="_blank">this faq</a>.</em></li>
<li><strong>Mapping annotations to failure modes:</strong> Once you’ve defined failure categories, you can ask an LLM to suggest which categories apply to each new trace (e.g., “Given this annotation: [open_annotation] and these failure modes: [list_of_failure_modes], which apply?”).<br>
</li>
<li><strong>Suggesting prompt improvements:</strong> When you notice recurring problems, have the LLM propose concrete changes to your prompts. Review these suggestions before adopting any changes.<br>
</li>
<li><strong>Analyzing annotation data:</strong> Use LLMs or AI-powered notebooks to find patterns in your labels, such as “reports of lag increase 3x during peak usage hours” or “slow response times are mostly reported from users on mobile devices.”</li>
</ul>
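<p>The mapping step above can be as simple as constructing a prompt from your annotation and failure taxonomy. Here’s a sketch; the function name and output convention are illustrative, and the actual LLM call is omitted:</p>

```python
# Build a prompt asking an LLM which known failure modes apply to an
# open-ended annotation. Hypothetical helper for illustration only.
def build_mapping_prompt(annotation: str, failure_modes: list[str]) -> str:
    modes = "\n".join(f"- {m}" for m in failure_modes)
    return (
        "Given this open-ended annotation of a trace:\n"
        f"{annotation}\n\n"
        f"And these known failure modes:\n{modes}\n\n"
        "List the failure modes that apply, one per line, or NONE."
    )

prompt = build_mapping_prompt(
    "Assistant suggested a showing time outside business hours",
    ["persona-tone mismatch", "unavailable showing time", "missing budget constraint"],
)
print(prompt)
```

<p>Keeping the taxonomy in the prompt (rather than fine-tuning a classifier) makes it trivial to revise categories as your error analysis evolves.</p>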
</section>
<section id="however-you-shouldnt-outsource-these-activities-to-an-llm" class="level3">
<h3 class="anchored" data-anchor-id="however-you-shouldnt-outsource-these-activities-to-an-llm">However, you shouldn’t outsource these activities to an LLM:</h3>
<ul>
<li><strong>Initial open coding:</strong> Always read through the raw traces yourself at the start. This is how you discover new types of failures, understand user pain points, and build intuition about your data. Never skip this or delegate it.<br>
</li>
<li><strong>Validating failure taxonomies:</strong> LLM-generated groupings need your review. For example, an LLM might group both “app crashes after login” and “login takes too long” under a single “login issues” category, even though one is a stability problem and the other is a performance problem. Without your intervention, you’d miss that these issues require different fixes.<br>
</li>
<li><strong>Ground truth labeling:</strong> For any data used for testing/validating LLM-as-Judge evaluators, hand-validate each label. LLMs can make mistakes that lead to unreliable benchmarks.<br>
</li>
<li><strong>Root cause analysis:</strong> LLMs may point out obvious issues, but only human review will catch patterns like errors that occur in specific workflows or edge cases—such as bugs that happen only when users paste data from Excel.</li>
</ul>
<p>In conclusion, start by examining data manually to understand what’s actually going wrong. Use LLMs to scale what you’ve learned, not to avoid looking at data.</p>
<p><a href="../../../blog/posts/evals-faq/what-parts-of-evals-can-be-automated-with-llms.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-should-i-stop-writing-prompts-manually-in-favor-of-automated-tools" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-stop-writing-prompts-manually-in-favor-of-automated-tools">Q: Should I stop writing prompts manually in favor of automated tools?</h2>
<p>Automating prompt engineering can be tempting, but you should be skeptical of tools that promise to optimize prompts for you, especially in early stages of development. When you write a prompt, you are forced to clarify your assumptions and externalize your requirements. Good writing is good thinking <sup>2</sup>. If you delegate this task to an automated tool too early, you risk never fully understanding your own requirements or the model’s failure modes.</p>
<p>This is because automated prompt optimization typically hill-climbs a predefined evaluation metric. It can refine a prompt to perform better on known failures, but it cannot discover <em>new</em> ones. Discovering new errors requires error analysis. Furthermore, research shows that evaluation criteria tend to shift after reviewing a model’s outputs, a phenomenon known as “criteria drift” <sup>3</sup>. This means that evaluation is an iterative, human-driven sensemaking process, not a static target that can be set once and handed off to an optimizer.</p>
<p>A pragmatic approach is to use LLMs to improve your prompt based on open coding (open-ended notes about traces). This way, you maintain a human in the loop who is looking at the data and externalizing their requirements. Once you have a high-quality set of evals, prompt optimization can be effective for that last mile of performance.</p>
<p><a href="../../../blog/posts/evals-faq/should-i-stop-writing-prompts-manually-in-favor-of-automated-tools.html" class="faq-individual-link">↗ Focus view</a></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>
</section>
</section>
<section id="tools-infrastructure" class="level1">
<h1>Tools &amp; Infrastructure</h1>
<section id="q-should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf" class="level2">
<h2 class="anchored" data-anchor-id="q-should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf">Q: Should I build a custom annotation tool or use something off-the-shelf?</h2>
<p><strong>Build a custom annotation tool.</strong> This is the single most impactful investment you can make for your AI evaluation workflow. With AI-assisted development tools like Cursor or Lovable, you can build a tailored interface in hours. I often find that teams with custom annotation tools iterate ~10x faster.</p>
<p>Custom tools excel because:</p>
<ul>
<li>They show all your context from multiple systems in one place</li>
<li>They can render your data in a product specific way (images, widgets, markdown, buttons, etc.)</li>
<li>They’re designed for your specific workflow (custom filters, sorting, progress bars, etc.)</li>
</ul>
<p>Off-the-shelf tools may be justified when you need to coordinate dozens of distributed annotators with enterprise access controls. Even then, many teams find the configuration overhead and limitations aren’t worth it.</p>
<p><a href="https://youtu.be/fA4pe9bE0LY" target="_blank">Isaac’s Anki flashcard annotation app</a> shows the power of custom tools—handling 400+ results per query with keyboard navigation and domain-specific evaluation criteria that would be nearly impossible to configure in a generic tool.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/fA4pe9bE0LY" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><a href="../../../blog/posts/evals-faq/should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-what-makes-a-good-custom-interface-for-reviewing-llm-outputs" class="level2">
<h2 class="anchored" data-anchor-id="q-what-makes-a-good-custom-interface-for-reviewing-llm-outputs">Q: What makes a good custom interface for reviewing LLM outputs?</h2>
<p>Great interfaces make human review fast, clear, and motivating. We recommend building your own annotation tool customized to your domain. The following features are possible enhancements we’ve seen work well, but you don’t need all of them. The screenshots shown are illustrative examples to clarify concepts. In practice, I rarely implement all these features in a single app. It’s ultimately a judgment call based on your specific needs and constraints.</p>
<section id="render-traces-intelligently-not-generically" class="level3">
<h3 class="anchored" data-anchor-id="render-traces-intelligently-not-generically"><strong>1. Render Traces Intelligently, Not Generically</strong>:</h3>
<p>Present the trace in a way that’s intuitive for the domain. If you’re evaluating generated emails, render them to look like emails. If the output is code, use syntax highlighting. Allow the reviewer to see the full trace (user input, tool calls, and LLM reasoning), but keep less important details in collapsed sections that can be expanded. Here is an example of a custom annotation tool for reviewing real estate assistant emails:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/emailinterface1.png" class="img-fluid figure-img" style="width:75.0%" target="_blank"></p>
<figcaption>A custom interface for reviewing emails for a real estate assistant.</figcaption>
</figure>
</div>
</section>
<section id="show-progress-and-support-keyboard-navigation" class="level3">
<h3 class="anchored" data-anchor-id="show-progress-and-support-keyboard-navigation"><strong>2. Show Progress and Support Keyboard Navigation</strong>:</h3>
<p>Keep reviewers in a state of flow by minimizing friction and motivating completion. Include progress indicators (e.g., “Trace 45 of 100”) to keep the review session bounded and encourage completion. Enable hotkeys for navigating between traces (e.g., N for next), applying labels, and saving notes quickly. Below is an illustration of these features:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/hotkey.png" class="img-fluid figure-img" style="width:75.0%" target="_blank"></p>
<figcaption>An annotation interface with a progress bar and hotkey guide</figcaption>
</figure>
</div>
</section>
<section id="trace-navigation-through-clustering-filtering-and-search" class="level3">
<h3 class="anchored" data-anchor-id="trace-navigation-through-clustering-filtering-and-search"><strong>3. Trace navigation through clustering, filtering, and search</strong>:</h3>
<p>Allow reviewers to filter traces by metadata or search by keywords. Semantic search helps find conceptually similar problems. Clustering similar traces (like grouping by user persona) lets reviewers spot recurring issues and explore hypotheses. Below is an illustration of these features:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/group1.png" class="img-fluid figure-img" style="width:75.0%" target="_blank"></p>
<figcaption>Cluster view showing groups of emails, such as property-focused or client-focused examples. Reviewers can drill into a group to see individual traces.</figcaption>
</figure>
</div>
</section>
<section id="prioritize-labeling-traces-you-think-might-be-problematic" class="level3">
<h3 class="anchored" data-anchor-id="prioritize-labeling-traces-you-think-might-be-problematic"><strong>4. Prioritize labeling traces you think might be problematic</strong>:</h3>
<p>Surface traces flagged by guardrails, CI failures, or automated evaluators for review. Provide buttons to take actions like adding to datasets, filing bugs, or re-running pipeline tests. Display relevant context (pipeline version, eval scores, reviewer info) directly in the interface to minimize context switching. Below is an illustration of these ideas:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/ci.png" class="img-fluid figure-img" style="width:88.0%" target="_blank"></p>
<figcaption>A trace view that allows you to quickly see auto-evaluator verdict, add traces to dataset or open issues. Also shows metadata like pipeline version, reviewer info, and more.</figcaption>
</figure>
</div>
</section>
<section id="general-principle-keep-it-minimal" class="level3">
<h3 class="anchored" data-anchor-id="general-principle-keep-it-minimal">General Principle: Keep it minimal</h3>
<p>Keep your annotation interface minimal. Only incorporate these ideas if they provide a benefit that outweighs the additional complexity and maintenance overhead.</p>
<p><a href="../../../blog/posts/evals-faq/what-makes-a-good-custom-interface-for-reviewing-llm-outputs.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-what-gaps-in-eval-tooling-should-i-be-prepared-to-fill-myself" class="level2">
<h2 class="anchored" data-anchor-id="q-what-gaps-in-eval-tooling-should-i-be-prepared-to-fill-myself">Q: What gaps in eval tooling should I be prepared to fill myself?</h2>
<p>Most eval tools handle the basics well: logging complete traces, tracking metrics, prompt playgrounds, and annotation queues. These are table stakes. Here are four areas where you’ll likely need to supplement existing tools.</p>
<p>Watch for vendors addressing these gaps: it’s a strong signal they understand practitioner needs.</p>
<section id="error-analysis-and-pattern-discovery" class="level3">
<h3 class="anchored" data-anchor-id="error-analysis-and-pattern-discovery">1. Error Analysis and Pattern Discovery</h3>
<p>After reviewing traces where your AI fails, can your tooling automatically cluster similar issues? For instance, if multiple traces show the assistant using casual language for luxury clients, you need something that recognizes this broader “persona-tone mismatch” pattern. We recommend building capabilities that use AI to suggest groupings, rewrite your observations into clearer failure taxonomies, help find similar cases through semantic search, etc.</p>
</section>
<section id="ai-powered-assistance-throughout-the-workflow" class="level3">
<h3 class="anchored" data-anchor-id="ai-powered-assistance-throughout-the-workflow">2. AI-Powered Assistance Throughout the Workflow</h3>
<p>The most effective workflows use AI to accelerate every stage of evaluation. During error analysis, you want an LLM helping categorize your open-ended observations into coherent failure modes. For example, you might annotate several traces with notes like “wrong tone for investor,” “too casual for luxury buyer,” etc. Your tooling should recognize these as the same underlying pattern and suggest a unified “persona-tone mismatch” category.</p>
<p>You’ll also want AI assistance in proposing fixes. After identifying 20 cases where your assistant omits pet policies from property summaries, can your workflow analyze these failures and suggest specific prompt modifications? Can it draft refinements to your SQL generation instructions when it notices patterns of missing WHERE clauses?</p>
<p>Additionally, good workflows help you conduct data analysis of your annotations and traces. I like using notebooks with AI in-the-loop like <a href="https://julius.ai/" target="_blank">Julius</a>, <a href="https://hex.tech" target="_blank">Hex</a>, or <a href="https://solveit.fast.ai/" target="_blank">SolveIt</a>. These help me discover insights like “location ambiguity errors spike 3x when users mention neighborhood names” or “tone mismatches occur 80% more often in email generation than other modalities.”</p>
</section>
<section id="custom-evaluators-over-generic-metrics" class="level3">
<h3 class="anchored" data-anchor-id="custom-evaluators-over-generic-metrics">3. Custom Evaluators Over Generic Metrics</h3>
<p>Be prepared to build most of your evaluators from scratch. Generic metrics like “hallucination score” or “helpfulness rating” rarely capture what actually matters for your application—like proposing unavailable showing times or omitting budget constraints from emails. In our experience, successful teams spend most of their effort on application-specific metrics.</p>
</section>
<section id="apis-that-support-custom-annotation-apps" class="level3">
<h3 class="anchored" data-anchor-id="apis-that-support-custom-annotation-apps">4. APIs That Support Custom Annotation Apps</h3>
<p>Custom annotation interfaces work best for most teams. This requires observability platforms with thoughtful APIs. I often have to build my own libraries and abstractions just to make bulk data export manageable. You shouldn’t have to paginate through thousands of requests or handle timeout-prone endpoints just to get your data. Look for platforms that provide true bulk export capabilities and, crucially, APIs that let you write annotations back efficiently.</p>
<p><a href="../../../blog/posts/evals-faq/what-gaps-in-eval-tooling-should-i-be-prepared-to-fill-myself.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-whats-your-favorite-eval-vendor" class="level2">
<h2 class="anchored" data-anchor-id="q-whats-your-favorite-eval-vendor">Q: What’s your favorite eval vendor?</h2>
<p>Eval tools are in an intensely competitive space. It would be futile to compare their features. If I tried to do such an analysis, it would be invalidated in a week! Vendors I encounter the most organically in my work are: <a href="https://www.langchain.com/langsmith" target="_blank">LangSmith</a>, <a href="https://arize.com/" target="_blank">Arize</a> and <a href="https://www.braintrust.dev/" target="_blank">Braintrust</a>.</p>
<p>When I help clients with vendor selection, the decision weighs heavily towards who can offer the best support, as opposed to purely features. This changes depending on the size of the client, the use case, etc. Yes - it’s mainly the human factor that matters, and dare I say, vibes.</p>
<p>I have no favorite vendor. At the core, their features are very similar - and I often build <a href="https://hamel.dev/blog/posts/evals/#q-should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf" target="_blank">custom tools</a> on top of them to fit my needs.</p>
<p>Here is a <a href="../../../blog/posts/eval-tools/">video series</a> that has a live commentary on the relative strengths and weaknesses of the three aforementioned vendors.</p>
<p><a href="../../../blog/posts/evals-faq/whats-your-favorite-eval-vendor.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-should-i-version-and-manage-prompts" class="level2">
<h2 class="anchored" data-anchor-id="q-how-should-i-version-and-manage-prompts">Q: How should I version and manage prompts?</h2>
<p>There is an unavoidable tension between keeping prompts close to the code vs.&nbsp;an environment that non-technical stakeholders can access.</p>
<p><strong>My preferred approach is storing prompts in Git.</strong> This treats them as software artifacts that are versioned, reviewed, and deployed atomically with the application code. While the Git command line is unfriendly for non-technical folks, the <a href="https://github.com">GitHub</a> web interface and the GitHub <a href="https://desktop.github.com/">Desktop app</a> make it very approachable. When I was working at GitHub, I worked with many non-technical professionals, including lawyers and accountants, who used these tools effectively. Here is a <a href="https://ben.balter.com/2023/03/02/github-for-non-technical-roles/">blog post</a> aimed at non-technical folks to get started.</p>
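<p>As a minimal sketch, Git-versioned prompts can just be plain text files loaded at runtime. The directory layout and template name below are assumptions for illustration, not a prescribed standard:</p>

```python
from pathlib import Path

# Prompts live as plain files, versioned in Git alongside the code.
PROMPT_DIR = Path("prompts")
PROMPT_DIR.mkdir(exist_ok=True)
(PROMPT_DIR / "summarize_listing.txt").write_text(
    "Summarize the listing for {persona}. Keep it under {max_words} words."
)

def load_prompt(name: str, **variables) -> str:
    """Read a versioned prompt template and fill in its variables."""
    template = (PROMPT_DIR / f"{name}.txt").read_text()
    return template.format(**variables)

print(load_prompt("summarize_listing", persona="an investor", max_words=80))
```

<p>Because the templates are ordinary files, prompt changes show up in pull requests and deploy atomically with the code that uses them.</p>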
<p>Alternatively, most vendors in the LLM tooling space, such as observability platforms like Arize, Braintrust, and LangSmith, offer dedicated prompt management tools. These are accessible for rapid iteration but risk creating additional layers of indirection.</p>
<p><strong>Why prompt management tools often fall short:</strong> AI products typically involve many moving parts: tools, RAG, agents, etc. Prompt management tools are inherently limiting because they can’t easily execute your application’s code. Even when they can, there’s often significant indirection involved, making it difficult to test prompts with your system’s capabilities.</p>
<p><strong>When possible, a notebook provides a great solution for prompt experimentation.</strong> If you have Python entry points into your codebase or your codebase is written in Python, Jupyter notebooks are particularly powerful for this purpose. You can experiment with prompts and iterate on your actual AI agents with their full tool and RAG capabilities. This makes it much easier to understand how your system works in practice. Additionally, you can create widgets and small user interfaces within notebooks, giving you the best of both worlds for experimentation and iteration. To see what this looks like in practice, Teresa Torres gives a fantastic, hands-on walkthrough of how she, as a PM, used notebooks for the entire eval and experimentation lifecycle:</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/N-qAOv_PNPc" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p>If notebooks are not feasible for your code base, an <a href="../../../blog/posts/field-guide/#build-bridges-not-gatekeepers">​integrated prompt environment</a>​ can be effective for experimentation. Either way, I prefer to version and manage prompts in Git.</p>
<p><a href="../../../blog/posts/evals-faq/how-should-i-version-and-manage-prompts.html" class="faq-individual-link">↗ Focus view</a></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>
</section>
</section>
<section id="production-deployment" class="level1">
<h1>Production &amp; Deployment</h1>
<section id="q-how-are-evaluations-used-differently-in-cicd-vs.-monitoring-production" class="level2">
<h2 class="anchored" data-anchor-id="q-how-are-evaluations-used-differently-in-cicd-vs.-monitoring-production">Q: How are evaluations used differently in CI/CD vs.&nbsp;monitoring production?</h2>
<p>The most important difference between CI and production evaluation is the data used for testing.</p>
<p>Test datasets for CI are small (in many cases 100+ examples) and purpose-built. Examples cover core features, regression tests for past bugs, and known edge cases. Since CI tests are run frequently, the cost of each test has to be carefully considered (that’s why you carefully curate the dataset). Favor assertions or other deterministic checks over LLM-as-judge evaluators.</p>
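<p>As a sketch, a deterministic CI check can be a plain assertion over your application’s structured output. The required fields below are hypothetical stand-ins for whatever your system returns:</p>

```python
import json

# Cheap, deterministic CI assertions - no LLM judge needed.
def assert_valid_response(raw: str, required_keys=("summary", "listing_id")):
    obj = json.loads(raw)  # malformed JSON fails loudly
    for key in required_keys:
        assert key in obj, f"missing field: {key}"
    assert obj["summary"].strip(), "empty summary"

assert_valid_response('{"summary": "2 bed condo near downtown", "listing_id": 42}')
print("ok")
```

<p>Checks like this run in milliseconds, so they can gate every commit without the cost or flakiness of a judge model.</p>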
<p>For evaluating production traffic, you can sample live traces and run evaluators against them asynchronously. Since you usually lack reference outputs on production data, you might rely more on expensive reference-free evaluators like LLM-as-judge. Additionally, track confidence intervals for production metrics. If the lower bound crosses your threshold, investigate further.</p>
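<p>For the confidence intervals mentioned above, a Wilson score interval is one common choice for pass rates estimated from a sample of traces. A minimal sketch:</p>

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a pass rate estimated from n sampled traces."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

# 180 of 200 sampled traces passed: 90% observed pass rate
low, high = wilson_interval(180, 200)
print(f"{low:.3f}-{high:.3f}")
```

<p>If your quality threshold is, say, 0.85 and the lower bound dips below it, that is the signal to pull traces and do error analysis rather than trusting the point estimate.</p>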
<p>These two systems are complementary: when production monitoring reveals new failure patterns through error analysis and evals, add representative examples to your CI dataset. This mitigates regressions on new issues.</p>
<p><a href="../../../blog/posts/evals-faq/how-are-evaluations-used-differently-in-cicd-vs-monitoring-production.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-whats-the-difference-between-guardrails-evaluators" class="level2">
<h2 class="anchored" data-anchor-id="q-whats-the-difference-between-guardrails-evaluators">Q: What’s the difference between guardrails &amp; evaluators?</h2>
<p>Guardrails are <strong>inline safety checks</strong> that sit directly in the request/response path. They validate inputs or outputs <em>before</em> anything reaches a user, so they typically are:</p>
<ul>
<li><strong>Fast and deterministic</strong> – typically a few milliseconds of latency budget.</li>
<li><strong>Simple and explainable</strong> – regexes, keyword block-lists, schema or type validators, lightweight classifiers.</li>
<li><strong>Targeted at clear-cut, high-impact failures</strong> – PII leaks, profanity, disallowed instructions, SQL injection, malformed JSON, invalid code syntax, etc.</li>
</ul>
<p>If a guardrail triggers, the system can redact, refuse, or regenerate the response. Because these checks are user-visible when they fire, false positives are treated as production bugs; teams version guardrail rules, log every trigger, and monitor rates to keep them conservative.</p>
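<p>A minimal sketch of such inline checks, assuming a text output and a JSON payload; the regex pattern and required keys are illustrative, not a complete guardrail suite:</p>

```python
import json
import re

# Deterministic, millisecond-scale checks that run before a response
# reaches the user. No LLM calls in the request path.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def check_output(text: str):
    """Return (ok, reason) for a free-text response."""
    if SSN_RE.search(text):
        return False, "possible SSN leak"
    return True, ""

def check_json_payload(text: str, required_keys=("answer", "sources")):
    """Validate that a structured response parses and has the expected keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False, "malformed JSON"
    missing = [k for k in required_keys if k not in obj]
    return (not missing, f"missing keys: {missing}" if missing else "")

print(check_output("My SSN is 123-45-6789"))
```

<p>Because each check returns a reason, every trigger can be logged and audited, which is exactly what you need to keep false-positive rates conservative.</p>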
<p>On the other hand, evaluators typically run <strong>after</strong> a response is produced. Evaluators measure qualities that simple rules cannot, such as factual correctness, completeness, etc. Their verdicts feed dashboards, regression tests, and model-improvement loops, but they do not block the original answer.</p>
<p>Evaluators are usually run asynchronously or in batch to afford heavier computation such as an <a href="https://hamel.dev/blog/posts/llm-judge/" target="_blank">LLM-as-a-Judge</a>. Inline use of an LLM-as-Judge is possible <em>only</em> when the latency budget and reliability targets allow it. Slow LLM judges might be feasible in a cascade that runs on the minority of borderline cases.</p>
<p>Apply guardrails for immediate protection against objective failures requiring intervention. Use evaluators for monitoring and improving subjective or nuanced criteria. Together, they create layered protection.</p>
<p>Word of caution: Do not use off-the-shelf LLM guardrails blindly. Always <a href="https://hamel.dev/blog/posts/prompt/" target="_blank">look at the prompt</a>.</p>
<p><a href="../../../blog/posts/evals-faq/whats-the-difference-between-guardrails-evaluators.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-can-my-evaluators-also-be-used-to-automatically-fix-or-correct-outputs-in-production" class="level2">
<h2 class="anchored" data-anchor-id="q-can-my-evaluators-also-be-used-to-automatically-fix-or-correct-outputs-in-production">Q: Can my evaluators also be used to automatically <em>fix</em> or <em>correct</em> outputs in production?</h2>
<p>Yes, but only a specific subset of them. This is the distinction between an <strong>evaluator</strong> and a <strong>guardrail</strong> that we previously discussed. As a reminder:</p>
<ul>
<li><strong>Evaluators</strong> typically run <em>asynchronously</em> after a response has been generated. They measure quality but don’t interfere with the user’s immediate experience.<br>
</li>
<li><strong>Guardrails</strong> run <em>synchronously</em> in the critical path of the request, before the output is shown to the user. Their job is to prevent high-impact failures in real-time.</li>
</ul>
<p>There are two important decision criteria for deciding whether to use an evaluator as a guardrail:</p>
<ol type="1">
<li><p><strong>Latency &amp; Cost</strong>: Can the evaluator run fast enough and cheaply enough in the critical request path without degrading user experience?</p></li>
<li><p><strong>Error Rate Trade-offs</strong>: What’s the cost-benefit balance between false positives (blocking good outputs and frustrating users) versus false negatives (letting bad outputs reach users and causing harm)? In high-stakes domains like medical advice, false negatives may be more costly than false positives. In creative applications, false positives that block legitimate creativity may be more harmful than occasional quality issues.</p></li>
</ol>
<p>Most guardrails are designed to be <strong>fast</strong> (to avoid harming user experience) and have a <strong>very low false positive rate</strong> (to avoid blocking valid responses). For this reason, you would almost never use a slow or non-deterministic LLM-as-Judge as a synchronous guardrail. However, these tradeoffs might be different for your use case.</p>
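<p>To illustrate the layering, here is a minimal sketch: a fast deterministic check runs synchronously as a guardrail, while the response is queued for heavier asynchronous evaluation. The pattern, function names, and blocked-content check are illustrative assumptions, not a prescribed implementation:</p>

```python
import re

# Hypothetical deterministic guardrail: fast, low-false-positive checks that
# run synchronously before the response reaches the user.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # looks like a US SSN
]

def passes_guardrail(response: str) -> bool:
    """Block only on objective, high-confidence failures."""
    return not any(p.search(response) for p in BLOCKED_PATTERNS)

def serve(response: str, eval_queue: list) -> str:
    """Guardrail runs in the critical path; evaluators run later, off-path."""
    if not passes_guardrail(response):
        return "Sorry, I can't share that."  # real-time intervention
    eval_queue.append(response)              # async LLM-as-Judge picks this up
    return response

queue = []
assert serve("Your order ships Monday.", queue) == "Your order ships Monday."
assert serve("SSN: 123-45-6789", queue).startswith("Sorry")
assert queue == ["Your order ships Monday."]  # only served responses are queued
```

<p>Note the asymmetry: the synchronous check is deterministic and cheap, while anything slow or probabilistic stays out of the request path.</p>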
<p><a href="../../../blog/posts/evals-faq/can-my-evaluators-also-be-used-to-automatically-fix-or-correct-outputs-in-production.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-much-time-should-i-spend-on-model-selection" class="level2">
<h2 class="anchored" data-anchor-id="q-how-much-time-should-i-spend-on-model-selection">Q: How much time should I spend on model selection?</h2>
<p>Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”</p>
<p><a href="../../../blog/posts/evals-faq/how-much-time-should-i-spend-on-model-selection.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="domain-specific-applications" class="level1">
<h1>Domain-Specific Applications</h1>
<section id="q-is-rag-dead" class="level2">
<h2 class="anchored" data-anchor-id="q-is-rag-dead">Q: Is RAG dead?</h2>
<p>Question: Should I avoid using RAG for my AI application after reading that <a href="https://pashpashpash.substack.com/p/why-i-no-longer-recommend-rag-for" target="_blank">“RAG is dead”</a> for coding agents?</p>
<blockquote class="blockquote">
<p>Many developers are confused about when and how to use RAG after reading articles claiming “RAG is dead.” Understanding what RAG actually means versus the narrow marketing definitions will help you make better architectural decisions for your AI applications.</p>
</blockquote>
<p>The viral article claiming RAG is dead specifically argues against using <em>naive vector database retrieval</em> for autonomous coding agents, not RAG as a whole. This is a crucial distinction that many developers miss due to misleading marketing.</p>
<p>RAG simply means Retrieval-Augmented Generation - using retrieval to provide relevant context that improves your model’s output. The core principle remains essential: your LLM needs the right context to generate accurate answers. The question isn’t whether to use retrieval, but how to retrieve effectively.</p>
<p>For coding applications, naive vector similarity search often fails because code relationships are complex and contextual. Instead of abandoning retrieval entirely, modern coding assistants like Claude Code <a href="https://x.com/pashmerepat/status/1926717705660375463?s=46" target="_blank">still use retrieval</a>—they just employ agentic search instead of relying solely on vector databases, similar to how human developers work.</p>
<p>You have multiple retrieval strategies available, ranging from simple keyword matching to embedding similarity to LLM-powered relevance filtering. The optimal approach depends on your specific use case, data characteristics, and performance requirements. Many production systems combine multiple strategies or use multi-hop retrieval guided by LLM agents.</p>
<p>Unfortunately, “RAG” has become a buzzword with no shared definition. Some people use it to mean any retrieval system, others restrict it to vector databases. Focus on the ultimate goal: getting your LLM the context it needs to succeed. Whether that’s through vector search, agentic exploration, or hybrid approaches is a product and engineering decision.</p>
<p>Rather than following categorical advice to avoid or embrace RAG, experiment with different retrieval approaches and measure what works best for your application. For more info on RAG evaluation and optimization, see <a href="../../../notes/llm/rag/not_dead.html">this series of posts</a>.</p>
<p><a href="../../../blog/posts/evals-faq/is-rag-dead.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-should-i-approach-evaluating-my-rag-system" class="level2">
<h2 class="anchored" data-anchor-id="q-how-should-i-approach-evaluating-my-rag-system">Q: How should I approach evaluating my RAG system?</h2>
<p>RAG systems have two distinct components that require different evaluation approaches: retrieval and generation.</p>
<p>The retrieval component is a search problem. Evaluate it using traditional information retrieval (IR) metrics. Common examples include Recall@k (of all relevant documents, how many did you retrieve in the top k?), Precision@k (of the k documents retrieved, how many were relevant?), or MRR (how high up was the first relevant document?). The specific metrics you choose depend on your use case. These metrics are pure search metrics that measure whether you’re finding the right documents (more on this below).</p>
<p>To evaluate retrieval, create a dataset of queries paired with their relevant documents. Generate this synthetically by taking documents from your corpus, extracting key facts, then generating questions those facts would answer. This reverse process gives you query-document pairs for measuring retrieval performance without manual annotation.</p>
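<p>Once you have query-document pairs, these IR metrics are simple to compute. A minimal sketch (document IDs here are made up for illustration):</p>

```python
def recall_at_k(retrieved, relevant, k):
    """Of all relevant docs, how many appear in the top k results?"""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Of the top k retrieved docs, how many are relevant?"""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc (0 if none is found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

# One synthetic query-document pair: docs "d2" and "d5" are relevant.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d2", "d5"]
assert recall_at_k(retrieved, relevant, 3) == 0.5   # found d2, missed d5
assert precision_at_k(retrieved, relevant, 5) == 0.4
assert mrr(retrieved, relevant) == 0.5              # first hit at rank 2
```

<p>In practice you would average each metric over all queries in your synthetic evaluation set.</p>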
<p>For the generation component—how well the LLM uses retrieved context, whether it hallucinates, whether it answers the question—use the same evaluation procedures covered throughout this course: error analysis to identify failure modes, collecting human labels, building LLM-as-judge evaluators, and validating those judges against human annotations.</p>
<p>Jason Liu’s <a href="https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/" target="_blank">“There Are Only 6 RAG Evals”</a> provides a framework that maps well to this separation. His Tier 1 covers traditional IR metrics for retrieval. Tiers 2 and 3 evaluate relationships between Question, Context, and Answer—like whether the context is relevant (C|Q), whether the answer is faithful to context (A|C), and whether the answer addresses the question (A|Q).</p>
<p>In addition to Jason’s six evals, error analysis on your specific data may reveal domain-specific failure modes that warrant their own metrics. For example, a medical RAG system might consistently fail to distinguish between drug dosages for adults versus children, or a legal RAG might confuse jurisdictional boundaries. These patterns emerge only through systematic review of actual failures. Once identified, you can create targeted evaluators for these specific issues beyond the general framework.</p>
<p>Finally, when implementing Jason’s Tier 2 and 3 metrics, don’t just use prompts off the shelf. The standard LLM-as-judge process requires several steps: error analysis, prompt iteration, creating labeled examples, and measuring your judge’s accuracy against human labels. Once you know your judge’s True Positive and True Negative rates, you can correct its estimates to determine the actual failure rate in your system. Skip this validation and your judges may not reflect your actual quality criteria.</p>
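<p>To make the correction step concrete, here is a minimal sketch using a prevalence-style adjustment from the judge's measured true positive and true negative rates. The exact formulation is my assumption (a Rogan-Gladen style estimator), not prescribed by the post:</p>

```python
def corrected_pass_rate(observed_pass_rate, tpr, tnr):
    """Recover the true pass rate theta from the judge's observed pass rate,
    where tpr = P(judge passes | truly passing) and
          tnr = P(judge fails  | truly failing),
    both measured against human labels. Derived from:
        observed = theta * tpr + (1 - theta) * (1 - tnr)
    """
    return (observed_pass_rate + tnr - 1) / (tpr + tnr - 1)

# A judge with 90% TPR and 80% TNR that reports an 85% pass rate actually
# implies a higher true pass rate (about 92.9%).
theta = corrected_pass_rate(0.85, tpr=0.90, tnr=0.80)
assert abs(theta - 0.65 / 0.70) < 1e-9
```

<p>The failure rate is then simply <code>1 - theta</code>. Without this step, the judge's raw score conflates its own error rates with your system's quality.</p>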
<p>In summary, debug retrieval first using IR metrics, then tackle generation quality using properly validated LLM judges.</p>
<p><a href="../../../blog/posts/evals-faq/how-should-i-approach-evaluating-my-rag-system.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-choose-the-right-chunk-size-for-my-document-processing-tasks" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-choose-the-right-chunk-size-for-my-document-processing-tasks">Q: How do I choose the right chunk size for my document processing tasks?</h2>
<p>Unlike RAG, where chunks are optimized for retrieval, document processing assumes the model will see every chunk. The goal is to split text so the model can reason effectively without being overwhelmed. Even if a document fits within the context window, it might be better to break it up. Long inputs can degrade performance due to attention bottlenecks, especially in the middle of the context. Two task types require different strategies:</p>
<section id="fixed-output-tasks-large-chunks" class="level3">
<h3 class="anchored" data-anchor-id="fixed-output-tasks-large-chunks">1. Fixed-Output Tasks → Large Chunks</h3>
<p>These are tasks where the output length doesn’t grow with input: extracting a number, answering a specific question, classifying a section. For example:</p>
<ul>
<li>“What’s the penalty clause in this contract?”</li>
<li>“What was the CEO’s salary in 2023?”</li>
</ul>
<p>Use the largest chunk (with caveats) that likely contains the answer. This reduces the number of queries and avoids context fragmentation. However, avoid adding irrelevant text. Models are sensitive to distraction, especially with large inputs. The middle parts of a long input might be under-attended. Furthermore, if cost and latency are a bottleneck, you should consider preprocessing or filtering the document (via keyword search or a lightweight retriever) to isolate relevant sections before feeding a huge chunk.</p>
</section>
<section id="expansive-output-tasks-smaller-chunks" class="level3">
<h3 class="anchored" data-anchor-id="expansive-output-tasks-smaller-chunks">2. Expansive-Output Tasks → Smaller Chunks</h3>
<p>These include summarization, exhaustive extraction, or any task where output grows with input. For example:</p>
<ul>
<li>“Summarize each section”</li>
<li>“List all customer complaints”</li>
</ul>
<p>In these cases, smaller chunks help preserve reasoning quality and output completeness. The standard approach is to process each chunk independently, then aggregate results (e.g., map-reduce). When sizing your chunks, try to respect content boundaries like paragraphs, sections, or chapters. Chunking also helps mitigate output limits. By breaking the task into pieces, each piece’s output can stay within limits.</p>
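<p>The chunk-then-aggregate pattern might be sketched as follows, with <code>summarize</code> standing in for a per-chunk LLM call. All names and the character-budget heuristic are illustrative assumptions:</p>

```python
def chunk_by_paragraph(text, max_chars=500):
    """Pack whole paragraphs into chunks of at most max_chars, respecting
    content boundaries (a single paragraph over the budget stands alone)."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def map_reduce(text, summarize, combine, max_chars=500):
    """Process each chunk independently (map), then aggregate (reduce)."""
    partials = [summarize(chunk) for chunk in chunk_by_paragraph(text, max_chars)]
    return combine(partials)

doc = "Complaint: late delivery.\n\nComplaint: wrong item.\n\nPraise: great support."
# With a small budget, each paragraph becomes its own chunk.
assert len(chunk_by_paragraph(doc, max_chars=30)) == 3
```

<p>Because each chunk is processed independently, each piece's output also stays within the model's output limits.</p>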
</section>
<section id="general-guidance" class="level3">
<h3 class="anchored" data-anchor-id="general-guidance">General Guidance</h3>
<p>It’s important to recognize <strong>why chunk size affects results</strong>. A larger chunk means the model has to reason over more information in one go – essentially, a heavier cognitive load. LLMs have limited capacity to <strong>retain and correlate details across a long text</strong>. If too much is packed in, the model might prioritize certain parts (commonly the beginning or end) and overlook or “forget” details in the middle. This can lead to overly coarse summaries or missed facts. In contrast, a smaller chunk bounds the problem: the model can pay full attention to that section. You are trading off <strong>global context for local focus</strong>.</p>
<p>No rule of thumb can perfectly determine the best chunk size for your use case – <strong>you should validate with experiments</strong>. The optimal chunk size can vary by domain and model. I treat chunk size as a hyperparameter to tune.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-choose-the-right-chunk-size-for-my-document-processing-tasks.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-how-do-i-debug-multi-turn-conversation-traces" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-debug-multi-turn-conversation-traces">Q: How do I debug multi-turn conversation traces?</h2>
<p>Start simple. Check if the whole conversation met the user’s goal with a pass/fail judgment. Look at the entire trace and focus on the first upstream failure. Read the user-visible parts first to understand if something went wrong. Only then dig into the technical details like tool calls and intermediate steps.</p>
<section id="multi-agent-trace-logging" class="level3">
<h3 class="anchored" data-anchor-id="multi-agent-trace-logging">Multi-agent trace logging</h3>
<p>For multi-agent flows, assign a session or trace ID to each user request and log every message with its source (which agent or tool), trace ID, and position in the sequence. This lets you reconstruct the full path from initial query to final result across all agents.</p>
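<p>A minimal sketch of such a logging schema (field names and the in-memory store are illustrative assumptions; production systems would write to an observability backend):</p>

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    trace_id: str   # one ID per user request, shared across all agents
    seq: int        # position in the sequence
    source: str     # which agent or tool emitted this message
    payload: str
    ts: float = field(default_factory=time.time)

class TraceLog:
    def __init__(self):
        self.events = []

    def new_trace(self):
        return uuid.uuid4().hex

    def log(self, trace_id, source, payload):
        seq = sum(1 for e in self.events if e.trace_id == trace_id)
        self.events.append(TraceEvent(trace_id, seq, source, payload))

    def reconstruct(self, trace_id):
        """Rebuild the full path from initial query to final result."""
        evts = sorted((e for e in self.events if e.trace_id == trace_id),
                      key=lambda e: e.seq)
        return [f"{e.source}: {e.payload}" for e in evts]
```

<p>With every message carrying a trace ID, source, and sequence number, reconstructing a session for review is a simple filter-and-sort.</p>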
</section>
<section id="annotation-strategy" class="level3">
<h3 class="anchored" data-anchor-id="annotation-strategy">Annotation strategy</h3>
<p>Annotate only the first failure in the trace initially—don’t worry about downstream failures since these often cascade from the first issue. Fixing upstream failures often resolves dependent downstream failures automatically. As you gain experience, you can annotate independent failure modes within the same trace to speed up overall error analysis.</p>
</section>
<section id="simplify-when-possible" class="level3">
<h3 class="anchored" data-anchor-id="simplify-when-possible">Simplify when possible</h3>
<p>When you find a failure, reproduce it with the simplest possible test case. Here’s an example: suppose a shopping bot gives the wrong return policy on turn 4 of a conversation. Before diving into the full multi-turn complexity, simplify it to a single turn: “What is the return window for product X1000?” If it still fails, you’ve proven the error isn’t about conversation context - it’s likely a basic retrieval or knowledge issue you can debug more easily.</p>
</section>
<section id="test-case-generation" class="level3">
<h3 class="anchored" data-anchor-id="test-case-generation">Test case generation</h3>
<p>You have two main approaches. First, simulate users with another LLM to create realistic multi-turn conversations. Second, use “N-1 testing” where you provide the first N-1 turns of a real conversation and test what happens next. The N-1 approach often works better since it uses actual conversation prefixes rather than fully synthetic interactions, but is less flexible.</p>
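<p>A minimal sketch of the N-1 approach, with a stub standing in for the model client (the <code>chat</code> interface, message format, and stub reply are all assumptions):</p>

```python
def n_minus_1_test(client, conversation, check):
    """Replay the first N-1 turns of a real logged conversation, regenerate
    the final assistant turn, and validate it with `check`."""
    prefix = conversation[:-1]   # actual conversation prefix (turns 1..N-1)
    reply = client.chat(prefix)  # regenerate turn N
    return check(reply)

class StubClient:
    """Stand-in model for illustration; always returns a fixed reply."""
    def chat(self, messages):
        return "The return window for the X1000 is 30 days."

convo = [
    {"role": "user", "content": "Do you sell the X1000?"},
    {"role": "assistant", "content": "Yes, we do."},
    {"role": "user", "content": "What is its return window?"},
    {"role": "assistant", "content": "(logged reply)"},
]
assert n_minus_1_test(StubClient(), convo, lambda r: "30 days" in r)
```

<p>Because the prefix comes from a real session, the test exercises realistic conversational context without depending on a fully simulated user.</p>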
<p>The key is balancing thoroughness with efficiency. Not every multi-turn failure requires multi-turn analysis.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-debug-multi-turn-conversation-traces.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
</section>
<section id="q-how-do-i-evaluate-sessions-with-human-handoffs" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-evaluate-sessions-with-human-handoffs">Q: How do I evaluate sessions with human handoffs?</h2>
<p>Capture the complete user journey in your traces, including human handoffs. The trace continues until the user’s need is resolved or the session ends, not when AI hands off to a human. Log the handoff decision, why it occurred, context transferred, wait time, human actions, final resolution, and whether the human had sufficient context. Many failures occur at handoff boundaries where AI hands off too early, too late, or without proper context.</p>
<p>Evaluate handoffs as potential failure modes during error analysis. Ask: Was the handoff necessary? Did the AI provide adequate context? Track both handoff quality and handoff rate. Sometimes the best improvement reduces handoffs entirely rather than improving handoff execution.</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-evaluate-sessions-with-human-handoffs.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-evaluate-complex-multi-step-workflows" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-evaluate-complex-multi-step-workflows">Q: How do I evaluate complex multi-step workflows?</h2>
<p>Log the entire workflow from initial trigger to final business outcome. Include LLM calls, tool usage, human approvals, and database writes in your traces. You will need this visibility to properly diagnose failures.</p>
<p>Use both outcome and process metrics. Outcome metrics verify the final result meets requirements: Was the business case complete? Accurate? Properly formatted? Process metrics evaluate efficiency: step count, time taken, resource usage. Process failures are often easier to debug since they’re more deterministic, so tackle them first.</p>
<p>Segment your error analysis by workflow stages. Early stage failures (understanding user input) differ from middle stage failures (data processing) and late stage failures (formatting output). Early stage improvements have more impact since errors cascade in LLM chains.</p>
<p>Use transition failure matrices to analyze where workflows break. Create a matrix showing the last successful state versus where the first failure occurred. This reveals failure hotspots and guides where to invest debugging effort.</p>
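<p>A minimal sketch of building such a matrix from failing runs (the state names and failure counts are hypothetical):</p>

```python
from collections import Counter

def transition_failure_matrix(failing_runs):
    """failing_runs: iterable of (last_ok_state, first_failed_state) pairs,
    one per failing workflow run. Returns a Counter keyed by the pair."""
    return Counter(failing_runs)

def hotspots(matrix, top=3):
    """The most common failure transitions, i.e. where to debug first."""
    return matrix.most_common(top)

# Hypothetical data: 9 runs break between understanding input and fetching
# data, 2 between fetching data and formatting output.
runs = ([("understand_input", "fetch_data")] * 9
        + [("fetch_data", "format_output")] * 2)
matrix = transition_failure_matrix(runs)
assert hotspots(matrix, top=1) == [(("understand_input", "fetch_data"), 9)]
```

<p>Pivoting the counter into a table (rows: last successful state, columns: first failure) gives the matrix view described above.</p>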
<p><a href="../../../blog/posts/evals-faq/how-do-i-evaluate-complex-multi-step-workflows.html" class="faq-individual-link">↗ Focus view</a></p>
</section>
<section id="q-how-do-i-evaluate-agentic-workflows" class="level2">
<h2 class="anchored" data-anchor-id="q-how-do-i-evaluate-agentic-workflows">Q: How do I evaluate agentic workflows?</h2>
<p>We recommend evaluating agentic workflows in two phases:</p>
<p><strong>1. End-to-end task success.</strong> Treat the agent as a black box and ask “did we meet the user’s goal?”. Define a precise success rule per task (exact answer, correct side-effect, etc.) and measure with human or <a href="https://hamel.dev/blog/posts/llm-judge/" target="_blank">aligned LLM judges</a>. Take note of the first upstream failure when conducting error analysis.</p>
<p>Once error analysis reveals which workflows fail most often, move to step-level diagnostics to understand why they’re failing.</p>
<p><strong>2. Step-level diagnostics.</strong> Assuming that you have sufficiently <a href="https://hamel.dev/blog/posts/evals/#logging-traces" target="_blank">instrumented your system</a> with details of tool calls and responses, you can score individual components such as:</p>
<ul>
<li><em>Tool choice</em>: was the selected tool appropriate?</li>
<li><em>Parameter extraction</em>: were inputs complete and well-formed?</li>
<li><em>Error handling</em>: did the agent recover from empty results or API failures?</li>
<li><em>Context retention</em>: did it preserve earlier constraints?</li>
<li><em>Efficiency</em>: how many steps, seconds, and tokens were spent?</li>
<li><em>Goal checkpoints</em>: for long workflows, verify key milestones.</li>
</ul>
<p>Example: “Find Berkeley homes under $1M and schedule viewings” breaks into: parameters extracted correctly, relevant listings retrieved, availability checked, and calendar invites sent. Each checkpoint can pass or fail independently, making debugging tractable.</p>
<p><strong>Use transition failure matrices to understand error patterns.</strong> Create a matrix where rows represent the last successful state and columns represent where the first failure occurred. This is a great way to understand where the most failures occur.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/shreya_matrix.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Transition failure matrix showing hotspots in text-to-SQL agent workflow</figcaption>
</figure>
</div>
<p>Transition matrices transform overwhelming agent complexity into actionable insights. Instead of drowning in individual trace reviews, you can immediately see that GenSQL → ExecSQL transitions cause 12 failures while DecideTool → PlanCal causes only 2. This data-driven approach guides where to invest debugging effort. Here is another <a href="https://www.figma.com/deck/nwRlh5renu4s4olaCsf9lG/Failure-is-a-Funnel?node-id=2009-927&amp;t=GJlTtxQ8bLJaQ92A-1" target="_blank">example</a> from Bryan Bischof, that is also a text-to-SQL agent:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/evals-faq/images/bischof_matrix.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Bischof, Bryan “Failure is A Funnel - Data Council, 2025”</figcaption>
</figure>
</div>
<p>In this example, Bryan shows variation in transition matrices across experiments. How you organize your transition matrix depends on the specifics of your application. For example, Bryan’s text-to-SQL agent has an inherent sequential workflow which he exploits for further analytical insight. You can watch his <a href="https://youtu.be/R_HnI9oTv3c?si=hRRhDiydHU5k6ikc" target="_blank">full talk</a> for more details.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/R_HnI9oTv3c" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<p><strong>Creating Test Cases for Agent Failures</strong></p>
<p>Creating test cases for agent failures follows the same principles as our previous FAQ on debugging multi-turn conversation traces (i.e.&nbsp;try to reproduce the error in the simplest way possible, only use multi-turn tests when the failure actually requires conversation context, etc.).</p>
<p><a href="../../../blog/posts/evals-faq/how-do-i-evaluate-agentic-workflows.html" class="faq-individual-link">↗ Focus view</a></p>
<hr>
<div class="cta" style="text-align: center;">
<p><strong>👉 <em>Want to learn more about AI Evals? Check out our <a href="https://bit.ly/evals-ai" target="_blank">AI Evals course</a></em></strong>. It’s a live cohort with hands-on exercises and office hours. Here is a <a href="https://bit.ly/evals-ai" target="_blank">25% discount code</a> for readers. 👈</p>
</div>
<hr>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p><a href="https://www.linkedin.com/in/intellectronica/" target="_blank">Eleanor Berger</a>, our wonderful TA.↩︎</p></li>
<li id="fn2"><p>Paul Graham, <a href="https://paulgraham.com/writes.html" target="_blank">“Writes and Write-Nots”</a>↩︎</p></li>
<li id="fn3"><p>Shreya Shankar, et al., <a href="https://arxiv.org/abs/2404.12272" target="_blank">“Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences”</a>↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <category>evals</category>
  <guid>https://hamel.dev/blog/posts/evals-faq/</guid>
  <pubDate>Thu, 15 Jan 2026 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/evals-faq/images/eval_faq.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Selecting The Right AI Evals Tool</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/eval-tools/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>Over the past year, I’ve focused heavily on <a href="../../../blog/posts/evals-faq/index.html">AI Evals</a>, both in my consulting work and teaching. A question I get constantly is, “What’s the best tool for evals?”. I’ve always resisted answering directly for two reasons. First, people focus too much on tools instead of the process, thinking the tool will be an off-the-shelf solution when it rarely is. Second, the tools change so quickly that comparisons become outdated immediately.</p>
<p>Having used many of the popular eval tools, I can genuinely say that no single one is superior in every dimension. The “best” tool depends on your team’s skillset, technical stack, and maturity.</p>
<p>Instead of a feature-by-feature comparison, I think it’s more valuable to show you <em>how</em> a panel of data scientists skilled in evals assesses these tools. As part of my AI Evals <a href="https://maven.com/parlance-labs/evals?promoCode=evals-info-url" target="_blank">course</a>, we had three of the most dominant vendors—Langsmith, Braintrust, and Arize Phoenix—complete the same homework <a href="https://github.com/ai-evals-course/recipe-chatbot" target="_blank">assignment</a>. This gave us a unique opportunity to see how they tackle the exact same challenge.</p>
<p>We recorded the entire process and live commentary, which is available below. We think this might be helpful in learning about the kinds of things you should consider when selecting a tool for your team.</p>
<p><em>Thanks to <a href="https://www.sh-reya.com/">Shreya Shankar</a> and <a href="https://x.com/BEBischof">Bryan Bischof</a> for serving as the panelists (alongside me).</em></p>
<section id="langsmith" class="level2">
<h2 class="anchored" data-anchor-id="langsmith">Langsmith</h2>
<p>With <a href="https://x.com/hwchase17">Harrison Chase</a>, CEO of LangChain.</p>
<div style="width: 70%; margin: auto;">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/y0vm_fjkejo" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
</section>
<section id="braintrust" class="level2">
<h2 class="anchored" data-anchor-id="braintrust">Braintrust</h2>
<p>With <a href="https://x.com/waydegilliam">Wayde Gilliam</a>, former developer relations at Braintrust.</p>
<div style="width: 70%; margin: auto;">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/97iykOemOn4" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
</section>
<section id="arize-phoenix" class="level2">
<h2 class="anchored" data-anchor-id="arize-phoenix">Arize Phoenix</h2>
<p>With <a href="https://www.linkedin.com/in/sallyann-delucia-59a381172/">SallyAnn DeLucia</a>, Technical AI Product Leader at Arize.</p>
<div style="width: 70%; margin: auto;">
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/wcYnzHJlUR0" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</div>
<hr>
</section>
<section id="criteria-for-assessing-ai-evals-tools" class="level2">
<h2 class="anchored" data-anchor-id="criteria-for-assessing-ai-evals-tools">Criteria for Assessing AI Evals Tools</h2>
<p>Here are themes that consistently surfaced during our review.</p>
<section id="workflow-and-developer-experience" class="level3">
<h3 class="anchored" data-anchor-id="workflow-and-developer-experience">1. Workflow and Developer Experience</h3>
<p>Reducing friction is more important than any single feature. Concretely, you should be mindful of the time it takes to go from observing a failure to iterating on a solution. For example, we appreciated the ability to go from viewing a single trace to experimenting with that same trace in a playground. For some teams with data-science backgrounds, a notebook-centric workflow is ideal as it provides transparency and control. This happens to be my preferred workflow as well.</p>
<p>When considering a notebook-centric workflow, it’s important to pay attention to the ergonomics of the SDK. This often boils down to the quality of the documentation and integration with existing data tools.</p>
</section>
<section id="human-in-the-loop-support" class="level3">
<h3 class="anchored" data-anchor-id="human-in-the-loop-support">2. Human-in-the-Loop Support</h3>
<p>The best tools don’t try to automate away the human; they empower them. Since error analysis is the highest ROI activity in AI engineering, a tool’s ability to support efficient human review is paramount. Prioritize tools with first-class support for manual annotation and error analysis. As of this writing, one thing that is missing from many tools is <a href="../../../blog/posts/evals-faq/index.html#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed" target="_blank">axial coding</a>.</p>
</section>
<section id="transparency-and-control-vs.-magic" class="level3">
<h3 class="anchored" data-anchor-id="transparency-and-control-vs.-magic">3. Transparency and Control vs.&nbsp;“Magic”</h3>
<p>Be deeply skeptical of features that promise full automation without human validation, as these can create a powerful and dangerous illusion of confidence. For example, be wary of features where an AI agent both creates an evaluation rubric and then immediately scores the outputs. This “stacking of abstractions” often hides flaws behind a high score. Favor tools that give you control and visibility.</p>
</section>
<section id="ecosystem-integration-vs.-walled-gardens" class="level3">
<h3 class="anchored" data-anchor-id="ecosystem-integration-vs.-walled-gardens">4. Ecosystem Integration vs.&nbsp;Walled Gardens</h3>
<p>An eval tool should fit your stack, not force you to fit its stack. Assess how well a tool integrates with your existing technologies. Also, beware of proprietary DSLs as they can add friction. Finally, the ability to export data into common formats for analysis in a variety of environments is a must-have.</p>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>The right choice of tool depends on your team’s workflow, skillset, and specific needs. I hope seeing how our panel approached this evaluation provides a better framework for making your own decision.</p>
<p>As for me personally, I tend to use these tools as a backend data store and use Jupyter notebooks as well as my own <a href="../../../blog/posts/evals-faq/index.html#q-should-i-build-a-custom-annotation-tool-or-use-something-off-the-shelf" target="_blank">custom built</a> annotation interfaces for most of my needs.</p>
<hr>
<div class="callout callout-style-default callout-caution callout-titled">
<div class="callout-header d-flex align-content-center" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="true" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Caution</span>Appendix: Vendor Snapshots (As of Mid-2025)
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse show">
<div class="callout-body-container callout-body">
<p><strong>You should take these notes with a grain of salt. I recommend watching the videos above to get a sense of how we applied these criteria and where you might differ according to your needs.</strong></p>
<section id="langsmith-evaluation-notes" class="level3">
<h3 class="anchored" data-anchor-id="langsmith-evaluation-notes">Langsmith Evaluation Notes</h3>
<p><strong>Overall Sentiment</strong> The overall workflow is intuitive, especially for those new to formal evaluation processes. The UI guides you through creating datasets, running experiments, and annotating results.</p>
<p><strong>Positive Feedback / What We Liked</strong></p>
<ul>
<li><strong>Seamless Workflow from Trace to Playground:</strong> The transition from inspecting a trace to experimenting with it in the playground is very smooth.</li>
<li><strong>AI-Assisted Prompt Improvement:</strong> The “Prompt Canvas” feature is a powerful tool for prompt engineering.</li>
<li><strong>Dataset Creation and Management:</strong> You can easily create datasets by uploading files, and the schema detection helps structure the data correctly.</li>
<li><strong>Experimentation and Evaluation:</strong> The “Annotation Queue” is a dedicated interface for human review and labeling of traces, which is more efficient than using spreadsheets.</li>
</ul>
<p><strong>Critiques and Areas for Improvement</strong></p>
<ul>
<li><strong>Limited Side-by-Side Comparison:</strong> The UI doesn’t make it easy to see side-by-side comparisons of different prompt versions and their outputs.</li>
<li><strong>UI/UX Concerns:</strong> The UI can feel a bit cluttered, with a lot of options and information presented at once.</li>
<li><strong>Potential for Over-Automation:</strong> Features like AI-generated examples, while convenient, can lead to homogenous data.</li>
</ul>
</section>
<section id="braintrust-evaluation-notes" class="level3">
<h3 class="anchored" data-anchor-id="braintrust-evaluation-notes">Braintrust Evaluation Notes</h3>
<p><strong>Overall Sentiment</strong> The panel had a generally positive view of Braintrust, highlighting its clean UI and structured approach to evaluations. The tool’s emphasis on human-in-the-loop workflows was a significant strength.</p>
<p><strong>Positive Feedback / What We Liked</strong></p>
<ul>
<li><strong>Focus on a Structured Evals Process:</strong> The demonstration emphasized a solid, methodical approach, starting by involving subject-matter experts to create an initial dataset.</li>
<li><strong>Clean and Intuitive User Interface (UI):</strong> The panel found the UI to be clean and easier to navigate than other tools, with a particularly readable trace viewing screen.</li>
<li><strong>Strong Support for Human-in-the-Loop Workflows:</strong> The platform has dedicated UIs designed for human review and annotation, which is critical for creating high-quality datasets and performing error analysis.</li>
<li><strong>The “Money Table”:</strong> After annotating traces with failure modes, the final dataset view is an actionable output that allows teams to quickly sort, filter, and quantify the most common failure modes.</li>
</ul>
<p><strong>Critiques and Areas for Improvement</strong></p>
<ul>
<li><strong>The “Loop” AI Scorer:</strong> The most significant concern was the “Loop” feature, an AI agent that creates an evaluation rubric and then immediately scores the outputs, which could lead to a false sense of security.</li>
<li><strong>Reliance on a Proprietary Query Language (BTQL):</strong> The panel viewed the use of “BTQL” with mild skepticism, stating a preference for exporting data to a Jupyter notebook.</li>
<li><strong>Clunky Data Workflows:</strong> The process for generating and refining synthetic data seemed inefficient, requiring downloading and re-uploading data between steps.</li>
</ul>
</section>
<section id="arize-phoenix-evaluation-notes" class="level3">
<h3 class="anchored" data-anchor-id="arize-phoenix-evaluation-notes">Arize Phoenix Evaluation Notes</h3>
<p><strong>Overall Sentiment</strong> The panel had a generally positive view of Phoenix, with one panelist calling it one of his “favorite open source eval tools.” The tool is positioned as a developer-first, notebook-centric platform.</p>
<p><strong>Positive Feedback / What We Liked</strong></p>
<ul>
<li><strong>Notebook-Centric Workflow:</strong> The entire evaluation process was driven from a Jupyter notebook, giving the developer transparency and control. The ability to export annotated data back into a Pandas DataFrame was a powerful feature.</li>
<li><strong>UI &amp; Developer Experience:</strong> The prompt management UI was praised for being clear and easy to understand. The tight integration between traces and the “Playground” was also noted as a smooth workflow.</li>
<li><strong>Open Source &amp; Local-First Approach:</strong> Phoenix can be run entirely locally, providing a sense of control and transparency. As an open-source tool, it was noted for being “hackable.”</li>
</ul>
<p><strong>Critiques and Areas for Improvement</strong></p>
<ul>
<li><strong>UI Readability:</strong> The text in the output panes was difficult to read during the demonstration, with a possible lack of markdown rendering for model outputs.</li>
<li><strong>Metrics and Visualization:</strong> The tool displays point statistics for each run, but the panel found this of limited use and expressed a desire for aggregate visualizations like histograms to identify outliers.</li>
<li><strong>Prompt Management and Testing:</strong> The prompt editor treats the system prompt as one large, monolithic block of text. A more component-based approach where individual instructions could be toggled on and off (“ablated”) would be preferable for systematic testing.</li>
</ul>
</section>
</div>
</div>
</div>


</section>
</section>

 ]]></description>
  <category>AI</category>
  <category>Evals</category>
  <guid>https://hamel.dev/blog/posts/eval-tools/</guid>
  <pubDate>Wed, 01 Oct 2025 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/eval-tools/cover-img-2.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Stop Saying RAG Is Dead</title>
  <link>https://hamel.dev/</link>
  <description>Why the future of RAG lies in better retrieval, not bigger context windows.
</description>
  <category>LLMs</category>
  <category>RAG</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Fri, 11 Jul 2025 07:00:00 GMT</pubDate>
</item>
<item>
  <title>Inspect AI, An OSS Python Library For LLM Evals</title>
  <link>https://hamel.dev/</link>
  <description>A look at Inspect AI with its creator, JJ Allaire - an open-source framework for building and running LLM evaluations.
</description>
  <category>LLMs</category>
  <category>Evals</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Mon, 23 Jun 2025 07:00:00 GMT</pubDate>
</item>
<item>
  <title>A Field Guide to Rapidly Improving AI Products</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/field-guide/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>Most AI teams focus on the wrong things. Here’s a common scene from my consulting work:</p>
<div class="screenplay" style="border: 2px solid #ccc; border-radius: 5px; padding: 20px; font-family: Courier, monospace;">
<div style="margin-bottom: 15px;">
<div style="text-transform: uppercase; font-weight: bold;">
AI TEAM
</div>
<blockquote class="blockquote">
<p>
Here’s our agent architecture – we’ve got RAG here, a router there, and we’re using this new framework for…
</p>
</blockquote>
</div>
<div style="margin-bottom: 15px;">
<div style="text-transform: uppercase; font-weight: bold;">
ME
</div>
<blockquote class="blockquote">
<div style="font-style: italic;">
<p>
[Holding up my hand to pause the enthusiastic tech lead.]
</p>
</div>
<p>
“Can you show me how you’re measuring if any of this actually works?”
</p>
</blockquote>
</div>
<div style="margin-bottom: 15px;">
<div style="font-style: italic;">
… Room goes quiet
</div>
</div>
</div>
<p><br></p>
<p>This scene has played out dozens of times over the last two years. Teams invest weeks building complex AI systems, but can’t tell me if their changes are helping or hurting.</p>
<p>This isn’t surprising. With new tools and frameworks emerging weekly, it’s natural to focus on tangible things we can control – which vector database to use, which LLM provider to choose, which agent framework to adopt. But after helping 30+ companies build AI products, I’ve discovered the teams who succeed barely talk about tools at all. Instead, they obsess over measurement and iteration.</p>
<p>In this post, I’ll show you exactly how these successful teams operate. You’ll learn:</p>
<ol type="1">
<li>How error analysis consistently reveals the highest-ROI improvements</li>
<li>Why a simple data viewer is your most important AI investment</li>
<li>How to empower domain experts (not just engineers) to improve your AI</li>
<li>Why synthetic data is more effective than you think</li>
<li>How to maintain trust in your evaluation system</li>
<li>Why your AI roadmap should count experiments, not features</li>
</ol>
<p>I’ll explain each of these topics with real examples. While every situation is unique, you’ll see patterns that apply regardless of your domain or team size.</p>
<p>Let’s start by examining the most common mistake I see teams make – one that derails AI projects before they even begin.</p>
<section id="the-most-common-mistake-skipping-error-analysis" class="level2">
<h2 class="anchored" data-anchor-id="the-most-common-mistake-skipping-error-analysis">1. The Most Common Mistake: Skipping Error Analysis</h2>
<p>The “tools first” mindset is the most common mistake in AI development. Teams get caught up in architecture diagrams, frameworks, and dashboards while neglecting the process of actually understanding what’s working and what isn’t.</p>
<p>One client proudly showed me this evaluation dashboard:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/dashboard.png" class="img-fluid figure-img"></p>
<figcaption>The kind of dashboard that foreshadows failure.</figcaption>
</figure>
</div>
<p>This is the “tools trap” – the belief that adopting the right tools or frameworks (in this case, generic metrics) will solve your AI problems. Generic metrics are worse than useless – they actively impede progress in two ways:</p>
<p>First, they create a <strong>false sense of measurement and progress</strong>. Teams think they’re data-driven because they have dashboards, but they’re tracking vanity metrics that don’t correlate with real user problems. I’ve seen teams celebrate improving their “helpfulness score” by 10% while their actual users were still struggling with basic tasks. It’s like optimizing your website’s load time while your checkout process is broken – you’re getting better at the wrong thing.</p>
<p>Second, too many metrics fragment your attention. Instead of focusing on the few metrics that matter for your specific use case, you’re trying to optimize multiple dimensions simultaneously. When everything is important, nothing is.</p>
<p>The alternative? Error analysis - the single most valuable activity in AI development and consistently the highest-ROI activity. Let me show you what effective error analysis looks like in practice.</p>
<section id="the-error-analysis-process" class="level3">
<h3 class="anchored" data-anchor-id="the-error-analysis-process">The Error Analysis Process</h3>
<p>When Jacob, the founder of <a href="https://nurtureboss.io/">NurtureBoss</a>, needed to improve their apartment-industry AI assistant, his team built a simple viewer to examine conversations between their AI and users. Next to each conversation was a space for open-ended notes about failure modes.</p>
<p>After annotating dozens of conversations, clear patterns emerged. Their AI was struggling with date handling – failing 66% of the time when users said things like “let’s schedule a tour two weeks from now.”</p>
<p>Instead of reaching for new tools, they:</p>
<ol type="1">
<li>Looked at actual conversation logs</li>
<li>Categorized the types of date-handling failures</li>
<li>Built specific tests to catch these issues</li>
<li>Measured improvement on these metrics</li>
</ol>
<p>The result? Their date handling success rate improved from 33% to 95%.</p>
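<p>The “build specific tests” step can be sketched in plain Python. Everything below (the function name, the phrasing table, the parsing logic) is illustrative; the actual NurtureBoss system used an LLM for date handling, so treat this as a hedged stand-in for the kind of targeted check the team wrote:</p>

```python
import re
from datetime import date, timedelta
from typing import Optional

# Hypothetical stand-in for the assistant's date-handling step.
# Number words and phrasings here are illustrative only.
WORD_NUMS = {"one": 1, "two": 2, "three": 3, "a": 1}

def parse_relative_date(text: str, today: date) -> Optional[date]:
    """Parse phrases like 'two weeks from now' into a concrete date."""
    m = re.search(r"(\w+)\s+(day|week|month)s?\s+from\s+now", text.lower())
    if not m:
        return None
    word = m.group(1)
    qty = WORD_NUMS.get(word, int(word) if word.isdigit() else None)
    if qty is None:
        return None
    unit_days = {"day": 1, "week": 7, "month": 30}[m.group(2)]
    return today + timedelta(days=qty * unit_days)

# Targeted checks mirroring the failure mode surfaced by error analysis
today = date(2025, 1, 1)
assert parse_relative_date("let's schedule a tour two weeks from now", today) == date(2025, 1, 15)
assert parse_relative_date("tomorrow works", today) is None
```

<p>The point is not the parser itself but the shape of the test: a concrete input drawn from a real failing conversation, with an exact expected output you can re-run after every change.</p>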
<p>Here’s Jacob explaining this process himself:</p>
<div class="quarto-video"><iframe data-external="1" src="https://www.youtube.com/embed/e2i6JbU2R-s" width="800" height="450" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
<section id="bottom-up-vs.-top-down-analysis" class="level3">
<h3 class="anchored" data-anchor-id="bottom-up-vs.-top-down-analysis">Bottom-Up vs.&nbsp;Top-Down Analysis</h3>
<p>When identifying error types, you can take either a “top-down” or “bottom-up” approach.</p>
<p>The <strong>top-down</strong> approach starts with common metrics like “hallucination” or “toxicity” plus metrics unique to your task. While convenient, it often misses domain-specific issues.</p>
<p>The more effective <strong>bottom-up</strong> approach forces you to look at actual data and let metrics naturally emerge. At NurtureBoss, we started with a spreadsheet where each row represented a conversation. We wrote open-ended notes on any undesired behavior. Then we used an LLM to build a taxonomy of common failure modes. Finally, we mapped each row to specific failure mode labels and counted the frequency of each issue.</p>
<p>The results were striking - just three issues accounted for over 60% of all problems:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/pivot.png" class="img-fluid figure-img" width="200"></p>
<figcaption>Excel Pivot Tables are a simple tool, but they work!</figcaption>
</figure>
</div>
<ul>
<li>Conversation flow issues (missing context, awkward responses)</li>
<li>Handoff failures (not recognizing when to transfer to humans)</li>
<li>Rescheduling problems (struggling with date handling)</li>
</ul>
<p>The impact was immediate. Jacob’s team had uncovered so many actionable insights that they needed several weeks just to implement fixes for the problems we’d already found.</p>
<p>If you’d like to see error analysis in action, we recorded a <a href="https://youtu.be/qH1dZ8JLLdU">live walkthrough here</a>.</p>
<p>This brings us to a crucial question: How do you make it easy for teams to look at their data? The answer leads us to what I consider the most important investment any AI team can make…</p>
</section>
</section>
<section id="the-most-important-ai-investment-a-simple-data-viewer" class="level2">
<h2 class="anchored" data-anchor-id="the-most-important-ai-investment-a-simple-data-viewer">2. The Most Important AI Investment: A Simple Data Viewer</h2>
<p>The single most impactful investment I’ve seen AI teams make isn’t a fancy evaluation dashboard – it’s building a customized interface that lets anyone examine what their AI is actually doing. I emphasize <em>customized</em> because every domain has unique needs that off-the-shelf tools rarely address. When reviewing apartment leasing conversations, you need to see the full chat history and scheduling context. For real estate queries, you need the property details and source documents right there. Even small UX decisions – like where to place metadata or which filters to expose – can make the difference between a tool people actually use and one they avoid.</p>
<p>I’ve watched teams struggle with generic labeling interfaces, hunting through multiple systems just to understand a single interaction. The friction adds up: clicking through to different systems to see context, copying error descriptions into separate tracking sheets, switching between tools to verify information. This friction doesn’t just slow teams down – it actively discourages the kind of systematic analysis that catches subtle issues.</p>
<p>Teams with thoughtfully designed data viewers iterate 10x faster than those without them. And here’s the thing: <strong>these tools can be built in hours using AI-assisted development</strong> (like Cursor or Lovable). The investment is minimal compared to the returns.</p>
<p>Let me show you what I mean. Here’s the data viewer built for NurtureBoss (which we discussed earlier):</p>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/nboss_filter.png" class="img-fluid figure-img"></p>
<figcaption>Search and filter sessions</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/nboss_annotate.png" class="img-fluid figure-img"></p>
<figcaption>Annotate and add notes</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/nboss_analysis.png" class="img-fluid figure-img"></p>
<figcaption>Aggregate and count errors</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>Here’s what makes a good data annotation tool:</p>
<ol type="1">
<li>Show all context in one place. Don’t make users hunt through different systems to understand what happened.</li>
<li>Make feedback trivial to capture. One-click correct/incorrect buttons beat lengthy forms.</li>
<li>Capture open-ended feedback. This lets you capture nuanced issues that don’t fit into a pre-defined taxonomy.</li>
<li>Enable quick filtering and sorting. Teams need to easily dive into specific error types. In the example above, NurtureBoss can quickly filter by channel (voice, text, chat) or by the specific property they want to look at.</li>
<li>Have hotkeys that allow users to navigate between data examples and annotate without clicking.</li>
</ol>
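<p>To make the checklist concrete, here is a rough sketch of the data model such a viewer sits on top of. The field names and filter function are hypothetical, invented for illustration rather than taken from NurtureBoss’s actual tool:</p>

```python
from dataclasses import dataclass

# Hypothetical record behind an annotation viewer
@dataclass
class Annotation:
    trace_id: str
    channel: str       # e.g. "voice", "text", "chat"
    is_correct: bool   # one-click correct/incorrect (point 2)
    notes: str = ""    # open-ended feedback (point 3)

def filter_annotations(annotations, channel=None, only_errors=False):
    """Quick filtering, as in point 4 of the checklist."""
    out = annotations
    if channel is not None:
        out = [a for a in out if a.channel == channel]
    if only_errors:
        out = [a for a in out if not a.is_correct]
    return list(out)

data = [
    Annotation("t1", "voice", False, "missed context"),
    Annotation("t2", "text", True),
    Annotation("t3", "voice", True),
]
errors = filter_annotations(data, channel="voice", only_errors=True)
print([a.trace_id for a in errors])  # ['t1']
```

<p>A viewer is mostly this: a flat table of annotated traces plus fast filters over it. The UI on top can be as simple as your team tolerates.</p>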
<p>It doesn’t matter what web framework you use - use whatever you are familiar with. Because I’m a Python developer, my current favorite web framework is <a href="https://fastht.ml/docs/">FastHTML</a> coupled with <a href="https://www.answer.ai/posts/2025-01-15-monsterui.html">MonsterUI</a>, because it allows me to define the back-end and front-end code in one small Python file.</p>
<p>The key is starting somewhere, even if it’s simple. I’ve found custom web apps provide the best experience, but if you’re just beginning, a spreadsheet is better than nothing. As your needs grow, you can evolve your tools accordingly.</p>
<p>This brings us to another counter-intuitive lesson: the people best positioned to improve your AI system are often the ones who know the least about AI.</p>
</section>
<section id="empower-domain-experts-to-write-prompts" class="level2">
<h2 class="anchored" data-anchor-id="empower-domain-experts-to-write-prompts">3. Empower Domain Experts To Write Prompts</h2>
<p>I recently worked with an education startup building an interactive learning platform with LLMs. Their product manager, a learning design expert, would create detailed PowerPoint decks explaining pedagogical principles and example dialogues. She’d present these to the engineering team, who would then translate her expertise into prompts.</p>
<p>But here’s the thing: prompts are just English. Having a learning expert communicate teaching principles through PowerPoint, only for engineers to translate that back into English prompts, created unnecessary friction. The most successful teams flip this model by giving domain experts tools to write and iterate on prompts directly.</p>
<section id="build-bridges-not-gatekeepers" class="level3">
<h3 class="anchored" data-anchor-id="build-bridges-not-gatekeepers">Build Bridges, Not Gatekeepers</h3>
<p>Prompt playgrounds are a great starting point for this. Tools like Arize, LangSmith, and Braintrust let teams quickly test different prompts, feed in example datasets, and compare results. Here are some screenshots of these tools:</p>
<div class="quarto-layout-panel" data-layout-ncol="3">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/pp_phoenix2.png" class="img-fluid figure-img"></p>
<figcaption>Arize Phoenix</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/pp_langsmith.png" class="img-fluid figure-img"></p>
<figcaption>LangSmith</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 33.3%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/pp_bt.png" class="img-fluid figure-img"></p>
<figcaption>Braintrust</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>But there’s a crucial next step that many teams miss: integrating prompt development into their application context. Most AI applications aren’t just prompts – they commonly involve RAG systems pulling from your knowledge base, agent orchestration coordinating multiple steps, and application-specific business logic. The most effective teams I’ve worked with go beyond standalone playgrounds. They build what I call <em><strong>integrated prompt environments</strong></em> – essentially admin versions of their actual user interface that expose prompt editing.</p>
<p>Here’s an illustration of what an integrated prompt environment might look like for a real estate AI assistant:</p>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/ipe_before.png" class="img-fluid figure-img" width="700"></p>
<figcaption>The UI that users (real estate agents) see.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/ipe_after.png" class="img-fluid figure-img"></p>
<figcaption>The same UI, but with an “admin mode” used by the engineering &amp; product team to iterate on the prompt and debug issues.</figcaption>
</figure>
</div>
</div>
</div>
</div>
</section>
<section id="tips-for-communicating-with-domain-experts" class="level3">
<h3 class="anchored" data-anchor-id="tips-for-communicating-with-domain-experts">Tips For Communicating With Domain Experts</h3>
<p>There’s another barrier that often prevents domain experts from contributing effectively: unnecessary jargon. I was working with an education startup where engineers, product managers, and learning specialists were talking past each other in meetings. The engineers kept saying, “We’re going to build an agent that does XYZ,” when really the job to be done was writing a prompt. This created an artificial barrier – the learning specialists, who were the actual domain experts, felt like they couldn’t contribute because they didn’t understand “agents.”</p>
<p>This happens everywhere. I’ve seen it with lawyers at legal tech companies, psychologists at mental health startups, and doctors at healthcare firms. The magic of LLMs is that they make AI accessible through natural language, but we often destroy that advantage by wrapping everything in technical terminology.</p>
<p>Here’s a simple example of how to translate common AI jargon:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 73%">
<col style="width: 26%">
</colgroup>
<thead>
<tr class="header">
<th>Instead of saying…</th>
<th>Say…</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>“We’re implementing a RAG approach”</td>
<td>“We’re making sure the model has the right context to answer questions”</td>
</tr>
<tr class="even">
<td>“We need to prevent prompt injection”</td>
<td>“We need to make sure users can’t trick the AI into ignoring our rules”</td>
</tr>
<tr class="odd">
<td>“Our model suffers from hallucination issues”</td>
<td>“Sometimes the AI makes things up, so we need to check its answers”</td>
</tr>
</tbody>
</table>
<p>This doesn’t mean dumbing things down – it means being precise about what you’re actually doing. When you say “we’re building an agent,” what specific capability are you adding? Is it function calling? Tool use? Or just a better prompt? Being specific helps everyone understand what’s actually happening.</p>
<p>There’s nuance here. Technical terminology exists for a reason – it provides precision when talking with other technical stakeholders. The key is adapting your language to your audience.</p>
<p>The challenge many teams raise at this point is: “This all sounds great, but what if we don’t have any data yet? How can we look at examples or iterate on prompts when we’re just starting out?” That’s what we’ll talk about next.</p>
</section>
</section>
<section id="bootstrapping-your-ai-with-synthetic-data-is-effective-even-with-zero-users" class="level2">
<h2 class="anchored" data-anchor-id="bootstrapping-your-ai-with-synthetic-data-is-effective-even-with-zero-users">4. Bootstrapping Your AI With Synthetic Data Is Effective (Even With Zero Users)</h2>
<p>One of the most common roadblocks I hear from teams is: “We can’t do proper evaluation because we don’t have enough real user data yet.” This creates a chicken-and-egg problem – you need data to improve your AI, but you need a decent AI to get users who generate that data.</p>
<p>Fortunately, there’s a solution that works surprisingly well: synthetic data. LLMs can generate realistic test cases that cover the range of scenarios your AI will encounter.</p>
<p>As I wrote in my <a href="https://hamel.dev/blog/posts/llm-judge/#generating-data">LLM-as-a-Judge blog post</a>, synthetic data can be remarkably effective for evaluation. <a href="https://www.linkedin.com/in/bryan-bischof/">Bryan Bischof</a>, the former Head of AI at Hex, put it perfectly:</p>
<blockquote class="blockquote">
<p>“LLMs are surprisingly good at generating excellent - and diverse - examples of user prompts. This can be relevant for powering application features, and sneakily, for building Evals. If this sounds a bit like the Large Language Snake is eating its tail, I was just as surprised as you! All I can say is: it works, ship it.”</p>
</blockquote>
<section id="a-framework-for-generating-realistic-test-data" class="level3">
<h3 class="anchored" data-anchor-id="a-framework-for-generating-realistic-test-data">A Framework for Generating Realistic Test Data</h3>
<p>The key to effective synthetic data is choosing the right dimensions to test. While these dimensions will vary based on your specific needs, I find it helpful to think about three broad categories:</p>
<ol type="1">
<li><strong>Features</strong>: What capabilities does your AI need to support?</li>
<li><strong>Scenarios</strong>: What situations will it encounter?</li>
<li><strong>User Personas</strong>: Who will be using it and how?</li>
</ol>
<p>These aren’t the only dimensions you might care about – you might also want to test different tones of voice, levels of technical sophistication, or even different locales and languages. The important thing is identifying dimensions that matter for your specific use case.</p>
<p>For a real estate CRM AI assistant I worked on with <a href="https://www.rechat.com/">Rechat</a>, we defined these dimensions like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">features <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"property search"</span>,      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Finding listings matching criteria</span></span>
<span id="cb1-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"market analysis"</span>,      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Analyzing trends and pricing</span></span>
<span id="cb1-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"scheduling"</span>,          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Setting up property viewings</span></span>
<span id="cb1-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"follow-up"</span>           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Post-viewing communication</span></span>
<span id="cb1-6">]</span>
<span id="cb1-7"></span>
<span id="cb1-8">scenarios <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"exact match"</span>,         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># One perfect listing match</span></span>
<span id="cb1-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"multiple matches"</span>,    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Need to help user narrow down</span></span>
<span id="cb1-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"no matches"</span>,         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Need to suggest alternatives</span></span>
<span id="cb1-12">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"invalid criteria"</span>     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Help user correct search terms</span></span>
<span id="cb1-13">]</span>
<span id="cb1-14"></span>
<span id="cb1-15">personas <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first_time_buyer"</span>,    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Needs more guidance and explanation</span></span>
<span id="cb1-17">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"investor"</span>,           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Focused on numbers and ROI</span></span>
<span id="cb1-18">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"luxury_client"</span>,      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Expects white-glove service</span></span>
<span id="cb1-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"relocating_family"</span>   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Has specific neighborhood/school needs</span></span>
<span id="cb1-20">]</span></code></pre></div></div>
<p>But having these dimensions defined is only half the battle. The real challenge is ensuring your synthetic data actually triggers the scenarios you want to test. This requires two things:</p>
<ol type="1">
<li>A test database with enough variety to support your scenarios</li>
<li>A way to verify that generated queries actually trigger intended scenarios</li>
</ol>
<p>For Rechat, we maintained a test database of listings that we knew would trigger different edge cases. Some teams prefer to use an anonymized copy of production data, but either way, you need to ensure your test data has enough variety to exercise the scenarios you care about.</p>
<p>Here’s an example of how we might use these dimensions with real data to generate test cases for the property search feature (this is just pseudo-code, and very illustrative):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> generate_search_query(scenario, persona, listing_db):</span>
<span id="cb2-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Generate a realistic user query about listings"""</span></span>
<span id="cb2-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Pull real listing data to ground the generation</span></span>
<span id="cb2-4">    sample_listings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> listing_db.get_sample_listings(</span>
<span id="cb2-5">        price_range<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>persona.price_range,</span>
<span id="cb2-6">        location<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>persona.preferred_areas</span>
<span id="cb2-7">    )</span>
<span id="cb2-8">    </span>
<span id="cb2-9">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Verify we have listings that will trigger our scenario</span></span>
<span id="cb2-10">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> scenario <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"multiple_matches"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sample_listings) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>:</span>
<span id="cb2-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ValueError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Need multiple listings for this scenario"</span>)</span>
<span id="cb2-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> scenario <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"no_matches"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sample_listings) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb2-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">ValueError</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Found matches when testing no-match scenario"</span>)</span>
<span id="cb2-14">    </span>
<span id="cb2-15">    prompt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span></span>
<span id="cb2-16"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    You are an expert real estate agent who is searching for listings. You are given a customer type and a scenario.</span></span>
<span id="cb2-17"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-18"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    Your job is to generate a natural language query you would use to search these listings.</span></span>
<span id="cb2-19"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-20"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    Context:</span></span>
<span id="cb2-21"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    - Customer type: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>persona<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>description<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-22"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    - Scenario: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>scenario<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-23"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-24"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    Use these actual listings as reference:</span></span>
<span id="cb2-25"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>format_listings(sample_listings)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-26"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb2-27"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    The query should reflect the customer type and the scenario.</span></span>
<span id="cb2-28"></span>
<span id="cb2-29"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    Example query: Find homes in the 75019 zip code, 3 bedrooms, 2 bathrooms, price range $750k - $1M for an investor.</span></span>
<span id="cb2-30"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb2-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> generate_with_llm(prompt)</span></code></pre></div></div>
<p>This produced realistic queries like:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 22%">
<col style="width: 20%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Scenario</th>
<th>Persona</th>
<th>Generated Query</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>property search</td>
<td>multiple matches</td>
<td>first_time_buyer</td>
<td>“Looking for 3-bedroom homes under $500k in the Riverside area. Would love something close to parks since we have young kids.”</td>
</tr>
<tr class="even">
<td>market analysis</td>
<td>no matches</td>
<td>investor</td>
<td>“Need comps for 123 Oak St.&nbsp;Specifically interested in rental yield comparison with similar properties in a 2-mile radius.”</td>
</tr>
</tbody>
</table>
<p>The key to useful synthetic data is grounding it in real system constraints. For the real-estate AI assistant, this means:</p>
<ol type="1">
<li>Using real listing IDs and addresses from their database</li>
<li>Incorporating actual agent schedules and availability windows</li>
<li>Respecting business rules like showing restrictions and notice periods</li>
<li>Including market-specific details like HOA requirements or local regulations</li>
</ol>
<p>We then feed these test cases through Lucy and log the interactions. This gives us a rich dataset to analyze, showing exactly how the AI handles different situations with real system constraints. This approach helped us fix issues before they affected real users.</p>
<p>Sometimes you don’t have access to a production database, especially for new products. In these cases, use LLMs to generate both test queries and the underlying test data. For a real estate AI assistant, this might mean creating synthetic property listings with realistic attributes – prices that match market ranges, valid addresses with real street names, and amenities appropriate for each property type. The key is grounding synthetic data in real-world constraints to make it useful for testing. The specifics of generating robust synthetic databases are beyond the scope of this post.</p>
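<p>One practical piece of this is the validation side: before synthetic listings enter a test database, check them against the real-world constraints you care about. The sketch below is illustrative only; the field names and market ranges are assumptions for demonstration, not from any actual system.</p>

```python
# Illustrative sketch: validate LLM-generated synthetic listings against
# real-world constraints before adding them to a test database.
# All field names and ranges here are assumptions for demonstration.

MARKET_PRICE_RANGE = (100_000, 5_000_000)   # plausible local market bounds
VALID_PROPERTY_TYPES = {"single_family", "condo", "townhouse", "multi_family"}

def validate_listing(listing: dict) -> list[str]:
    """Return a list of constraint violations; empty means the listing is usable."""
    problems = []
    lo, hi = MARKET_PRICE_RANGE
    if not (lo <= listing.get("price", -1) <= hi):
        problems.append(f"price {listing.get('price')} outside market range")
    if listing.get("property_type") not in VALID_PROPERTY_TYPES:
        problems.append(f"unknown property type {listing.get('property_type')!r}")
    if listing.get("bedrooms", 0) < 1 or listing.get("bathrooms", 0) < 1:
        problems.append("bedroom/bathroom counts must be at least 1")
    return problems

def filter_valid(listings: list[dict]) -> list[dict]:
    """Keep only listings that satisfy every constraint."""
    return [l for l in listings if not validate_listing(l)]
```

<p>Rejecting implausible records up front keeps downstream test failures attributable to the AI system rather than to bad test data.</p>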
</section>
<section id="guidelines-for-using-synthetic-data" class="level3">
<h3 class="anchored" data-anchor-id="guidelines-for-using-synthetic-data">Guidelines for Using Synthetic Data</h3>
<p>When generating synthetic data, follow these key principles to ensure it’s effective:</p>
<ol type="1">
<li><p><strong>Diversify your dataset</strong>: Create examples that cover a wide range of features, scenarios, and personas. As I wrote in my <a href="https://hamel.dev/blog/posts/llm-judge/">LLM-as-a-Judge post</a>, this diversity helps you identify edge cases and failure modes you might not anticipate otherwise.</p></li>
<li><p><strong>Generate user inputs, not outputs</strong>: Use LLMs to generate realistic user queries or inputs, not the expected AI responses. This prevents your synthetic data from inheriting the biases or limitations of the generating model.</p></li>
<li><p><strong>Incorporate real system constraints</strong>: Ground your synthetic data in actual system limitations and data. For example, when testing a scheduling feature, use real availability windows and booking rules.</p></li>
<li><p><strong>Verify scenario coverage</strong>: Ensure your generated data actually triggers the scenarios you want to test. A query intended to test “no matches found” should actually return zero results when run against your system.</p></li>
<li><p><strong>Start simple, then add complexity</strong>: Begin with straightforward test cases before adding nuance. This helps isolate issues and establish a baseline before tackling edge cases.</p></li>
</ol>
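<p>The coverage-verification principle (point 4 above) can be sketched concretely: run each generated query's filters against your test data and confirm the intended scenario actually fires. Everything here is a toy stand-in; <code>run_search</code> substitutes an in-memory table for your real retrieval system, and the scenario names mirror the earlier example.</p>

```python
# Illustrative sketch: confirm that a generated query actually triggers
# the scenario it was created to test. `run_search` stands in for the
# real system; here it is a stub over an in-memory listing table.

def run_search(query_filters: dict, listings: list[dict]) -> list[dict]:
    """Stub search: return listings matching every filter exactly."""
    return [
        l for l in listings
        if all(l.get(k) == v for k, v in query_filters.items())
    ]

def classify_scenario(results: list[dict]) -> str:
    """Map a result set onto the scenario taxonomy used in the test plan."""
    if not results:
        return "no_matches"
    if len(results) == 1:
        return "perfect_match"
    return "multiple_matches"

def verify_coverage(test_case: dict, listings: list[dict]) -> bool:
    """True if the query triggers the scenario it was meant to exercise."""
    results = run_search(test_case["filters"], listings)
    return classify_scenario(results) == test_case["intended_scenario"]
```

<p>Test cases that fail this check get regenerated or discarded, so the eval suite only contains queries that exercise what they claim to.</p>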
<p>This approach isn’t just theoretical – it’s been proven in production across dozens of companies. What often starts as a stopgap measure becomes a permanent part of the evaluation infrastructure, even after real user data becomes available.</p>
<p>Let’s look at how to maintain trust in your evaluation system as you scale…</p>
</section>
</section>
<section id="maintaining-trust-in-evals-is-critical" class="level2">
<h2 class="anchored" data-anchor-id="maintaining-trust-in-evals-is-critical">5. Maintaining Trust In Evals Is Critical</h2>
<p>This is a pattern I’ve seen repeatedly: teams build evaluation systems, then gradually lose faith in them. Sometimes it’s because the metrics don’t align with what they observe in production. Other times, it’s because the evaluations become too complex to interpret. Either way, the result is the same – the team reverts to making decisions based on gut feeling and anecdotal feedback, undermining the entire purpose of having evaluations.</p>
<p>Maintaining trust in your evaluation system is just as important as building it in the first place. Here’s how the most successful teams approach this challenge:</p>
<section id="understanding-criteria-drift" class="level3">
<h3 class="anchored" data-anchor-id="understanding-criteria-drift">Understanding Criteria Drift</h3>
<p>One of the most insidious problems in AI evaluation is “criteria drift” – a phenomenon where evaluation criteria evolve as you observe more model outputs. In their paper <a href="https://arxiv.org/abs/2404.12272">“Who Validates the Validators?”</a>, Shankar et al.&nbsp;describe this phenomenon:</p>
<blockquote class="blockquote">
<p>“To grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria.”</p>
</blockquote>
<p>This creates a paradox: you can’t fully define your evaluation criteria until you’ve seen a wide range of outputs, but you need criteria to evaluate those outputs in the first place. In other words, <strong>it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs</strong>.</p>
<p>I’ve observed this firsthand when working with Phillip Carter at Honeycomb on their <a href="https://www.honeycomb.io/blog/introducing-query-assistant">Query Assistant</a> feature. As we evaluated the AI’s ability to generate database queries, Phillip noticed something interesting:</p>
<blockquote class="blockquote">
<p>“Seeing how the LLM breaks down its reasoning made me realize I wasn’t being consistent about how I judged certain edge cases.”</p>
</blockquote>
<p>The process of reviewing AI outputs helped him articulate his own evaluation standards more clearly. This isn’t a sign of poor planning – it’s an inherent characteristic of working with AI systems that produce diverse and sometimes unexpected outputs.</p>
<p>The teams that maintain trust in their evaluation systems embrace this reality rather than fighting it. They treat evaluation criteria as living documents that evolve alongside their understanding of the problem space. They also recognize that different stakeholders might have different (sometimes contradictory) criteria, and they work to reconcile these perspectives rather than imposing a single standard.</p>
</section>
<section id="creating-trustworthy-evaluation-systems" class="level3">
<h3 class="anchored" data-anchor-id="creating-trustworthy-evaluation-systems">Creating Trustworthy Evaluation Systems</h3>
<p>So how do you build evaluation systems that remain trustworthy despite criteria drift? Here are the approaches I’ve found most effective:</p>
<section id="favor-binary-decisions-over-arbitrary-scales" class="level4">
<h4 class="anchored" data-anchor-id="favor-binary-decisions-over-arbitrary-scales">1. Favor Binary Decisions Over Arbitrary Scales</h4>
<p>As I wrote in my <a href="https://hamel.dev/blog/posts/llm-judge/#why-are-simple-passfail-metrics-important">LLM-as-a-Judge post</a>, binary decisions provide clarity that more complex scales often obscure. When faced with a 1-5 scale, evaluators frequently struggle with the difference between a 3 and a 4, introducing inconsistency and subjectivity. What exactly distinguishes “somewhat helpful” from “helpful”? These boundary cases consume disproportionate mental energy and create noise in your evaluation data. And even when businesses use a 1-5 scale, they inevitably ask where to draw the line for “good enough” or to trigger intervention, forcing a binary decision anyway.</p>
<p>In contrast, a binary pass/fail forces evaluators to make a clear judgment: did this output achieve its purpose or not? This clarity extends to measuring progress – a 10% increase in passing outputs is immediately meaningful, while a 0.5-point improvement on a 5-point scale requires interpretation.</p>
<p>I’ve found that teams who resist binary evaluation often do so because they want to capture nuance. But nuance isn’t lost – it’s just moved to the qualitative critique that accompanies the judgment. The critique provides rich context about why something passed or failed, and what specific aspects could be improved, while the binary decision creates actionable clarity about whether improvement is needed at all.</p>
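<p>In data terms, this pattern is simple: a boolean decision plus a free-text critique per example. A minimal sketch (the type and field names are assumptions, not from any specific tool):</p>

```python
# Illustrative sketch: binary judgments paired with free-text critiques.
# Names are assumptions for demonstration, not from any specific tool.
from dataclasses import dataclass

@dataclass
class EvalResult:
    example_id: str
    passed: bool    # the clear, actionable binary decision
    critique: str   # the nuance lives here, not in a 1-5 score

def pass_rate(results: list[EvalResult]) -> float:
    """A 10% lift here is immediately meaningful, unlike +0.5 on a 5-point scale."""
    return sum(r.passed for r in results) / len(results)
```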
</section>
<section id="enhance-binary-judgments-with-detailed-critiques" class="level4">
<h4 class="anchored" data-anchor-id="enhance-binary-judgments-with-detailed-critiques">2. Enhance Binary Judgments With Detailed Critiques</h4>
<p>While binary decisions provide clarity, they work best when paired with detailed critiques that capture the nuance of why something passed or failed. This combination gives you the best of both worlds: clear, actionable metrics and rich contextual understanding.</p>
<p>For example, when evaluating a response that correctly answers a user’s question but contains unnecessary information, a good critique might read:</p>
<blockquote class="blockquote">
<p>“The AI successfully provided the market analysis requested (PASS), but included excessive detail about neighborhood demographics that wasn’t relevant to the investment question. This makes the response longer than necessary and potentially distracting.”</p>
</blockquote>
<p>These critiques serve multiple functions beyond just explanation. They force domain experts to externalize implicit knowledge – I’ve seen legal experts move from vague feelings that something “doesn’t sound right” to articulating specific issues with citation formats or reasoning patterns that can be systematically addressed.</p>
<p>When included as few-shot examples in judge prompts, these critiques improve the LLM’s ability to reason about complex edge cases. I’ve found this approach often yields 15-20% higher agreement rates between human and LLM evaluations compared to prompts without example critiques. The critiques also provide excellent raw material for generating high-quality synthetic data, creating a flywheel for improvement.</p>
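<p>Mechanically, using critiques as few-shot examples is just prompt assembly: interleave past outputs, their pass/fail labels, and the expert critiques before the new output to be judged. A hedged sketch; the exact prompt wording is an assumption:</p>

```python
# Illustrative sketch: embed past human judgments (with critiques) as
# few-shot examples in an LLM judge prompt. Prompt wording is an assumption.

def build_judge_prompt(few_shot: list[dict], new_output: str) -> str:
    """Assemble a judge prompt from labeled examples and the output to grade."""
    examples = "\n\n".join(
        f"Output: {ex['output']}\n"
        f"Judgment: {'PASS' if ex['passed'] else 'FAIL'}\n"
        f"Critique: {ex['critique']}"
        for ex in few_shot
    )
    return (
        "You are evaluating AI responses. Decide PASS or FAIL and explain why.\n\n"
        f"Here are prior expert judgments:\n\n{examples}\n\n"
        f"Now evaluate:\nOutput: {new_output}\nJudgment:"
    )
```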
</section>
<section id="measure-alignment-between-automated-evals-and-human-judgment" class="level4">
<h4 class="anchored" data-anchor-id="measure-alignment-between-automated-evals-and-human-judgment">3. Measure Alignment Between Automated Evals and Human Judgment</h4>
<p>If you’re using LLMs to evaluate outputs (which is often necessary at scale), it’s crucial to regularly check how well these automated evaluations align with human judgment.</p>
<p>This is particularly important given our natural tendency to over-trust AI systems. As Shankar et al.&nbsp;note in <a href="https://arxiv.org/abs/2404.12272">“Who Validates the Validators?”</a>, the lack of tools to validate evaluator quality is concerning:</p>
<blockquote class="blockquote">
<p>“Research shows people tend to over-rely and over-trust AI systems. For instance, in one high profile incident, researchers from MIT posted a pre-print on arXiv claiming that GPT-4 could ace the MIT EECS exam. Within hours, [the] work [was] debunked … citing problems arising from over-reliance on GPT-4 to grade itself.”</p>
</blockquote>
<p>This over-trust problem extends beyond self-evaluation. Research has shown that LLMs can be biased by simple factors like the ordering of options in a set, or even seemingly innocuous formatting changes in prompts. Without rigorous human validation, these biases can silently undermine your evaluation system.</p>
<p>When working with Honeycomb, we tracked agreement rates between our LLM-as-a-judge and Phillip’s evaluations:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/field-guide/images/score.png" class="img-fluid figure-img"></p>
<figcaption>Agreement rates between LLM evaluator and human expert. More details <a href="https://hamel.dev/blog/posts/evals/#automated-evaluation-w-llms">here</a>.</figcaption>
</figure>
</div>
<p>It took three iterations to achieve &gt;90% agreement, but this investment paid off in a system the team could trust. Without this validation step, automated evaluations often drift from human expectations over time, especially as the distribution of inputs changes. You can <a href="https://hamel.dev/blog/posts/evals/#automated-evaluation-w-llms">read more about this here</a>.</p>
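<p>The metric behind that chart is straightforward: on a shared set of examples, the fraction where the LLM judge's label matches the human expert's. A minimal sketch:</p>

```python
# Illustrative sketch: track agreement between an LLM judge and a human
# expert on the same labeled examples, recomputed for each judge iteration.

def agreement_rate(human: list[bool], llm: list[bool]) -> float:
    """Fraction of examples where the LLM judge matched the human label."""
    assert len(human) == len(llm), "labels must cover the same examples"
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)
```

<p>Recomputing this after every change to the judge prompt is what turns “we trust the judge” from a feeling into a number.</p>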
<p>Tools like <a href="https://eugeneyan.com/writing/aligneval/">Eugene Yan’s AlignEval</a> demonstrate this alignment process beautifully. It provides a simple interface where you upload data, label examples with a binary “good” or “bad,” and then evaluate LLM-based judges against those human judgments. What makes it effective is how it streamlines the workflow – you can quickly see where automated evaluations diverge from your preferences, refine your criteria based on these insights, and measure improvement over time. This approach reinforces that alignment isn’t a one-time setup but an ongoing conversation between human judgment and automated evaluation.</p>
</section>
</section>
<section id="scaling-without-losing-trust" class="level3">
<h3 class="anchored" data-anchor-id="scaling-without-losing-trust">Scaling Without Losing Trust</h3>
<p>As your AI system grows, you’ll inevitably face pressure to reduce the human effort involved in evaluation. This is where many teams go wrong – they automate too much, too quickly, and lose the human connection that keeps their evaluations grounded.</p>
<p>The most successful teams take a more measured approach:</p>
<ol type="1">
<li><p><strong>Start with high human involvement</strong>: In the early stages, have domain experts evaluate a significant percentage of outputs.</p></li>
<li><p><strong>Study alignment patterns</strong>: Rather than automating evaluation, focus on understanding where automated evaluations align with human judgment and where they diverge. This helps you identify which types of cases need more careful human attention.</p></li>
<li><p><strong>Use strategic sampling</strong>: Rather than evaluating every output, use statistical techniques to sample outputs that provide the most information, particularly focusing on areas where alignment is weakest.</p></li>
<li><p><strong>Maintain regular calibration</strong>: Even as you scale, continue to compare automated evaluations against human judgment regularly, using these comparisons to refine your understanding of when to trust automated evaluations.</p></li>
</ol>
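<p>Strategic sampling (point 3 above) can be made concrete by weighting a fixed human-review budget toward the categories where the automated judge historically disagrees with humans most. This is an illustrative allocation scheme, not a prescription; the category names and rates are made up for demonstration.</p>

```python
# Illustrative sketch: allocate a fixed human-review budget across
# categories in proportion to historical human/LLM disagreement, so
# reviewers see the cases where the automated judge is least trustworthy.

def review_budget(disagreement: dict[str, float], budget: int) -> dict[str, int]:
    """Split `budget` reviews across categories, weighted by disagreement rate."""
    total = sum(disagreement.values())
    alloc = {c: int(budget * d / total) for c, d in disagreement.items()}
    # Hand leftover reviews (from rounding down) to the worst category.
    leftover = budget - sum(alloc.values())
    worst = max(disagreement, key=disagreement.get)
    alloc[worst] += leftover
    return alloc
```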
<p>Scaling evaluation isn’t just about reducing human effort – it’s about directing that effort where it adds the most value. By focusing human attention on the most challenging or informative cases, you can maintain quality even as your system grows.</p>
<p>Now that we’ve covered how to maintain trust in your evaluations, let’s talk about a fundamental shift in how you should approach AI development roadmaps…</p>
</section>
</section>
<section id="your-ai-roadmap-should-count-experiments-not-features" class="level2">
<h2 class="anchored" data-anchor-id="your-ai-roadmap-should-count-experiments-not-features">6. Your AI Roadmap Should Count Experiments, Not Features</h2>
<p>If you’ve worked in software development, you’re familiar with traditional roadmaps: a list of features with target delivery dates. Teams commit to shipping specific functionality by specific deadlines, and success is measured by how closely they hit those targets.</p>
<p>This approach fails spectacularly with AI.</p>
<p>I’ve watched teams commit to roadmaps like “Launch sentiment analysis by Q2” or “Deploy agent-based customer support by end of year,” only to discover that the technology simply isn’t ready to meet their quality bar. They either ship something subpar to hit the deadline or miss the deadline entirely. Either way, trust erodes.</p>
<p>The fundamental problem is that traditional roadmaps assume we know what’s possible. With conventional software, that’s often true – given enough time and resources, you can build most features reliably. With AI, especially at the cutting edge, you’re constantly testing the boundaries of what’s feasible.</p>
<section id="experiments-vs.-features" class="level3">
<h3 class="anchored" data-anchor-id="experiments-vs.-features">Experiments vs.&nbsp;Features</h3>
<p><a href="https://www.linkedin.com/in/bryan-bischof/">Bryan Bischof</a>, Former Head of AI at Hex, introduced me to what he calls a “capability funnel” approach to AI roadmaps. This strategy reframes how we think about AI development progress.</p>
<p>Instead of defining success as shipping a feature, the capability funnel breaks down AI performance into progressive levels of utility. At the top of the funnel is the most basic functionality – can the system respond at all? At the bottom is fully solving the user’s job to be done. Between these points are various stages of increasing usefulness.</p>
<p>For example, in a query assistant, the capability funnel might look like:</p>
<ol type="1">
<li>Can generate syntactically valid queries (basic functionality)</li>
<li>Can generate queries that execute without errors</li>
<li>Can generate queries that return relevant results</li>
<li>Can generate queries that match user intent</li>
<li>Can generate optimal queries that solve the user’s problem (complete solution)</li>
</ol>
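<p>A funnel like this is easy to instrument: record the deepest stage each test sample reached, then report, for every stage, the fraction of samples that got at least that far. A minimal sketch using the query-assistant stages as illustrative names:</p>

```python
# Illustrative sketch: measure progress through a capability funnel.
# Each sample records the deepest stage it reached; stage names follow
# the query-assistant example and are ordered from basic to complete.

STAGES = ["valid_syntax", "executes", "relevant", "matches_intent", "optimal"]

def funnel_rates(reached: list[str]) -> dict[str, float]:
    """For each stage, the fraction of samples that got at least that far."""
    n = len(reached)
    depth = {s: i for i, s in enumerate(STAGES)}
    return {
        stage: sum(depth[r] >= i for r in reached) / n
        for i, stage in enumerate(STAGES)
    }
```

<p>Watching these per-stage rates over time shows concrete progress (and reveals which stage is the bottleneck) even before the final stage is solved.</p>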
<p>This approach acknowledges that AI progress isn’t binary – it’s about gradually improving capabilities across multiple dimensions. It also provides a framework for measuring progress even when you haven’t reached the final goal.</p>
<p>The most successful teams I’ve worked with structure their roadmaps around experiments rather than features. Instead of committing to specific outcomes, they commit to a cadence of experimentation, learning, and iteration.</p>
<p><a href="https://eugeneyan.com/">Eugene Yan</a>, an applied scientist at Amazon, shared how he approaches ML project planning with leadership - a process that, while originally developed for traditional machine learning, applies equally well to modern LLM development:</p>
<blockquote class="blockquote">
<p>“Here’s a common timeline. First, I take two weeks to do a data feasibility analysis, i.e., ‘do I have the right data?’ […] Then I take an additional month to do a technical feasibility analysis, i.e., ‘can AI solve this?’ After that, if it still works I’ll spend six weeks building a prototype we can A/B test.”</p>
</blockquote>
<p>While LLMs might not require the same kind of feature engineering or model training as traditional ML, the underlying principle remains the same: time-box your exploration, establish clear decision points, and focus on proving feasibility before committing to full implementation. This approach gives leadership confidence that resources won’t be wasted on open-ended exploration, while giving the team the freedom to learn and adapt as they go.</p>
</section>
<section id="the-foundation-evaluation-infrastructure" class="level3">
<h3 class="anchored" data-anchor-id="the-foundation-evaluation-infrastructure">The Foundation: Evaluation Infrastructure</h3>
<p>The key to making an experiment-based roadmap work is having robust evaluation infrastructure. Without it, you’re just guessing whether your experiments are working. With it, you can rapidly iterate, test hypotheses, and build on successes.</p>
<p>I saw this firsthand during the early development of GitHub Copilot. What most people don’t realize is that the team invested heavily in building sophisticated offline evaluation infrastructure. They created systems that could test code completions against a very large corpus of repositories on GitHub, leveraging unit tests that already existed in high-quality codebases as an automated way to verify completion correctness. This was a massive engineering undertaking – they had to build systems that could clone repositories at scale, set up their environments, run their test suites, and analyze the results, all while handling the incredible diversity of programming languages, frameworks, and testing approaches.</p>
<p>This wasn’t wasted time—it was the foundation that accelerated everything. With solid evaluation in place, the team ran thousands of experiments, quickly identified what worked, and could say with confidence “this change improved quality by X%” instead of relying on gut feelings. While the upfront investment in evaluation feels slow, it prevents endless debates about whether changes help or hurt, and dramatically speeds up innovation later.</p>
</section>
<section id="communicating-this-to-stakeholders" class="level3">
<h3 class="anchored" data-anchor-id="communicating-this-to-stakeholders">Communicating This to Stakeholders</h3>
<p>The challenge, of course, is that executives often want certainty. They want to know when features will ship and what they’ll do. How do you bridge this gap?</p>
<p>The key is to shift the conversation from outputs to outcomes. Instead of promising specific features by specific dates, commit to a process that will maximize the chances of achieving the desired business outcomes.</p>
<p>Eugene shared how he handles these conversations:</p>
<blockquote class="blockquote">
<p>“I try to reassure leadership with timeboxes. At the end of three months, if it works out, then we move it to production. At any step of the way, if it doesn’t work out, we pivot.”</p>
</blockquote>
<p>This approach gives stakeholders clear decision points while acknowledging the inherent uncertainty in AI development. It also helps manage expectations about timelines – instead of promising a feature in six months, you’re promising a clear understanding of whether that feature is feasible in three months.</p>
<p>Bryan’s capability funnel approach provides another powerful communication tool. It allows teams to show concrete progress through the funnel stages, even when the final solution isn’t ready. It also helps executives understand where problems are occurring and make informed decisions about where to invest resources.</p>
</section>
<section id="build-a-culture-of-experimentation-through-failure-sharing" class="level3">
<h3 class="anchored" data-anchor-id="build-a-culture-of-experimentation-through-failure-sharing">Build a Culture of Experimentation Through Failure Sharing</h3>
<p>Perhaps the most counterintuitive aspect of this approach is the emphasis on learning from failures. In traditional software development, failures are often hidden or downplayed. In AI development, they’re the primary source of learning.</p>
<p>Eugene operationalizes this at his organization through what he calls a “fifteen-five” – a weekly update that takes fifteen minutes to write and five minutes to read:</p>
<blockquote class="blockquote">
<p>“In my fifteen-fives, I document my failures and my successes. Within our team, we also have weekly ‘no-prep sharing sessions’ where we discuss what we’ve been working on and what we’ve learned. When I do this, I go out of my way to share failures.”</p>
</blockquote>
<p>This practice normalizes failure as part of the learning process. It shows that even experienced practitioners encounter dead ends, and it accelerates team learning by sharing those experiences openly. And by celebrating the process of experimentation rather than just the outcomes, teams create an environment where people feel safe taking risks and learning from failures.</p>
</section>
<section id="a-better-way-forward" class="level3">
<h3 class="anchored" data-anchor-id="a-better-way-forward">A Better Way Forward</h3>
<p>So what does an experiment-based roadmap look like in practice? Here’s a simplified example from a content moderation project Eugene worked on:</p>
<blockquote class="blockquote">
<p>“I was asked to do content moderation. I said, ‘It’s uncertain whether we’ll meet that goal. It’s uncertain even if that goal is feasible with our data, or what machine learning techniques would work. But here’s my experimentation roadmap. Here are the techniques I’m gonna try, and I’m gonna update you at a two-week cadence.’”</p>
</blockquote>
<p>The roadmap didn’t promise specific features or capabilities. Instead, it committed to a systematic exploration of possible approaches, with regular check-ins to assess progress and pivot if necessary.</p>
<p>The results were telling:</p>
<blockquote class="blockquote">
<p>“For the first two to three months, nothing worked. […] And then [a breakthrough] came out. […] Within a month, that problem was solved. So you can see that in the first quarter or even four months, it was going nowhere. […] But then you can also see that all of a sudden, some new technology comes along, some new paradigm, some new reframing comes along that just [solves] 80% of [the problem].”</p>
</blockquote>
<p>This pattern – long periods of apparent failure followed by breakthroughs – is common in AI development. Traditional feature-based roadmaps would have killed the project after months of “failure,” missing the eventual breakthrough.</p>
<p>By focusing on experiments rather than features, teams create space for these breakthroughs to emerge. They also build the infrastructure and processes that make breakthroughs more likely – data pipelines, evaluation frameworks, and rapid iteration cycles.</p>
<p>The most successful teams I’ve worked with start by building evaluation infrastructure before committing to specific features. They create tools that make iteration faster and focus on processes that support rapid experimentation. This approach might seem slower at first, but it dramatically accelerates development in the long run by enabling teams to learn and adapt quickly.</p>
<p>The key metric for AI roadmaps isn’t features shipped – it’s experiments run. The teams that win are those that can run more experiments, learn faster, and iterate more quickly than their competitors. And the foundation for this rapid experimentation is always the same: robust, trusted evaluation infrastructure that gives everyone confidence in the results.</p>
<p>By reframing your roadmap around experiments rather than features, you create the conditions for similar breakthroughs in your own organization.</p>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Throughout this post, I’ve shared patterns I’ve observed across dozens of AI implementations. The most successful teams aren’t the ones with the most sophisticated tools or the most advanced models – they’re the ones that master the fundamentals of measurement, iteration, and learning.</p>
<p>The core principles are surprisingly simple:</p>
<ol type="1">
<li><p><strong>Look at your data.</strong> Nothing replaces the insight gained from examining real examples. Error analysis consistently reveals the highest-ROI improvements.</p></li>
<li><p><strong>Build simple tools that remove friction.</strong> Custom data viewers that make it easy to examine AI outputs yield more insights than complex dashboards with generic metrics.</p></li>
<li><p><strong>Empower domain experts.</strong> The people who understand your domain best are often the ones who can most effectively improve your AI, regardless of their technical background.</p></li>
<li><p><strong>Use synthetic data strategically.</strong> You don’t need real users to start testing and improving your AI. Thoughtfully generated synthetic data can bootstrap your evaluation process.</p></li>
<li><p><strong>Maintain trust in your evaluations.</strong> Binary judgments with detailed critiques create clarity while preserving nuance. Regular alignment checks ensure automated evaluations remain trustworthy.</p></li>
<li><p><strong>Structure roadmaps around experiments, not features.</strong> Commit to a cadence of experimentation and learning rather than specific outcomes by specific dates.</p></li>
</ol>
<p>These principles apply regardless of your domain, team size, or technical stack. They’ve worked for companies ranging from early-stage startups to tech giants, across use cases from customer support to code generation.</p>
<section id="resources-for-going-deeper" class="level3">
<h3 class="anchored" data-anchor-id="resources-for-going-deeper">Resources for Going Deeper</h3>
<p>If you’d like to explore these topics further, here are some resources that might help:</p>
<ul>
<li><p><a href="https://ai.hamel.dev/">My blog</a> for more content on AI evaluation and improvement. My other posts dive into more technical detail on topics such as constructing effective LLM judges, implementing evaluation systems, and other aspects of AI development<sup>1</sup>. Also check out the blogs of <a href="https://www.sh-reya.com/">Shreya Shankar</a> and <a href="https://eugeneyan.com/">Eugene Yan</a>, who are great sources of information on these topics.</p></li>
<li><p>A course I’m teaching: <strong><a href="https://bit.ly/evals-ai">Rapidly Improve AI Products With Evals</a></strong>, with Shreya Shankar. The course provides hands-on experience with techniques such as error analysis, synthetic data generation, and building trustworthy evaluation systems. It includes practical exercises and personalized instruction through office hours.</p></li>
<li><p>If you’re looking for hands-on guidance specific to your organization’s needs, you can learn more about working with me at <a href="https://parlance-labs.com/">Parlance Labs</a>.</p></li>
</ul>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I write more broadly about machine learning, AI, and software development. Some posts that expand on these topics include <a href="https://hamel.dev/blog/posts/evals/">Your AI Product Needs Evals</a>, <a href="https://hamel.dev/blog/posts/llm-judge/">Creating a LLM-as-a-Judge That Drives Business Results</a>, and <a href="https://applied-llms.org/">What We’ve Learned From A Year of Building with LLMs</a>. You can see all my posts at <a href="https://hamel.dev/">hamel.dev</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <category>AI</category>
  <guid>https://hamel.dev/blog/posts/field-guide/</guid>
  <pubDate>Mon, 24 Mar 2025 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/field-guide/images/field_guide_2.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Thoughts On A Month With Devin</title>
  <link>https://hamel.dev/</link>
  <description>Our impressions of Devin after giving it 20+ tasks.
</description>
  <category>LLMs</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Sun, 19 Jan 2025 08:00:00 GMT</pubDate>
</item>
<item>
  <title>nbsanity - Share Notebooks as Polished Web Pages in Seconds</title>
  <link>https://hamel.dev/</link>
  <description>Transform your GitHub Jupyter notebooks into beautiful, readable web pages with a single URL change. No setup required.
</description>
  <category>Jupyter</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Fri, 13 Dec 2024 08:00:00 GMT</pubDate>
</item>
<item>
  <title>Building an Audience Through Technical Writing: Strategies and Mistakes</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/audience/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>People often find me through my writing on AI and tech. This creates an interesting pattern. Nearly every week, vendors reach out asking me to write about their products. While I appreciate their interest and love learning about new tools, I reserve my writing for topics that I have personal experience with.</p>
<p>One conversation last week really stuck with me. A founder confided, “We can write the best content in the world, but we don’t have any distribution.” This hit home because I used to think the same way.</p>
<p>Let me share what works for reaching developers. Companies and individuals alike often skip the basics when trying to grow their audience. These are proven approaches I’ve seen succeed, both in my work and in others’ efforts to grow their audience in the AI space.</p>
<section id="build-on-great-work" class="level2">
<h2 class="anchored" data-anchor-id="build-on-great-work">1. Build on Great Work</h2>
<p>Here’s something surprising: few people take the time to thoughtfully engage with others’ work in our field. But when you do, amazing things happen naturally.</p>
<p>For example, here are some recent posts I’ve enjoyed that present opportunities to engage with others:</p>
<ul>
<li>Shreya Shankar’s <a href="https://data-people-group.github.io/blogs/2024/09/24/docetl/">DocETL</a></li>
<li>Eugene Yan’s work on <a href="https://eugeneyan.com/writing/aligneval/">AlignEval</a></li>
<li>Ben Clavié’s work on <a href="https://www.answer.ai/posts/2024-09-16-rerankers.html">rerankers</a></li>
<li>Jeremy Howard’s work on <a href="https://www.answer.ai/posts/2024-09-03-llmstxt.html">llms.txt</a></li>
</ul>
<p>In the above examples, you could share how their ideas connect with what you’ve built. You could add additional case studies and real-world insights. If you deeply engage with someone’s work and add your insights, they often share your content with their audience. Not because you asked, but because you’ve added something meaningful to their work. Swyx has written a <a href="https://www.swyx.io/puwtpd">great post</a> on how to do this effectively.</p>
<p>The key is authenticity. Don’t do this just for marketing—do it because you’re genuinely interested in learning from others and building on their ideas. It’s not hard to find things to be excited about. I’m amazed by how few people take this approach. It’s both effective and fun.</p>
</section>
<section id="show-up-consistently" class="level2">
<h2 class="anchored" data-anchor-id="show-up-consistently">2. Show Up Consistently</h2>
<p>I see too many folks blogging or posting once every few months and wondering why they’re not getting traction. Want to know what actually works? Look at <a href="https://x.com/jxnlco">Jason Liu</a>. He grew from 500 to 30,000 followers by posting ~30 times a day for a year.</p>
<p>You don’t have to post that often (I certainly don’t!), but consistency matters more than perfection. And don’t just post into the void. Engage with others. When someone comments on your post, reply thoughtfully. When you see conversations where you can add value, provide helpful information.</p>
<p>Finally, don’t be discouraged if you don’t see results immediately. Here’s some advice from my friend (and prolific writer), <a href="https://eugeneyan.com/">Eugene Yan</a>:</p>
<blockquote class="blockquote">
<p>In the beginning, when most people start writing, the output’s gonna suck. Harsh, but true—my first 100 posts or so were crap. But with practice, people can get better. But they have to be deliberate in wanting to practice and get better with each piece, and not just write for the sake of publishing something and tweeting about it. The Sam Parr course (see below) is a great example of deliberate practice on copywriting.</p>
</blockquote>
</section>
<section id="get-better-at-copywriting" class="level2">
<h2 class="anchored" data-anchor-id="get-better-at-copywriting">3. Get Better at Copywriting</h2>
<p>This changed everything for me. I took <a href="https://copythat.com/">Sam Parr’s copywriting course</a> just 30 minutes a day for a week. Now I keep my favorite writing samples in a Claude project and reference them when I’m writing something important. Small improvements in how you communicate can make a huge difference in how your content lands.</p>
<p>One thing Sam teaches is that big words don’t make you sound smart. Clear writing that avoids jargon is more effective. That’s why Sam teaches aiming for a 6th-grade reading level. This matters even more with AI, as AI loves to generate flowery language and long sentences. The <a href="https://hemingwayapp.com/">Hemingway App</a> can help you simplify your writing.<sup>1</sup></p>
</section>
<section id="build-a-voice-to-content-pipeline" class="level2">
<h2 class="anchored" data-anchor-id="build-a-voice-to-content-pipeline">4. Build a Voice-to-Content Pipeline</h2>
<p>The struggle most people have with creating content is that it takes too much time. But it doesn’t have to if you build the right systems, especially with AI.</p>
<p>Getting this system right takes some upfront work, but the payoff is enormous. Start by installing a good voice-to-text app on your phone. I use either <a href="https://superwhisper.com/">Superwhisper</a> or <a href="https://voicepal.me/">VoicePal</a>. VoicePal is great for prompting you to elaborate with follow-up questions. These tools let me capture ideas at their best. That’s usually when I’m walking outside or away from my computer. At my computer, I use <a href="https://www.flowvoice.ai/">Flow</a>.</p>
<p>The key is to carefully craft your first few pieces of content. These become examples for your prompts that teach AI your style and tone. Once you have high-quality examples, you can organize these (transcript, content) pairs and feed them to language models. The in-context learning creates remarkably aligned output that matches your writing style while maintaining the authenticity of your original thoughts.</p>
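<p>As a minimal sketch of that in-context learning step (the prompt wording and example pair below are illustrative, not from any specific tool), you can assemble past (transcript, content) pairs into a few-shot prompt and send it to whichever LLM you use:</p>

```python
# Hypothetical sketch: build a few-shot prompt from (transcript, content)
# pairs so an LLM drafts new posts in your voice. Wire the resulting
# prompt to your LLM provider of choice.

def build_prompt(examples, new_transcript):
    """Assemble a few-shot prompt from past (transcript, final post) pairs."""
    parts = ["Rewrite the final transcript into a polished post, "
             "matching the style of the examples.\n"]
    for i, (transcript, content) in enumerate(examples, 1):
        parts.append(f"### Example {i}\nTranscript:\n{transcript}\n\nPost:\n{content}\n")
    # The new transcript goes last; the model completes the "Post:" section.
    parts.append(f"### New\nTranscript:\n{new_transcript}\n\nPost:")
    return "\n".join(parts)

examples = [
    ("uh so the key thing about evals is you gotta look at data",
     "The key to evals is simple: look at your data."),
]
prompt = build_prompt(examples, "i think consistency beats perfection when posting")
```

<p>The more high-quality pairs you accumulate, the more aligned the output gets, since each pair teaches the model a little more about your voice.</p>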
<p>For example, I use this pipeline at Answer AI. We have started interviewing each other and using the recordings as grounding for blog posts. Our recent <a href="https://www.answer.ai/posts/2024-11-07-solveit.html">post about SolveIt</a> shows this in action. The raw conversation is the foundation. Our workflow turns it into polished content.</p>
<p>I’ve also integrated this workflow into my meetings. Using <a href="https://circleback.ai/?via=hamel">CircleBack</a>, my favorite AI note-taking app, I can automatically capture and process meeting discussions. You can set up workflows to send your meeting notes and transcripts to AI for processing. This turns conversations into content opportunities.</p>
<p>The real power comes from having all these pieces working together. Voice capture, AI, and automation make content creation fun and manageable.</p>
</section>
<section id="leverage-your-unique-perspective" class="level2">
<h2 class="anchored" data-anchor-id="leverage-your-unique-perspective">5. Leverage Your Unique Perspective</h2>
<p>Through my consulting work, I notice patterns that others miss. My most popular posts address common problems my clients had. When everyone’s confused about a topic, especially in AI where there’s lots of hype, clear explanations are gold. This is the motivation for some of my blog posts like:</p>
<ul>
<li><a href="https://hamel.dev/blog/posts/prompt/">Fuck You, Show Me The Prompt</a></li>
<li><a href="https://hamel.dev/blog/posts/evals/">Your AI Product Needs Evals</a></li>
<li><a href="https://hamel.dev/blog/posts/llm-judge/">Creating a LLM-as-a-Judge That Drives Business Results</a></li>
</ul>
<p>You probably see patterns too. Maybe it’s common questions from customers, or problems you’ve solved repeatedly. Maybe you work with a unique set of technologies or interesting use cases. Share these insights! Your unique perspective is more valuable than you think.</p>
</section>
<section id="use-high-quality-social-cards-threads-and-scheduling" class="level2">
<h2 class="anchored" data-anchor-id="use-high-quality-social-cards-threads-and-scheduling">6. Use High Quality Social Cards, Threads, and Scheduling</h2>
<p>This is probably the least important part of the process, but it’s still important. Thumbnails and social cards are vital for visibility on social media. Here are the tools I use:</p>
<ul>
<li><a href="https://socialsharepreview.com/">socialsharepreview.com</a> to check how your content looks on different platforms. For X, I sometimes use the <a href="https://cards-dev.twitter.com/validator">Twitter Card Validator</a>.</li>
<li><a href="https://chatgpt.com/">ChatGPT</a> to create cover images for my posts. Then, I paste them into Canva to size and edit them. Some of my friends use <a href="https://ideogram.ai/">ideogram</a>, which generates images with text accurately.</li>
<li><a href="https://www.canva.com/">Canva</a> for the last mile of creating social cards. They have easy-to-use buttons to ensure you get the dimensions right. They also have inpainting, background removal, and more.</li>
<li>If using X, social cards can be a bit fiddly. As of this writing, they do not show your post title, just the image if using the large-image size. To mitigate this, I use Canva to write the post’s title in the image <a href="https://hamel.dev/blog/posts/audience/content_2.png">like this</a>.</li>
<li>Social media can be distracting, so I like to schedule my posts in advance. I use <a href="https://typefully.com/">typefully</a> for this purpose. Some of my friends use <a href="https://hypefury.com/">hypefury</a>.</li>
</ul>
<p>Finally, when posting on X, threads can be a great way to raise the visibility of your content. A simple approach is to take screenshots or copy-paste snippets of your content. Then, walk through them in a thread, as you would want a reader to. Jeremy Howard does a great job at this: <a href="https://x.com/jeremyphoward/status/1818036923304456492">example 1</a>, <a href="https://x.com/jeremyphoward/status/1831089138571133290">example 2</a>.</p>
</section>
<section id="the-content-flywheel-putting-it-all-together" class="level2">
<h2 class="anchored" data-anchor-id="the-content-flywheel-putting-it-all-together">The Content Flywheel: Putting It All Together</h2>
<p>Once you have these systems in place, something magical happens: content creates more content. Your blog posts spawn social media updates. Your conversations turn into newsletters. Your client solutions become case studies. Each piece of work feeds the next, creating a natural flywheel.</p>
<p>Don’t try to sell too hard. Instead, share real insights and helpful information. Focus on adding value and educating your audience. When you do this well, people will want to follow your work.</p>
<p>This journey is different for everyone. These are just the patterns I’ve seen work in my consulting practice and my own growth. Try what feels right. Adjust what doesn’t.</p>
<p>P.S. If you’d like to follow my writing journey, you can <a href="https://ai.hamel.dev/">stay connected here</a>.</p>
</section>
<section id="further-reading" class="level2">
<h2 class="anchored" data-anchor-id="further-reading">Further Reading</h2>
<ul>
<li><a href="https://simonwillison.net/tags/writing/">Simon Willison’s Posts on Writing</a></li>
<li><a href="https://eugeneyan.com/tag/writing/">Eugene’s Posts on Writing</a></li>
<li><a href="https://medium.com/@racheltho/why-you-yes-you-should-blog-7d2544ac1045">Why you, (yes, you) should blog</a></li>
</ul>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Don’t abuse these tools or use them blindly. There are <a href="https://x.com/swyx/status/1863352038597558712">plenty of situations where you should not be writing at a 6th-grade reading level</a>. This includes humor, poetry, shitposting, and more. Even formal writing shouldn’t adhere strictly to this rule. It’s advice that you should judge on a case-by-case basis. When you simplify your writing, do you like it more?↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Writing</category>
  <guid>https://hamel.dev/blog/posts/audience/</guid>
  <pubDate>Sat, 30 Nov 2024 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/audience/content_2.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Using LLM-as-a-Judge For Evaluation: A Complete Guide</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/llm-judge/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>Earlier this year, I wrote <a href="https://hamel.dev/blog/posts/evals/">Your AI product needs evals</a>. Many of you asked, “How do I get started with LLM-as-a-judge?” This guide shares what I’ve learned after helping over <a href="https://parlance-labs.com/">30 companies</a> set up their evaluation systems.</p>
<section id="the-problem-ai-teams-are-drowning-in-data" class="level2">
<h2 class="anchored" data-anchor-id="the-problem-ai-teams-are-drowning-in-data">The Problem: AI Teams Are Drowning in Data</h2>
<p>Ever spend weeks building an AI system, only to realize you have no idea if it’s actually working? You’re not alone. I’ve noticed teams repeat the same mistakes when using LLMs to evaluate AI outputs:</p>
<ol type="1">
<li><strong>Too Many Metrics</strong>: Creating numerous measurements that become unmanageable.</li>
<li><strong>Arbitrary Scoring Systems</strong>: Using uncalibrated scales (like 1-5) across multiple dimensions, where the difference between scores is unclear and subjective. What makes something a 3 versus a 4? Nobody knows, and different evaluators often interpret these scales differently.</li>
<li><strong>Ignoring Domain Experts</strong>: Not involving the people who understand the subject matter deeply.</li>
<li><strong>Unvalidated Metrics</strong>: Using measurements that don’t truly reflect what matters to the users or the business.</li>
</ol>
<p>The result? Teams end up buried under mountains of metrics or data they don’t trust and can’t use. Progress grinds to a halt. Everyone gets frustrated.</p>
<p>For example, it’s not uncommon for me to see dashboards that look like this:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/llm-judge/blog_header.png" class="img-fluid figure-img"></p>
<figcaption>An illustrative example of a bad eval dashboard</figcaption>
</figure>
</div>
<p>Tracking a bunch of scores on a 1-5 scale is often a sign of a bad eval process (I’ll discuss why later). In this post, I’ll show you how to avoid these pitfalls. The solution is to use a technique that I call <strong>“Critique Shadowing”</strong>. Here’s how to do it, step by step.</p>
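<p>To make the contrast concrete before walking through the steps, here is a minimal sketch of the output shape a binary judge produces: a single pass/fail verdict plus a written critique. The prompt wording and JSON schema here are my own illustrative assumptions, not a prescribed format; you would send the prompt to your LLM provider and parse the reply.</p>

```python
# Illustrative sketch of a binary judge: one pass/fail verdict plus a
# detailed critique, instead of uncalibrated 1-5 scores. The prompt text
# and JSON shape are assumptions; adapt them to your own domain.
import json

JUDGE_PROMPT = """You are the principal domain expert. Judge the AI response.
Reply with JSON: {{"critique": "<detailed reasoning>", "pass": true/false}}

User query: {query}
AI response: {response}"""

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Parse the judge's JSON reply into (passed, critique)."""
    data = json.loads(raw)
    return bool(data["pass"]), data["critique"]

# Canned judge reply for illustration; in practice, raw would come from
# calling your LLM with JUDGE_PROMPT.format(query=..., response=...).
raw = '{"critique": "The answer ignores the user\'s second question.", "pass": false}'
passed, critique = parse_verdict(raw)
```

<p>Note that the critique, not the boolean, is what drives iteration: it tells you <em>why</em> something failed, which a 3-versus-4 score never does.</p>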
</section>
<section id="step-1-find-the-principal-domain-expert" class="level2">
<h2 class="anchored" data-anchor-id="step-1-find-the-principal-domain-expert">Step 1: Find <em>The</em> Principal Domain Expert</h2>
<p>In most organizations, there is usually one key individual (maybe two) whose judgment is crucial for the success of your AI product. These are the people with deep domain expertise, or who best represent your target users. Identifying and involving this <strong>Principal Domain Expert</strong> early in the process is critical.</p>
<p><strong>Why is finding the right domain expert so important?</strong></p>
<ul>
<li><p><strong>They Set the Standard</strong>: This person not only defines what is acceptable technically, but also helps you understand if you’re building something users actually want.</p></li>
<li><p><strong>Capture Unspoken Expectations</strong>: By involving them, you uncover their preferences and expectations, which they might not be able to fully articulate upfront. Through the evaluation process, you help them clarify what a “passable” AI interaction looks like.</p></li>
<li><p><strong>Consistency in Judgment</strong>: People in your organization may have different opinions about the AI’s performance. Focusing on the principal expert ensures that evaluations are consistent and aligned with the most critical standards.</p></li>
<li><p><strong>Sense of Ownership</strong>: Involving the expert gives them a stake in the AI’s development. They feel invested because they’ve had a hand in shaping it. In the end, they are more likely to approve of the AI.</p></li>
</ul>
<p><strong>Examples of Principal Domain Experts:</strong></p>
<ul>
<li>A <strong>psychologist</strong> for a mental health AI assistant.</li>
<li>A <strong>lawyer</strong> for an AI that analyzes legal documents.</li>
<li>A <strong>customer service director</strong> for a support chatbot.</li>
<li>A <strong>lead teacher or curriculum developer</strong> for an educational AI tool.</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Exceptions
</div>
</div>
<div class="callout-body-container callout-body">
<p>In a smaller company, this might be the CEO or founder. If you are an independent developer, you should be the domain expert (but be honest with yourself about your expertise).</p>
<p>If you must rely on leadership, you should regularly validate their assumptions against real user feedback.</p>
</div>
</div>
<p>Many developers attempt to act as the domain expert themselves, or find a convenient proxy (ex: their superior). This is a recipe for disaster. People will have varying opinions about what is acceptable, and you can’t make everyone happy. What’s important is that your principal domain expert is satisfied.</p>
<p><strong>Remember:</strong> This doesn’t have to take a lot of the domain expert’s time. Later in this post, I’ll discuss how you can make the process efficient. Their involvement is absolutely critical to the AI’s success.</p>
<section id="next-steps" class="level3">
<h3 class="anchored" data-anchor-id="next-steps">Next Steps</h3>
<p>Once you’ve found your expert, we need to give them the right data to review. Let’s talk about how to do that next.</p>
</section>
</section>
<section id="step-2-create-a-dataset" class="level2">
<h2 class="anchored" data-anchor-id="step-2-create-a-dataset">Step 2: Create a Dataset</h2>
<p>With your principal domain expert on board, the next step is to build a dataset that captures problems that your AI will encounter. It’s important that the dataset is diverse and represents the types of interactions that your AI will have in production.</p>
<section id="why-a-diverse-dataset-matters" class="level3">
<h3 class="anchored" data-anchor-id="why-a-diverse-dataset-matters">Why a Diverse Dataset Matters</h3>
<ul>
<li><strong>Comprehensive Testing</strong>: Ensures your AI is evaluated across a wide range of situations.</li>
<li><strong>Realistic Interactions</strong>: Reflects actual user behavior for more relevant evaluations.</li>
<li><strong>Identifies Weaknesses</strong>: Helps uncover areas where the AI may struggle or produce errors.</li>
</ul>
</section>
<section id="dimensions-for-structuring-your-dataset" class="level3">
<h3 class="anchored" data-anchor-id="dimensions-for-structuring-your-dataset">Dimensions for Structuring Your Dataset</h3>
<p>You want to define dimensions that make sense for your use case. For example, here are ones that I often use for B2C applications:</p>
<ol type="1">
<li><strong>Features</strong>: Specific functionalities of your AI product.</li>
<li><strong>Scenarios</strong>: Situations or problems the AI may encounter and needs to handle.</li>
<li><strong>Personas</strong>: Representative user profiles with distinct characteristics and needs.</li>
</ol>
</section>
<section id="examples-of-features-scenarios-and-personas" class="level3">
<h3 class="anchored" data-anchor-id="examples-of-features-scenarios-and-personas">Examples of Features, Scenarios, and Personas</h3>
<section id="features" class="level4">
<h4 class="anchored" data-anchor-id="features">Features</h4>
<table class="caption-top table">
<colgroup>
<col style="width: 30%">
<col style="width: 69%">
</colgroup>
<thead>
<tr class="header">
<th><strong>Feature</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Email Summarization</strong></td>
<td>Condensing lengthy emails into key points.</td>
</tr>
<tr class="even">
<td><strong>Meeting Scheduler</strong></td>
<td>Automating the scheduling of meetings across time zones.</td>
</tr>
<tr class="odd">
<td><strong>Order Tracking</strong></td>
<td>Providing shipment status and delivery updates.</td>
</tr>
<tr class="even">
<td><strong>Contact Search</strong></td>
<td>Finding and retrieving contact information from a database.</td>
</tr>
<tr class="odd">
<td><strong>Language Translation</strong></td>
<td>Translating text between languages.</td>
</tr>
<tr class="even">
<td><strong>Content Recommendation</strong></td>
<td>Suggesting articles or products based on user interests.</td>
</tr>
</tbody>
</table>
</section>
<section id="scenarios" class="level4">
<h4 class="anchored" data-anchor-id="scenarios">Scenarios</h4>
<p>Scenarios are situations the AI needs to handle (not categories based on the outcome of the AI’s response).</p>
<table class="caption-top table">
<colgroup>
<col style="width: 31%">
<col style="width: 68%">
</colgroup>
<thead>
<tr class="header">
<th><strong>Scenario</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Multiple Matches Found</strong></td>
<td>User’s request yields multiple results that need narrowing down. For example: User asks “Where’s my order?” but has three active orders (#123, #124, #125). AI must help identify which specific order they’re asking about.</td>
</tr>
<tr class="even">
<td><strong>No Matches Found</strong></td>
<td>User’s request yields no results, requiring alternatives or corrections. For example: User searches for order #ABC-123 which doesn’t exist. AI should explain valid order formats and suggest checking their confirmation email.</td>
</tr>
<tr class="odd">
<td><strong>Ambiguous Request</strong></td>
<td>User input lacks necessary specificity. For example: User says “I need to change my delivery” without specifying which order or what aspect of delivery (date, address, etc.) they want to change.</td>
</tr>
<tr class="even">
<td><strong>Invalid Data Provided</strong></td>
<td>User provides incorrect data type or format. For example: User tries to track a return using a regular order number instead of a return authorization (RMA) number.</td>
</tr>
<tr class="odd">
<td><strong>System Errors</strong></td>
<td>Technical issues prevent normal operation. For example: While looking up an order, the inventory database is temporarily unavailable. AI needs to explain the situation and provide alternatives.</td>
</tr>
<tr class="even">
<td><strong>Incomplete Information</strong></td>
<td>User omits required details. For example: User wants to initiate a return but hasn’t provided the order number or reason. AI needs to collect this information step by step.</td>
</tr>
<tr class="odd">
<td><strong>Unsupported Feature</strong></td>
<td>User requests functionality that doesn’t exist. For example: User asks to change payment method after order has shipped. AI must explain why this isn’t possible and suggest alternatives.</td>
</tr>
</tbody>
</table>
</section>
<section id="personas" class="level4">
<h4 class="anchored" data-anchor-id="personas">Personas</h4>
<table class="caption-top table">
<colgroup>
<col style="width: 27%">
<col style="width: 72%">
</colgroup>
<thead>
<tr class="header">
<th><strong>Persona</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>New User</strong></td>
<td>Unfamiliar with the system; requires guidance.</td>
</tr>
<tr class="even">
<td><strong>Expert User</strong></td>
<td>Experienced; expects efficiency and advanced features.</td>
</tr>
<tr class="odd">
<td><strong>Non-Native Speaker</strong></td>
<td>May have language barriers; uses non-standard expressions.</td>
</tr>
<tr class="even">
<td><strong>Busy Professional</strong></td>
<td>Values quick, concise responses; often multitasking.</td>
</tr>
<tr class="odd">
<td><strong>Technophobe</strong></td>
<td>Uncomfortable with technology; needs simple instructions.</td>
</tr>
<tr class="even">
<td><strong>Elderly User</strong></td>
<td>May not be tech-savvy; requires patience and clear guidance.</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="this-taxonomy-is-not-universal" class="level3">
<h3 class="anchored" data-anchor-id="this-taxonomy-is-not-universal">This taxonomy is not universal</h3>
<p>This taxonomy (features, scenarios, personas) is not universal. For example, it may not make sense to even have personas if users aren’t directly engaging with your AI. The idea is you should outline dimensions that make sense for your use case and generate data that covers them. You’ll likely refine these after the first round of evaluations.</p>
</section>
<section id="generating-data" class="level3">
<h3 class="anchored" data-anchor-id="generating-data">Generating Data</h3>
<p>To build your dataset, you can:</p>
<ul>
<li><strong>Use Existing Data</strong>: Sample real user interactions or behaviors from your AI system.</li>
<li><strong>Generate Synthetic Data</strong>: Use LLMs to create realistic user inputs covering various features, scenarios, and personas.</li>
</ul>
<p>Often, you’ll do a combination of both to ensure comprehensive coverage. Synthetic data is not as good as real data, but it’s a good starting point. Also, we are only using LLMs to generate the user inputs, not the LLM responses or internal system behavior.</p>
<p>Regardless of whether you use existing data or synthetic data, you want good coverage across the dimensions you’ve defined.</p>
<p><strong>Incorporating System Information</strong></p>
<p>When making test data, use your APIs and databases where appropriate. This will create realistic data and trigger the right scenarios. Sometimes you’ll need to write simple programs to get this information. That’s what the “Assumptions” column is referring to in the examples below.</p>
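<p>The coverage idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not code from a real system: the feature/scenario/persona lists and the prompt template are placeholders you would replace with your own taxonomy.</p>

```python
from itertools import product

# Hypothetical dimensions -- replace with the taxonomy for your own use case.
features = ["Order Tracking", "Contact Search", "Meeting Scheduler"]
scenarios = ["Multiple Matches Found", "No Matches Found", "Ambiguous Request"]
personas = ["New User", "Expert User", "Busy Professional"]

def generation_prompt(feature: str, scenario: str, persona: str) -> str:
    """Build an LLM prompt asking for one synthetic user input."""
    return (
        f"Generate a realistic user input for the '{feature}' feature. "
        f"The situation is '{scenario}' and the user is a '{persona}'. "
        "Return only the user's message, nothing else."
    )

# One prompt per combination guarantees at least one example for every cell.
prompts = [generation_prompt(f, s, p) for f, s, p in product(features, scenarios, personas)]
print(len(prompts))  # 27 combinations for this toy taxonomy
```

Enumerating the full cross product is the simplest way to verify you have no empty cells; in practice you would generate several inputs per cell and prune combinations that don't make sense.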
</section>
<section id="example-llm-prompts-for-generating-user-inputs" class="level3">
<h3 class="anchored" data-anchor-id="example-llm-prompts-for-generating-user-inputs">Example LLM Prompts for Generating User Inputs</h3>
<p>Here are some example prompts that illustrate how to use an LLM to generate synthetic <strong>user inputs</strong> for different combinations of features, scenarios, and personas:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 3%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 10%">
<col style="width: 35%">
<col style="width: 26%">
</colgroup>
<thead>
<tr class="header">
<th><strong>ID</strong></th>
<th><strong>Feature</strong></th>
<th><strong>Scenario</strong></th>
<th><strong>Persona</strong></th>
<th><strong>LLM Prompt to Generate User Input</strong></th>
<th>Assumptions (not directly in the prompt)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td><strong>Order Tracking</strong></td>
<td>Invalid Data Provided</td>
<td>Frustrated Customer</td>
<td>“Generate a user input from someone who is clearly irritated and impatient, using short, terse language to demand information about their order status for order number <strong>#1234567890</strong>. Include hints of previous negative experiences.”</td>
<td>Order number <strong>#1234567890</strong> does <strong>not</strong> exist in the system.</td>
</tr>
<tr class="even">
<td>2</td>
<td><strong>Contact Search</strong></td>
<td>Multiple Matches Found</td>
<td>New User</td>
<td>“Create a user input from someone who seems unfamiliar with the system, using hesitant language and asking for help to find contact information for a person named ‘Alex’. The user should appear unsure about what information is needed.”</td>
<td>Multiple contacts named ‘Alex’ exist in the system.</td>
</tr>
<tr class="odd">
<td>3</td>
<td><strong>Meeting Scheduler</strong></td>
<td>Ambiguous Request</td>
<td>Busy Professional</td>
<td>“Simulate a user input from someone who is clearly in a hurry, using abbreviated language and minimal details to request scheduling a meeting. The message should feel rushed and lack specific information.”</td>
<td>N/A</td>
</tr>
<tr class="even">
<td>4</td>
<td><strong>Content Recommendation</strong></td>
<td>No Matches Found</td>
<td>Expert User</td>
<td>“Produce a user input from someone who demonstrates in-depth knowledge of their industry, using specific terminology to request articles on sustainable supply chain management. Use the information in this article involving sustainable supply chain management to formulate a plausible query: {{article}}”</td>
<td>No articles on ‘Emerging trends in sustainable supply chain management’ exist in the system.</td>
</tr>
</tbody>
</table>
</section>
<section id="generating-synthetic-data" class="level3">
<h3 class="anchored" data-anchor-id="generating-synthetic-data">Generating Synthetic Data</h3>
<p>When generating synthetic data, you only need to create the user inputs. You then feed these inputs into your AI system to generate the AI’s responses. It’s important that you log everything so you can evaluate your AI. To recap, here’s the process:</p>
<ol type="1">
<li><strong>Generate User Inputs</strong>: Use the LLM prompts to create realistic user inputs.</li>
<li><strong>Feed Inputs into Your AI System</strong>: Input the user interactions into your AI as it currently exists.</li>
<li><strong>Capture AI Responses</strong>: Record the AI’s responses to form complete interactions.</li>
<li><strong>Organize the Interactions</strong>: Create a table to store the user inputs, AI responses, and relevant metadata.</li>
</ol>
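<p>As a sketch, the four steps above might look like the following. <code>generate_user_input</code> and <code>run_ai_system</code> are hypothetical stand-ins for your LLM call and your actual application; the point is the shape of the logged record, not these function names.</p>

```python
import json
from datetime import datetime, timezone

def generate_user_input(feature, scenario, persona):
    # Stand-in for an LLM call that returns a synthetic user message.
    return f"[synthetic input for {feature}/{scenario}/{persona}]"

def run_ai_system(user_input):
    # Stand-in for your real application, invoked exactly as production would invoke it.
    return f"[AI response to: {user_input}]"

def build_interaction(feature, scenario, persona):
    user_input = generate_user_input(feature, scenario, persona)   # step 1
    ai_response = run_ai_system(user_input)                        # steps 2-3
    return {                                                       # step 4: one row per interaction
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "feature": feature,
        "scenario": scenario,
        "persona": persona,
        "user_input": user_input,
        "ai_response": ai_response,
        "judgment": None,   # filled in later by the domain expert
        "critique": None,
    }

row = build_interaction("Order Tracking", "No Matches Found", "New User")
print(json.dumps(row, indent=2))
```

Keeping the dimension labels on every row is what makes the error analysis in Step 6 possible later.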
<section id="how-much-data-should-you-generate" class="level4">
<h4 class="anchored" data-anchor-id="how-much-data-should-you-generate">How much data should you generate?</h4>
<p>There is no right answer here. At a minimum, you want to generate enough data so that you have examples for each combination of dimensions (in this toy example: features, scenarios, and personas). However, you also want to keep generating more data until you feel like you have stopped seeing new failure modes. The amount of data I generate varies significantly depending on the use case.</p>
</section>
<section id="does-synthetic-data-actually-work" class="level4">
<h4 class="anchored" data-anchor-id="does-synthetic-data-actually-work">Does synthetic data actually work?</h4>
<p>You might be skeptical of using synthetic data. After all, it’s not real data, so how can it be a good proxy? In my experience, it works surprisingly well. Some of my favorite AI products, like <a href="https://hex.tech/">Hex</a>, use synthetic data to power their evals:</p>
<blockquote class="blockquote">
<p>“LLMs are surprisingly good at generating excellent - and diverse - examples of user prompts. This can be relevant for powering application features, and sneakily, for building Evals. If this sounds a bit like the Large Language Snake is eating its tail, I was just as surprised as you! All I can say is: it works, ship it.” <em><a href="https://www.linkedin.com/in/bryan-bischof/">Bryan Bischof</a>, Head of AI Engineering at Hex</em></p>
</blockquote>
</section>
</section>
<section id="next-steps-1" class="level3">
<h3 class="anchored" data-anchor-id="next-steps-1">Next Steps</h3>
<p>With your dataset ready, now comes the most important part: getting your principal domain expert to evaluate the interactions.</p>
</section>
</section>
<section id="step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques" class="level2">
<h2 class="anchored" data-anchor-id="step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques">Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques</h2>
<p>The domain expert’s job is to focus on one thing: <strong>“Did the AI achieve the desired outcome?”</strong> No complex scoring scales or multiple metrics. Just a clear <strong>pass or fail</strong> decision. In addition to the pass/fail decision, the domain expert should write a critique that explains their reasoning.</p>
<section id="why-are-simple-passfail-metrics-important" class="level3">
<h3 class="anchored" data-anchor-id="why-are-simple-passfail-metrics-important">Why are simple pass/fail metrics important?</h3>
<ul>
<li><p><strong>Clarity and Focus</strong>: A binary decision forces everyone to consider what truly matters. It simplifies the evaluation to a single, crucial question.</p></li>
<li><p><strong>Actionable Insights</strong>: Pass/fail judgments are easy to interpret and act upon. They help you quickly identify whether the AI meets the user’s needs.</p></li>
<li><p><strong>Forces Articulation of Expectations</strong>: When domain experts must decide if an interaction passes or fails, they are compelled to articulate their expectations clearly. This process uncovers nuances and unspoken assumptions about how the AI should behave.</p></li>
<li><p><strong>Efficient Use of Resources</strong>: Keeps the evaluation process manageable, especially when starting out. You avoid getting bogged down in detailed metrics that might not be meaningful yet.</p></li>
</ul>
</section>
<section id="the-role-of-critiques" class="level3">
<h3 class="anchored" data-anchor-id="the-role-of-critiques">The Role of Critiques</h3>
<p>Alongside a binary pass/fail judgment, it’s important to write a detailed critique of the LLM-generated output. These critiques:</p>
<ul>
<li><p><strong>Capture Nuances</strong>: The critique allows you to note if something was mostly correct but had areas for improvement.</p></li>
<li><p><strong>Guide Improvement</strong>: Detailed feedback provides specific insights into how the AI can be enhanced.</p></li>
<li><p><strong>Balance Simplicity with Depth</strong>: While the pass/fail offers a clear verdict, the critique offers the depth needed to understand the reasoning behind the judgment.</p></li>
</ul>
<p><strong>Why Write Critiques?</strong></p>
<p>In practice, domain experts may not have fully internalized all the judgment criteria. By forcing them to make a pass/fail decision and explain their reasoning, they clarify their expectations and provide valuable guidance for refining the AI.</p>
<p><em>“But my problem is complex!”</em> Trust me—starting simple forces you to focus on what truly matters. You can introduce more complexity later if needed.</p>
</section>
<section id="examples-of-good-critiques" class="level3">
<h3 class="anchored" data-anchor-id="examples-of-good-critiques">Examples of Good Critiques</h3>
<p>To illustrate how simple pass/fail judgments combined with detailed critiques work in practice, here’s a table showcasing examples of user interactions with an AI assistant. The table includes both <strong>passes</strong> and <strong>fails</strong>, with critiques explaining why the AI received that judgment. In cases where the AI passed despite critical issues, the critique highlights these aspects and justifies why it still passed overall. For failed interactions, the critique explains the critical elements that led to the failure.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 51%">
<col style="width: 24%">
<col style="width: 24%">
</colgroup>
<thead>
<tr class="header">
<th><strong>User Interaction with AI</strong></th>
<th><strong>Judgment</strong></th>
<th><strong>Critique</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>User:</strong> “I need to cancel my flight booking for next week.”<br><strong>AI:</strong> “Your flight booking for next week has been canceled. A confirmation email has been sent to you.”</td>
<td><strong>Pass</strong></td>
<td>The AI successfully executed the user’s request by canceling the flight and sending a confirmation, achieving the primary goal. <strong>However</strong>, it should have asked for confirmation before proceeding to prevent accidental cancellations. Despite this critical oversight, the AI fulfilled the user’s request, so it <strong>passes</strong>, but adding a confirmation step would improve user safety.</td>
</tr>
<tr class="even">
<td><strong>User:</strong> “What’s the capital of France?”<br><strong>AI:</strong> “The capital of France is Paris.”</td>
<td><strong>Pass</strong></td>
<td>The AI correctly answered the user’s question, providing accurate information promptly. <strong>Although</strong> the response is sufficient, the AI could enhance the interaction by offering additional interesting facts or asking if the user has more questions. Despite this opportunity for added engagement, the AI met the user’s primary need, so it <strong>passes</strong>.</td>
</tr>
<tr class="odd">
<td><strong>User:</strong> “I can’t log into my account; it says my password is incorrect.”<br><strong>AI:</strong> “Please reset your password by clicking ‘Forgot Password’.”</td>
<td><strong>Fail</strong></td>
<td>The AI provided a generic solution without acknowledging the user’s frustration or checking for other issues like account lockout. It failed to offer empathetic support or additional assistance. Due to the lack of personalized help and poor user experience, it <strong>fails</strong>.</td>
</tr>
<tr class="even">
<td><strong>User:</strong> “Book a table for two at an Italian restaurant tonight at 7 PM.”<br><strong>AI:</strong> “Table for two at Luigi’s Italian Restaurant booked for tonight at 7 PM. Confirmation number is 12345.”</td>
<td><strong>Pass</strong></td>
<td>The AI efficiently completed the booking as requested, providing all necessary details and a confirmation number. <strong>However</strong>, it could have asked if the user had seating preferences or dietary restrictions. The AI accomplished the primary task successfully, so it <strong>passes</strong>, with minor opportunities for enhanced service.</td>
</tr>
<tr class="odd">
<td><strong>User:</strong> “I need help with my billing statement.”<br><strong>AI:</strong> “Please provide your account number and full name.”</td>
<td><strong>Fail</strong></td>
<td>The AI requested sensitive information without verifying the user’s identity through secure means, potentially compromising security. It should guide the user through a secure authentication process first. Due to this critical oversight in user data protection, it <strong>fails</strong>.</td>
</tr>
</tbody>
</table>
<p>These examples demonstrate how the AI can receive both <strong>“Pass”</strong> and <strong>“Fail”</strong> judgments. In the critiques:</p>
<ul>
<li><p>For <strong>passes</strong>, we explain why the AI succeeded in meeting the user’s primary need, even if there were critical aspects that could be improved. We highlight these areas for enhancement while justifying the overall passing judgment.</p></li>
<li><p>For <strong>fails</strong>, we identify the critical elements that led to the failure, explaining why the AI did not meet the user’s main objective or compromised important factors like user experience or security.</p></li>
</ul>
<p>Most importantly, <strong>the critique should be detailed enough so that you can use it in a few-shot prompt for an LLM judge</strong>. In other words, it should be detailed enough that a new employee could understand it. Being too terse is a common mistake.</p>
<p>Note that the example user interactions with the AI are simplified for brevity, but you might need to give the domain expert more context to make a judgment. More on that later.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>At this point, you don’t need to perform a root cause analysis into the technical reasons behind why the AI failed. Many times, it’s useful to get a sense of overall behavior before diving into the weeds.</p>
</div>
</div>
</section>
<section id="dont-stray-from-binary-passfail-judgments-when-starting-out" class="level3">
<h3 class="anchored" data-anchor-id="dont-stray-from-binary-passfail-judgments-when-starting-out">Don’t stray from binary pass/fail judgments when starting out</h3>
<p>A common mistake is straying from binary pass/fail judgments. Let’s revisit the dashboard from earlier:</p>
<p><img src="https://hamel.dev/blog/posts/llm-judge/dashboard.png" class="img-fluid"></p>
<p>If your evaluations consist of a bunch of metrics that LLMs score on a 1-5 scale (or any other scale), you’re doing it wrong. Let’s unpack why.</p>
<ol type="1">
<li><strong>It’s not actionable</strong>: People don’t know what to do with a 3 or 4. It’s not immediately obvious how this number is better than a 2. You need to be able to say “this interaction passed because…” and “this interaction failed because…”.</li>
<li>More often than not, <strong>these metrics do not matter</strong>. Every time I’ve analyzed data on domain expert judgments, they tend not to correlate with these kinds of metrics. By having a domain expert make a binary judgment, you can figure out what truly matters.</li>
</ol>
<p>This is why I hate the off-the-shelf metrics that come with many evaluation frameworks. They tend to lead people astray.</p>
<p><strong>Common Objections to Pass/Fail Judgments:</strong></p>
<ul>
<li>“The business said that these 8 dimensions are important, so we need to evaluate all of them.”</li>
<li>“We need to be able to say why an interaction passed or failed.”</li>
</ul>
<p>I can guarantee you that if someone says you need to measure 8 things on a 1-5 scale, they don’t know what they are looking for. They are just guessing. You have to let the domain expert drive and make a pass/fail judgment with critiques so you can figure out what truly matters. Stand your ground here.</p>
</section>
<section id="make-it-easy-for-the-domain-expert-to-review-data" class="level3">
<h3 class="anchored" data-anchor-id="make-it-easy-for-the-domain-expert-to-review-data">Make it easy for the domain expert to review data</h3>
<p>Finally, you need to remove all friction from reviewing data. I’ve written about this <a href="../../../notes/llm/finetuning/data_cleaning.html">here</a>. Sometimes, you can just use a spreadsheet. It’s a judgment call in terms of what is easiest for the domain expert. I’ve found that I often have to provide additional context to help the domain expert understand the user interaction, such as:</p>
<ul>
<li>Metadata about the user, such as their location, subscription tier, etc.</li>
<li>Additional context about the system, such as the current time, inventory levels, etc.</li>
<li>Resources so you can check if the AI’s response is correct (e.g., the ability to search a database)</li>
</ul>
<p>All of this data needs to be presented on a single screen so the domain expert can review it without jumping through hoops. That’s why I recommend building <a href="../../../notes/llm/finetuning/data_cleaning.html">a simple web app</a> to review data.</p>
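<p>As a minimal illustration of the “single screen” idea, here is a stdlib-only sketch that renders each interaction together with its metadata and context into one HTML table. This is not the web app from the linked post, just a toy; a real review tool would also capture the expert’s pass/fail judgment and critique, and the field names here are assumptions.</p>

```python
import html

def render_review_page(interactions):
    """Render interactions plus all their context as a single HTML table."""
    rows = []
    for it in interactions:
        rows.append(
            "<tr>"
            f"<td>{html.escape(it['user_input'])}</td>"
            f"<td>{html.escape(it['ai_response'])}</td>"
            f"<td>{html.escape(it.get('user_metadata', ''))}</td>"
            f"<td>{html.escape(it.get('system_context', ''))}</td>"
            "</tr>"
        )
    return (
        "<table><tr><th>User Input</th><th>AI Response</th>"
        "<th>User Metadata</th><th>System Context</th></tr>"
        + "".join(rows) + "</table>"
    )

page = render_review_page([{
    "user_input": "Where's my order?",
    "ai_response": "You have three active orders...",
    "user_metadata": "Premium tier, US",
    "system_context": "Orders #123, #124, #125 active",
}])
```

The key design point is that everything the expert needs sits in one row, so no tab-switching is required to make a judgment.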
</section>
<section id="how-many-examples-do-you-need" class="level3">
<h3 class="anchored" data-anchor-id="how-many-examples-do-you-need">How many examples do you need?</h3>
<p>The number of examples you need depends on the complexity of the task. My heuristic is to start with around 30 examples and keep going until I stop seeing new failure modes; from there, I continue until I’m no longer learning anything new.</p>
<p>Next, we’ll look at how to use this data to build an LLM judge.</p>
</section>
</section>
<section id="step-4-fix-errors" class="level2">
<h2 class="anchored" data-anchor-id="step-4-fix-errors">Step 4: Fix Errors</h2>
<p>After looking at the data, it’s likely you will find errors in your AI system. Instead of plowing ahead and building an LLM judge, you want to fix any obvious errors. Remember, the whole point of the LLM as a judge is to help you find these errors, so it’s totally fine if you find them earlier!</p>
<p>If you have already developed <a href="https://hamel.dev/blog/posts/evals">Level 1 evals as outlined in my previous post</a>, you should not have any pervasive errors. However, these errors can sometimes slip through the cracks. If you find pervasive errors, fix them and go back to step 3. Keep iterating until you feel like you have stabilized your system.</p>
</section>
<section id="step-5-build-your-llm-as-a-judge-iteratively" class="level2">
<h2 class="anchored" data-anchor-id="step-5-build-your-llm-as-a-judge-iteratively">Step 5: Build Your LLM as A Judge, Iteratively</h2>
<section id="the-hidden-power-of-critiques" class="level3">
<h3 class="anchored" data-anchor-id="the-hidden-power-of-critiques">The Hidden Power of Critiques</h3>
<p>You cannot write a good judge prompt until you’ve seen the data. <a href="https://arxiv.org/abs/2404.12272">The paper from Shankar et al.,</a> “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences” summarizes this well:</p>
<blockquote class="blockquote">
<p>to grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria. We dub this phenomenon criteria drift, and it implies that it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs.</p>
</blockquote>
</section>
<section id="start-with-expert-examples" class="level3">
<h3 class="anchored" data-anchor-id="start-with-expert-examples">Start with Expert Examples</h3>
<p>Let me share a real-world example of building an LLM judge you can apply to your own use case. When I was helping Honeycomb build their <a href="https://www.honeycomb.io/blog/introducing-query-assistant">Query Assistant feature</a>, we needed a way to evaluate if the AI was generating good queries. Here’s what our LLM judge prompt looked like, including few-shot examples of critiques from our domain expert, <a href="https://x.com/_cartermp">Phillip</a>:</p>
<pre class="text"><code>You are a Honeycomb query evaluator with advanced capabilities to judge if a query is good or not.
You understand the nuances of the Honeycomb query language, including what is likely to be
most useful from an analytics perspective. 

Here is information about the Honeycomb query language:
{{query_language_info}}

Here are some guidelines for evaluating queries:
{{guidelines}}

Example evaluations:

&lt;examples&gt;

&lt;example-1&gt;
&lt;nlq&gt;show me traces where ip is 10.0.2.90&lt;/nlq&gt;
&lt;query&gt;
{
  "breakdowns": ["trace.trace_id"],
  "calculations": [{"op": "COUNT"}],
  "filters": [{"column": "net.host.ip", "op": "=", "value": "10.0.2.90"}]
}
&lt;/query&gt;
&lt;critique&gt;
{
  "critique": "The query correctly filters for traces with an IP address of 10.0.2.90 
   and counts the occurrences of those traces, grouped by trace.trace_id. The response 
   is good as it meets the requirement of showing traces from a specific IP address 
   without additional complexities.",
  "outcome": "good"
}
&lt;/critique&gt;
&lt;/example-1&gt;

&lt;example-2&gt;
&lt;nlq&gt;show me slowest trace&lt;/nlq&gt;
&lt;query&gt;
{
  "calculations": [{"column": "duration_ms", "op": "MAX"}],
  "orders": [{"column": "duration_ms", "op": "MAX", "order": "descending"}],
  "limit": 1
}
&lt;/query&gt;
&lt;critique&gt;
{
  "critique": "While the query attempts to find the slowest trace using MAX(duration_ms) 
   and ordering correctly, it fails to group by trace.trace_id. Without this grouping, 
   the query only shows the MAX(duration_ms) measurement over time, not the actual 
   slowest trace.",
  "outcome": "bad"
}
&lt;/critique&gt;
&lt;/example-2&gt;

&lt;example-3&gt;
&lt;nlq&gt;count window-hash where window-hash exists per hour&lt;/nlq&gt;
&lt;query&gt;
{
  "breakdowns": ["window-hash"],
  "calculations": [{"op": "COUNT"}],
  "filters": [{"column": "window-hash", "op": "exists"}],
  "time_range": 3600
}
&lt;/query&gt;
&lt;critique&gt;
{
  "critique": "While the query correctly counts window-hash occurrences, the time_range 
   of 3600 seconds (1 hour) is insufficient for per-hour analysis. When we say 'per hour', 
   we need a time_range of at least 36000 seconds to show meaningful hourly patterns.",
  "outcome": "bad"
}
&lt;/critique&gt;
&lt;/example-3&gt;

&lt;/examples&gt;

For the following query, first write a detailed critique explaining your reasoning,
then provide a pass/fail judgment in the same format as above.

&lt;nlq&gt;{{user_input}}&lt;/nlq&gt;
&lt;query&gt;
{{generated_query}}
&lt;/query&gt;
&lt;critique&gt;</code></pre>
<p>Notice how each example includes:</p>
<ol type="1">
<li>The natural language query (NLQ) in <code>&lt;nlq&gt;</code> tags</li>
<li>The generated query in <code>&lt;query&gt;</code> tags</li>
<li>The critique and outcome in <code>&lt;critique&gt;</code> tags</li>
</ol>
<p>In the prompt above, the example critiques are fixed. An advanced approach is to include examples dynamically based upon the item you are judging. You can learn more in <a href="https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/">this post about Continual In-Context Learning</a>.</p>
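<p>A small sketch of how expert-labeled rows can be templated into the XML-tagged few-shot format shown above. The tag names mirror the Honeycomb prompt; the example data is invented for illustration.</p>

```python
import json

def render_example(i, nlq, query, critique, outcome):
    """Render one labeled row in the <nlq>/<query>/<critique> format."""
    critique_json = json.dumps({"critique": critique, "outcome": outcome}, indent=2)
    return (
        f"<example-{i}>\n"
        f"<nlq>{nlq}</nlq>\n"
        f"<query>\n{json.dumps(query, indent=2)}\n</query>\n"
        f"<critique>\n{critique_json}\n</critique>\n"
        f"</example-{i}>"
    )

example = render_example(
    1,
    "show me traces where ip is 10.0.2.90",
    {"calculations": [{"op": "COUNT"}]},
    "Correctly counts matching traces without extra complexity.",
    "good",
)
print(example)
```

Templating the examples this way makes it trivial to regenerate the judge prompt whenever the expert adds or revises a labeled row, whether you select examples statically or dynamically.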
</section>
<section id="keep-iterating-on-the-prompt-until-convergence-with-domain-expert" class="level3">
<h3 class="anchored" data-anchor-id="keep-iterating-on-the-prompt-until-convergence-with-domain-expert">Keep Iterating on the Prompt Until Convergence With Domain Expert</h3>
<p>In this case, I used a low-tech approach to iterate on the prompt. I sent Phillip a spreadsheet with the following information:</p>
<ol type="1">
<li>The NLQ</li>
<li>The generated query</li>
<li>The critique</li>
<li>The outcome (pass or fail)</li>
</ol>
<p>Phillip would then fill out his own version of the spreadsheet with his critiques. I used this to iteratively improve the prompt. The spreadsheet looked like this:</p>
<p><img src="https://hamel.dev/blog/posts/llm-judge/spreadsheet.png" class="img-fluid"></p>
<p>I also tracked agreement rates over time to ensure we were converging on a good prompt.</p>
<p><img src="https://hamel.dev/blog/posts/llm-judge/score.png" class="img-fluid"></p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Important Note on Using Agreement as a Metric
</div>
</div>
<div class="callout-body-container callout-body">
<p>In this example, we used agreement between the model and human evaluator because our dataset was roughly balanced (about 50% of instances were failures). However, using raw agreement is generally not recommended and can be misleading when classes are imbalanced. Instead, you should typically measure <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall">precision and recall</a> separately to get a more accurate picture of your judge’s alignment.</p>
</div>
</div>
<p>It took us only three iterations to achieve &gt; 90% agreement between the LLM and Phillip. Your mileage may vary depending on the complexity of the task. For example, <a href="https://humanloop.com/blog/why-your-product-needs-evals">Swyx has conducted a similar process hundreds of times</a> for <a href="https://www.latent.space/">AI News</a>, an <a href="https://x.com/swyx/status/1672306744884887553">extremely popular</a> news aggregator with high-quality recommendations. The AI quality that results from this process is why the product has received <a href="https://buttondown.com/ainews">critical acclaim</a>.</p>
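<p>The agreement and precision/recall numbers discussed above are simple to compute by hand. A minimal sketch, treating “fail” as the positive class (i.e., measuring how reliably the judge catches real failures):</p>

```python
def alignment_metrics(human, judge):
    """Compare human and LLM-judge labels ('pass'/'fail') pairwise."""
    assert len(human) == len(judge)
    agree = sum(h == j for h, j in zip(human, judge))
    tp = sum(h == "fail" and j == "fail" for h, j in zip(human, judge))
    fp = sum(h == "pass" and j == "fail" for h, j in zip(human, judge))
    fn = sum(h == "fail" and j == "pass" for h, j in zip(human, judge))
    return {
        "agreement": agree / len(human),
        # Of the interactions the judge flagged as failures, how many really failed?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real failures, how many did the judge catch?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

m = alignment_metrics(
    ["pass", "fail", "fail", "pass", "fail"],
    ["pass", "fail", "pass", "pass", "fail"],
)
print(m)  # agreement 0.8, precision 1.0, recall ~0.67
```

On an imbalanced dataset, the precision and recall numbers will tell you far more than the raw agreement figure, which can look high even when the judge misses most failures.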
</section>
<section id="how-to-optimize-the-llm-judge-prompt" class="level3">
<h3 class="anchored" data-anchor-id="how-to-optimize-the-llm-judge-prompt">How to Optimize the LLM Judge Prompt?</h3>
<p>I usually adjust the prompts by hand. I haven’t had much luck with prompt optimizers like DSPy. However, my friend <a href="https://eugeneyan.com/">Eugene Yan</a> has just released a promising tool named <a href="https://eugeneyan.com/writing/aligneval/">ALIGN Eval</a>. I like it because it’s simple and effective. Also, don’t forget the approach of <a href="https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/">continual in-context learning</a> mentioned earlier - it can be effective when implemented correctly.</p>
<p>In rare cases, I might fine-tune a judge, but I prefer not to. I talk about this more in the FAQ section.</p>
</section>
<section id="the-human-side-of-the-process" class="level3">
<h3 class="anchored" data-anchor-id="the-human-side-of-the-process">The Human Side of the Process</h3>
<p>Something unexpected happened during this process. <a href="https://www.linkedin.com/in/phillip-carter-4714a135/">Phillip Carter</a>, our domain expert at Honeycomb, found that reviewing the LLM’s critiques helped him articulate his own evaluation criteria more clearly. He said,</p>
<blockquote class="blockquote">
<p>“Seeing how the LLM breaks down its reasoning made me realize I wasn’t being consistent about how I judged certain edge cases.”</p>
</blockquote>
<p>This is a pattern I’ve seen repeatedly—the process of building an LLM judge often helps standardize evaluation criteria.</p>
<p>Furthermore, because this process forces the domain expert to look at data carefully, I always uncover new insights about the product, AI capabilities, and user needs. The resulting benefits are often <em>more valuable</em> than creating an LLM judge!</p>
</section>
<section id="how-often-should-you-evaluate" class="level3">
<h3 class="anchored" data-anchor-id="how-often-should-you-evaluate">How Often Should You Evaluate?</h3>
<p>I conduct this human review at regular intervals and whenever something material changes. For example, if I update a model, I’ll run the process again. I don’t get too scientific here; instead, I rely on my best judgment. Also note that after the first two iterations, I tend to focus more on errors rather than sampling randomly. For example, if I find an error, I’ll search for more examples that I think might trigger the same error. However, I always do a bit of random sampling as well.</p>
</section>
<section id="what-if-this-doesnt-work" class="level3">
<h3 class="anchored" data-anchor-id="what-if-this-doesnt-work">What if this doesn’t work?</h3>
<p>I’ve seen this process fail when:</p>
<ul>
<li>The AI is overscoped: Example - a chatbot in a SaaS product that promises to do anything you want.</li>
<li>The process is not followed correctly: Not using the principal domain expert, not writing proper critiques, etc.</li>
<li>The expectations of alignment are unrealistic or not feasible.</li>
</ul>
<p>In each of these cases, I try to address the root cause instead of trying to force alignment. Sometimes, you may not be able to achieve the alignment you want and may have to lean heavier on human annotations. However, after following the process described here, you will have metrics that help you understand how much you can trust the LLM judge.</p>
</section>
<section id="mistakes-ive-noticed-in-llm-judge-prompts" class="level3">
<h3 class="anchored" data-anchor-id="mistakes-ive-noticed-in-llm-judge-prompts">Mistakes I’ve noticed in LLM judge prompts</h3>
<p>Most of the mistakes I’ve seen in LLM judge prompts have to do with not providing good examples:</p>
<ol type="1">
<li>Not providing any critiques.</li>
<li>Writing extremely terse critiques.</li>
<li>Not providing external context. Your examples should contain the same information you use to evaluate, including external information like user metadata, system information, etc.</li>
<li>Not providing diverse examples. You need a wide variety of examples to ensure that your judge works for a wide variety of inputs.</li>
</ol>
<p>Sometimes you may struggle to fit everything you need into the prompt and have to get creative about how you structure the examples. However, this is becoming less of an issue thanks to expanding context windows and <a href="https://platform.openai.com/docs/guides/prompt-caching">prompt caching</a>.</p>
</section>
</section>
<section id="step-6-perform-error-analysis" class="level2">
<h2 class="anchored" data-anchor-id="step-6-perform-error-analysis">Step 6: Perform Error Analysis</h2>
<p>After you have created an LLM judge, you will have a dataset of user interactions with the AI along with the LLM’s judgments. If your metrics show acceptable agreement between the domain expert and the LLM judge, you can apply the judge to real or synthetic interactions. You can then calculate error rates for different dimensions of your data. Calculate the error on unseen data only, so you aren’t getting biased results.</p>
<p>For example, if you have segmented your data by persona, scenario, feature, etc., your analysis may look like this:</p>
<p><strong>Error Rates by Key Dimensions</strong></p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Feature</th>
<th>Scenario</th>
<th>Persona</th>
<th>Total Examples</th>
<th>Failure Rate</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Order Tracking</td>
<td>Multiple Matches</td>
<td>New User</td>
<td>42</td>
<td>24.3%</td>
</tr>
<tr class="even">
<td>Order Tracking</td>
<td>Multiple Matches</td>
<td>Expert User</td>
<td>38</td>
<td>18.4%</td>
</tr>
<tr class="odd">
<td>Order Tracking</td>
<td>No Matches</td>
<td>Expert User</td>
<td>30</td>
<td>23.3%</td>
</tr>
<tr class="even">
<td>Order Tracking</td>
<td>No Matches</td>
<td>New User</td>
<td>20</td>
<td>75.0%</td>
</tr>
<tr class="odd">
<td>Contact Search</td>
<td>Multiple Matches</td>
<td>New User</td>
<td>35</td>
<td>22.9%</td>
</tr>
<tr class="even">
<td>Contact Search</td>
<td>Multiple Matches</td>
<td>Expert User</td>
<td>32</td>
<td>19.7%</td>
</tr>
<tr class="odd">
<td>Contact Search</td>
<td>No Matches</td>
<td>New User</td>
<td>25</td>
<td>68.0%</td>
</tr>
<tr class="even">
<td>Contact Search</td>
<td>No Matches</td>
<td>Expert User</td>
<td>28</td>
<td>21.4%</td>
</tr>
</tbody>
</table>
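<p>As a minimal sketch, a table like the one above can be computed by grouping judged traces along your segmentation dimensions. The records and field names here (<code>feature</code>, <code>scenario</code>, <code>persona</code>, <code>passed</code>) are hypothetical; the grouping logic is the point:</p>

```python
from collections import defaultdict

# Hypothetical judged traces: each record carries its segment dimensions
# and the judge's pass/fail verdict.
traces = [
    {"feature": "Order Tracking", "scenario": "No Matches", "persona": "New User", "passed": False},
    {"feature": "Order Tracking", "scenario": "No Matches", "persona": "New User", "passed": True},
    {"feature": "Contact Search", "scenario": "Multiple Matches", "persona": "Expert User", "passed": True},
]

totals = defaultdict(int)
failures = defaultdict(int)
for t in traces:
    key = (t["feature"], t["scenario"], t["persona"])
    totals[key] += 1
    if not t["passed"]:
        failures[key] += 1

# Failure rate per segment, sorted worst-first so you know where to look
rates = sorted(
    ((key, failures[key] / totals[key], totals[key]) for key in totals),
    key=lambda row: row[1],
    reverse=True,
)
for (feature, scenario, persona), rate, n in rates:
    print(f"{feature} | {scenario} | {persona}: {rate:.1%} of {n}")
```

<p>The same grouping is a one-liner in a dataframe library if your traces are already tabular; what matters is slicing failure rates by the dimensions you defined in earlier steps.</p>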
<section id="classify-traces" class="level3">
<h3 class="anchored" data-anchor-id="classify-traces">Classify Traces</h3>
<p>Once you know where the errors are, you can perform an error analysis to get to the root cause. My favorite way is to look at examples of each type of error and classify them by hand. I recommend using a spreadsheet for this. For example, a trace for Order Tracking where there are no matches for new users might look like this:</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Example Trace
</div>
</div>
<div class="callout-body-container callout-body">
<p>In this example trace, the user provides an invalid order number. The AI correctly identifies that the order number is invalid but provides an unhelpful response. If you are not familiar with logging LLM traces, refer to my <a href="https://hamel.dev/blog/posts/evals/">previous post on evals</a>.</p>
<p>Note that this trace is formatted for readability.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-2"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"user_input"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Where's my order #ABC123?"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-3"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"function_calls"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-4">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-5">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"search_order_database"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-6">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"args"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"order_id"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ABC123"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb2-7">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"result"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-8">       <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"status"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"not_found"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-9">       <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"valid_patterns"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"XXX-XXX-XXX"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb2-10">     <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-11">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-12">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-13">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"retrieve_context"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-14">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"result"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-15">       <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"relevant_docs"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-16">         <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Order numbers follow format XXX-XXX-XXX"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-17">         <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"New users should check confirmation email"</span></span>
<span id="cb2-18">       <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb2-19">     <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-20">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-21"> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-22"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"llm_intermediate_steps"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb2-23">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb2-24">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"thought"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"User is new and order format is invalid"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-25">     <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"action"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Generate help message with format info"</span></span>
<span id="cb2-26">   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb2-27"> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb2-28"> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"final_response"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I cannot find that order #. Please check the number and try again."</span></span>
<span id="cb2-29"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
</div>
</div>
<p>In this case, you might classify the error as: <code>Missing User Education</code>. The system retrieved new user context and format information but failed to include it in the response, which suggests we could improve our prompt. After you have classified a number of errors, you can calculate the distribution of errors by root cause. That might look like this:</p>
<p><strong>Root Cause Distribution (20 Failed Interactions)</strong></p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Root Cause</th>
<th>Count</th>
<th>Percentage</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Missing User Education</td>
<td>8</td>
<td>40%</td>
</tr>
<tr class="even">
<td>Authentication/Access Issues</td>
<td>6</td>
<td>30%</td>
</tr>
<tr class="odd">
<td>Poor Context Handling</td>
<td>4</td>
<td>20%</td>
</tr>
<tr class="even">
<td>Inadequate Error Messages</td>
<td>2</td>
<td>10%</td>
</tr>
</tbody>
</table>
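<p>Tallying a distribution like the one above from your hand-applied spreadsheet labels is a few lines of code. The label strings below are taken from the table; the list itself is an abbreviated, hypothetical sample:</p>

```python
from collections import Counter

# Root-cause labels applied by hand during the spreadsheet pass
labels = [
    "Missing User Education", "Missing User Education",
    "Authentication/Access Issues",
    "Poor Context Handling",
]

counts = Counter(labels)
total = len(labels)
for cause, n in counts.most_common():  # sorted most frequent first
    print(f"{cause}: {n} ({n / total:.0%})")
```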
<p>Now you know where to focus your efforts. This doesn’t have to take an extraordinary amount of time. You can get quite far in just 15 minutes. You can also use an LLM to help you with this classification, but that is beyond the scope of this post (you can use an LLM to help you do anything in this post, as long as you have a process to verify the results).</p>
</section>
<section id="an-interactive-walkthrough-of-error-analysis" class="level3">
<h3 class="anchored" data-anchor-id="an-interactive-walkthrough-of-error-analysis">An Interactive Walkthrough of Error Analysis</h3>
<p>Error analysis has been around in Machine Learning for quite some time. This video by Andrew Ng does a great job of walking through the process interactively:</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/JoAxZsdw_3w?si=R4P1awrDBIYDhJS9" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
<section id="fix-your-errors-again" class="level3">
<h3 class="anchored" data-anchor-id="fix-your-errors-again">Fix Your Errors, Again</h3>
<p>Now that you have a sense of the errors, you can go back and fix them again. Go back to step 3 and iterate until you are satisfied. Note that every time you fix an error, you should try to write a test case for it. Sometimes, this can be an assertion in your test suite, but other times you may need to create a more “specialized” LLM judge for these failures. We’ll talk about this next.</p>
</section>
<section id="doing-this-well-requires-data-literacy" class="level3">
<h3 class="anchored" data-anchor-id="doing-this-well-requires-data-literacy">Doing this well requires data literacy</h3>
<p>Investigating your data is much harder in practice than I made it look in this post. It requires a nose for data that only comes from practice. It also helps to have some basic familiarity with statistics and data analysis tools. My favorite post on data literacy is <a href="https://jxnl.co/writing/2024/06/02/10-ways-to-be-data-illiterate-and-how-to-avoid-them/">this one</a> by Jason Liu and Eugene Yan.</p>
</section>
</section>
<section id="step-7-create-more-specialized-llm-judges-if-needed" class="level2">
<h2 class="anchored" data-anchor-id="step-7-create-more-specialized-llm-judges-if-needed">Step 7: Create More Specialized LLM Judges (if needed)</h2>
<p>Now that you have a sense of where the problems in your AI are, you can decide whether and where to invest in more targeted LLM judges. For example, if you find that the AI has trouble citing sources correctly, you can create a targeted eval for that. You might not even need an LLM judge for some errors (a code-based assertion may suffice).</p>
<p>The key takeaway: don’t jump directly to specialized LLM judges until you have gone through this critique shadowing process. It will help you rationalize where to invest your time.</p>
</section>
<section id="recap-of-critique-shadowing" class="level2">
<h2 class="anchored" data-anchor-id="recap-of-critique-shadowing">Recap of Critique Shadowing</h2>
<p>Using an LLM as a judge can streamline your AI evaluation process if approached correctly. Here’s a visual illustration of the process (there is a description of the process below the diagram as well):</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TB
    A[Start] --&gt; B[1 Find Principal Domain Expert]
    B --&gt; C[2 Create Dataset]
    C --&gt; D[3 Domain Expert Reviews Data]
    D --&gt; E{Found Errors?}
    E --&gt;|Yes| F[4 Fix Errors]
    F --&gt; D
    E --&gt;|No| G[5 Build LLM Judge]
    G --&gt; H[Test Against Domain Expert]
    H --&gt; I{Acceptable Agreement?}
    I --&gt;|No| J[Refine Prompt]
    J --&gt; H
    I --&gt;|Yes| K[6 Perform Error Analysis]
    K --&gt; L{Critical Issues Found?}
    L --&gt;|Yes| M[7 Fix Issues &amp; Create Specialized Judges]
    M --&gt; D
    L --&gt;|No| N[Material Changes or Periodic Review?]
    N --&gt;|Yes| C
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<p>The Critique Shadowing process is iterative, with feedback loops. Let’s list out the steps:</p>
<ol type="1">
<li>Find Principal Domain Expert</li>
<li>Create A Dataset
<ul>
<li>Generate diverse examples covering your use cases</li>
<li>Include real or synthetic user interactions</li>
</ul></li>
<li>Domain Expert Reviews Data
<ul>
<li>Expert makes pass/fail judgments</li>
<li>Expert writes detailed critiques explaining their reasoning</li>
</ul></li>
<li>Fix Errors (if found)
<ul>
<li>Address any issues discovered during review</li>
<li>Return to expert review to verify fixes</li>
<li>Go back to step 3 if errors are found</li>
</ul></li>
<li>Build LLM Judge
<ul>
<li>Create prompt using expert examples</li>
<li>Test against expert judgments</li>
<li>Refine prompt until agreement is satisfactory</li>
</ul></li>
<li>Perform Error Analysis
<ul>
<li>Calculate error rates across different dimensions</li>
<li>Identify patterns and root causes</li>
<li>Fix errors and go back to step 3 if needed</li>
<li>Create specialized judges as needed</li>
</ul></li>
</ol>
<p>This process never truly ends. It repeats periodically or when material changes occur.</p>
<section id="its-not-the-judge-that-created-value-after-all" class="level3">
<h3 class="anchored" data-anchor-id="its-not-the-judge-that-created-value-after-all">It’s Not The Judge That Created Value, After All</h3>
<p>The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating an LLM judge is a nice “hack” I use to trick people into carefully looking at their data!</p>
<p>That’s right. The real business value comes from looking at your data. But hey, potato, potahto.</p>
</section>
<section id="do-you-really-need-this" class="level3">
<h3 class="anchored" data-anchor-id="do-you-really-need-this">Do You Really Need This?</h3>
<p>Phew, this seems like a lot of work! Do you really need this? Well, it depends. There are cases where you can take a shortcut through this process. For example, let’s say:</p>
<ol type="1">
<li>You are an independent developer who is also a domain expert.</li>
<li>You are working with test data that is already available (tweets, etc.).</li>
<li>Looking at data is not costly (e.g., you can manually review enough data in a few hours).</li>
</ol>
<p>In this scenario, you can jump directly to something that looks like step 3 and start looking at data right away. Also, since it’s not that costly to look at data, it’s probably fine to just do error analysis without a judge (at least initially). You can incorporate what you learn directly back into your primary model right away. This example is not exhaustive, but gives you an idea of how you can adapt this process to your needs.</p>
<p>However, you can never completely eliminate looking at your data! This is precisely the step that most people skip. Don’t be that person.</p>
</section>
</section>
<section id="faq" class="level2">
<h2 class="anchored" data-anchor-id="faq">FAQ</h2>
<p>I received <a href="https://x.com/HamelHusain/status/1850256204553244713">a lot of questions</a> about this topic. Here are answers to the most common ones:</p>
<section id="if-i-have-a-good-judge-llm-isnt-that-also-the-llm-id-also-want-to-use" class="level3">
<h3 class="anchored" data-anchor-id="if-i-have-a-good-judge-llm-isnt-that-also-the-llm-id-also-want-to-use">If I have a good judge LLM, isn’t that also the LLM I’d want to use?</h3>
<p>Effective judges often use larger models or more compute (via longer prompts, chain-of-thought, etc.) than the systems they evaluate.</p>
<p>However, if the cost of the most powerful LLM is not prohibitive, and latency is not an issue, then you might want to consider where you invest your efforts differently. In this case, it might make sense to put more effort toward specialist LLM judges, <a href="https://hamel.dev/blog/posts/evals/#the-types-of-evaluation">code-based assertions, and A/B testing</a>. However, you should still go through the process of looking at data and critiquing the LLM’s output before you adopt specialized judges.</p>
</section>
<section id="do-you-recommend-fine-tuning-judges" class="level3">
<h3 class="anchored" data-anchor-id="do-you-recommend-fine-tuning-judges">Do you recommend fine-tuning judges?</h3>
<p>I prefer not to fine-tune LLM judges. I’d rather spend the effort fine-tuning the actual LLM instead. However, fine-tuning guardrails or other specialized judges can be useful (especially if they are small classifiers).</p>
<p>As a related note, you can leverage an LLM judge to curate and transform data for fine-tuning your primary model. For example, you can use the judge to:</p>
<ul>
<li>Eliminate bad examples for fine-tuning.</li>
<li>Generate higher quality outputs (by referencing the critique).</li>
<li>Simulate high quality chain-of-thought with critiques.</li>
</ul>
<p>Using an LLM judge to enhance fine-tuning data is even more compelling when you are trying to <a href="https://openai.com/index/api-model-distillation/">distill a large LLM into a smaller one</a>. The details of fine-tuning are beyond the scope of this post. If you are interested in learning more, see <a href="https://parlance-labs.com/education/#fine-tuning">these resources</a>.</p>
</section>
<section id="whats-wrong-with-off-the-shelf-llm-judges" class="level3">
<h3 class="anchored" data-anchor-id="whats-wrong-with-off-the-shelf-llm-judges">What’s wrong with off-the-shelf LLM judges?</h3>
<p>Nothing is strictly wrong with them; it’s just that many people are led astray by them. If you are disciplined, you can apply them to your data and see whether they tell you something valuable. However, I’ve found that these tend to cause more confusion than value.</p>
</section>
<section id="how-do-you-evaluate-the-llm-judge" class="level3">
<h3 class="anchored" data-anchor-id="how-do-you-evaluate-the-llm-judge">How do you evaluate the LLM judge?</h3>
<p>You will collect metrics on the agreement between the domain expert and the LLM judge. This tells you how much you can trust the judge and in what scenarios. Your domain expert doesn’t have to inspect every single example; you just need a representative sample so you have reliable statistics.</p>
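<p>As a sketch of what those agreement metrics can look like: raw agreement is the fraction of examples where the expert and judge give the same verdict, and Cohen’s kappa corrects that for agreement expected by chance. The verdict lists below are made-up data for illustration:</p>

```python
# Expert and judge verdicts on the same representative sample (hypothetical data)
expert = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
judge  = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]

n = len(expert)
agreement = sum(e == j for e, j in zip(expert, judge)) / n  # raw agreement (p_o)

# Chance agreement (p_e): probability both raters pick the same label at random,
# given each rater's observed label frequencies
labels = set(expert) | set(judge)
p_e = sum((expert.count(l) / n) * (judge.count(l) / n) for l in labels)

# Cohen's kappa: agreement above chance, scaled to [−1, 1]
kappa = (agreement - p_e) / (1 - p_e)
print(f"raw agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

<p>Raw agreement alone can be misleading when one verdict dominates (a judge that always says “pass” agrees often by accident), which is why a chance-corrected statistic is worth reporting alongside it.</p>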
</section>
<section id="what-model-do-you-use-for-the-llm-judge" class="level3">
<h3 class="anchored" data-anchor-id="what-model-do-you-use-for-the-llm-judge">What model do you use for the LLM judge?</h3>
<p>For the kind of judge articulated in this blog post, I like to use the most powerful model I can afford in my cost/latency budget. This budget might be different than my primary model, depending on the number of examples I need to score. This can vary significantly according to the use case.</p>
</section>
<section id="what-about-guardrails" class="level3">
<h3 class="anchored" data-anchor-id="what-about-guardrails">What about guardrails?</h3>
<p>Guardrails are a separate but related topic. They are a way to prevent the LLM from saying/doing something harmful or inappropriate. This blog post focuses on helping you create a judge that’s aligned with business goals, especially when starting out.</p>
</section>
<section id="im-using-llm-as-a-judge-and-getting-tremendous-value-but-i-didnt-follow-this-approach." class="level3">
<h3 class="anchored" data-anchor-id="im-using-llm-as-a-judge-and-getting-tremendous-value-but-i-didnt-follow-this-approach.">I’m using LLM as a judge, and getting tremendous value but I didn’t follow this approach.</h3>
<p>I believe you. This blog post is not the only way to use an LLM as a judge. In fact, I’ve seen people use an LLM as a judge in all sorts of creative ways, including ranking, classification, model selection, and so on. I’m focused on an approach that works well when you are getting started and avoids the pitfalls of confusing metric sprawl. However, the general process of looking at the data is still central no matter what kind of judge you are building.</p>
</section>
<section id="how-do-you-choose-between-traditional-ml-techniques-llm-as-a-judge-and-human-annotations" class="level3">
<h3 class="anchored" data-anchor-id="how-do-you-choose-between-traditional-ml-techniques-llm-as-a-judge-and-human-annotations">How do you choose between traditional ML techniques, LLM-as-a-judge and human annotations?</h3>
<p>The answer to this (and many other questions) is: do the simplest thing that works. And simple doesn’t always mean traditional ML techniques. Depending on your situation, it might be easier to use an LLM API as a classifier than to train a model and deploy it.</p>
</section>
<section id="can-you-make-judges-from-small-models" class="level3">
<h3 class="anchored" data-anchor-id="can-you-make-judges-from-small-models">Can you make judges from small models?</h3>
<p>Yes, potentially. I’ve only used the larger models for judges. You have to base the answer to this question on the data (i.e.&nbsp;the agreement with the domain expert).</p>
</section>
<section id="how-do-you-ensure-consistency-when-updating-your-llm-model" class="level3">
<h3 class="anchored" data-anchor-id="how-do-you-ensure-consistency-when-updating-your-llm-model">How do you ensure consistency when updating your LLM model?</h3>
<p>You have to go through the process again and measure the results.</p>
</section>
<section id="how-do-you-phase-out-human-in-the-loop-to-scale-this" class="level3">
<h3 class="anchored" data-anchor-id="how-do-you-phase-out-human-in-the-loop-to-scale-this">How do you phase out human in the loop to scale this?</h3>
<p>You don’t need a domain expert to grade every single example. You just need a representative sample. I don’t think you can eliminate humans completely, because the LLM still needs to be aligned to something, and that something is usually a human. As your evaluation system gets better, it naturally reduces the amount of human effort required.</p>
</section>
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<p>These are some of the resources I recommend to learn more on this topic:</p>
<ul>
<li><a href="https://hamel.dev/evals">Your AI Product Needs Evals</a>: This blog post is the predecessor to this one, and provides a high-level overview of evals for LLM-based products.</li>
<li><a href="https://arxiv.org/abs/2404.12272">Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences</a>: This paper by Shreya Shankar et al. provides a good overview of the challenges of evaluating LLMs, and the importance of following a good process.</li>
<li><a href="https://aligneval.com/">Align Eval</a>: Eugene Yan’s new tool that helps you build LLM judges by following a good process. Also read his accompanying <a href="https://eugeneyan.com/writing/aligneval/">blog post</a>.</li>
<li><a href="https://eugeneyan.com/writing/llm-evaluators/">Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)</a>: This is a great survey of different use-cases and approaches for LLM judges, also written by Eugene Yan.</li>
<li><a href="https://www.databricks.com/blog/enhancing-llm-as-a-judge-with-grading-notes">Enhancing LLM-As-A-Judge with Grading Notes</a> by Yi Liu et al.&nbsp;Describes an approach very similar to the one in this blog post, and provides another point of view regarding the utility of writing critiques (they call them grading notes).</li>
<li><a href="https://cookbook.openai.com/examples/custom-llm-as-a-judge">Custom LLM as a Judge to Detect Hallucinations with Braintrust</a> by Ankur Goyal and Shaymal Anadkt provides an end-to-end example of building an LLM judge; for the use case highlighted, the authors found that a classification approach was more reliable than numeric ratings (consistent with this blog post).</li>
<li><a href="https://arize.com/blog/techniques-for-self-improving-llm-evals/">Techniques for Self-Improving LLM Evals</a> by Eric Xiao from Arize shows a nice approach to building LLM Evals with some additional tools that are worth checking out.</li>
<li><a href="https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/">How Dosu Used LangSmith to Achieve a 30% Accuracy Improvement with No Prompt Engineering</a> by Langchain shows a nice approach to building LLM prompts with dynamic examples. The idea is simple, but effective. I’ve been adapting it for my own use cases, including LLM judges. Here is a <a href="https://www.youtube.com/watch?v=tHZtq_pJSGo">video walkthrough</a> of the approach.</li>
<li><a href="https://applied-llms.org/">What We’ve Learned From A Year of Building with LLMs</a>: is a great overview of many practical aspects of building with LLMs, with an emphasis on the importance of evaluation.</li>
</ul>
</section>
<section id="stay-connected" class="level2">
<h2 class="anchored" data-anchor-id="stay-connected">Stay Connected</h2>
<p>I’m continuously learning about LLMs, and enjoy sharing my findings. If you’re interested in this journey, consider subscribing.</p>
<p>What to expect:</p>
<ul>
<li>Occasional emails with my latest insights on LLMs</li>
<li>Early access to new content</li>
<li>No spam, just honest thoughts and discoveries</li>
</ul>


</section>

 ]]></description>
  <category>LLMs</category>
  <category>evals</category>
  <guid>https://hamel.dev/blog/posts/llm-judge/</guid>
  <pubDate>Tue, 29 Oct 2024 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/llm-judge/images/cover_img.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Concurrency Foundations For FastHTML</title>
  <link>https://hamel.dev/</link>
  <description>Concurrency fundamentals for FastHTML
</description>
  <category>FastHTML</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Thu, 10 Oct 2024 07:00:00 GMT</pubDate>
</item>
<item>
  <title>An Open Course on LLMs, Led by Practitioners</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/course/</link>
  <description><![CDATA[ 





<p>Today, we are releasing <a href="https://parlance-labs.com/education/">Mastering LLMs</a>, a set of workshops and talks from practitioners on topics like evals, retrieval-augmented-generation (RAG), fine-tuning and more. This course is unique because it is:</p>
<ul>
<li>Taught by 25+ industry veterans who are experts in information retrieval, machine learning, recommendation systems, MLOps and data science. We discuss how this prior art can be applied to LLMs to give you a meaningful advantage.</li>
<li>Focused on applied topics that are relevant to people building AI products.</li>
<li><ins>
<strong>Free and open to everyone</strong>
</ins>
.</li>
</ul>
<p>We have organized and annotated the talks from our popular paid course.<sup>1</sup> This is a survey course for technical ICs (including engineers and data scientists) who have some experience with LLMs and need guidance on how to improve AI products.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://parlance-labs.com/education/" target="_blank"><img src="https://hamel.dev/blog/posts/course/course.png" class="img-fluid figure-img"></a></p>
<figcaption><em>Speakers include Jeremy Howard, Sophia Yang, Simon Willison, JJ Allaire, Wing Lian, Mark Saroufim, Jane Xu, Jason Liu, Emmanuel Ameisen, Hailey Schoelkopf, Johno Whitaker, Zach Mueller, John Berryman, Ben Clavié, Abhishek Thakur, Kyle Corbitt, Ankur Goyal, Freddy Boulton, Jo Bergum, Eugene Yan, Shreya Shankar, Charles Frye, Hamel Husain, Dan Becker and more</em></figcaption>
</figure>
</div>
<section id="getting-the-most-value-from-the-course" class="level2">
<h2 class="anchored" data-anchor-id="getting-the-most-value-from-the-course">Getting The Most Value From The Course</h2>
<section id="prerequisites" class="level3">
<h3 class="anchored" data-anchor-id="prerequisites">Prerequisites</h3>
<p>The course assumes basic familiarity with LLMs. If you do not have any experience, we recommend watching <a href="https://www.youtube.com/watch?v=jkrNMKz9pWU">A Hacker’s Guide to LLMs</a>. We also recommend the tutorial <a href="https://www.philschmid.de/instruction-tune-llama-2">Instruction Tuning llama2</a> if you are interested in fine-tuning<sup>2</sup>.</p>
</section>
<section id="navigating-the-material" class="level3">
<h3 class="anchored" data-anchor-id="navigating-the-material">Navigating The Material</h3>
<p>The course has over 40 hours of content. To help you navigate this, we provide:</p>
<ul>
<li><strong>Organization by subject area</strong>: evals, RAG, fine-tuning, building applications and prompt engineering.</li>
<li><strong>Chapter summaries:</strong> quickly peruse topics in each talk and skip ahead</li>
<li><strong>Notes, slides, and resources</strong>: these are resources used in the talk, as well as resources to learn more. Many times we have detailed notes as well!</li>
</ul>
<p>To get started, <a href="https://parlance-labs.com/education">navigate to this page</a> and explore topics that interest you. Feel free to skip sections that aren’t relevant to you. We’ve organized the talks within each subject to enhance your learning experience. Be sure to review the chapter summaries, notes, and resources, which are designed to help you focus on the most relevant content and dive deeper when needed. This is a survey course, which means we focus on introducing topics rather than diving deeply into code. To solidify your understanding, we recommend applying what you learn to a personal project.</p>
</section>
<section id="what-students-are-saying" class="level3">
<h3 class="anchored">What Students Are Saying</h3>
<p>Here are some testimonials from students who have taken the course<sup>3</sup>:</p>
<div class="testimonial-section">
<div class="content-block grid gap-4">
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/sanyam.jpeg" class="img-fluid"></p>
<section id="sanyam-bhutani-partner-engineer-meta" class="level2">
<h2 class="anchored" data-anchor-id="sanyam-bhutani-partner-engineer-meta"><em>Sanyam Bhutani, Partner Engineer @ Meta</em></h2>
<section id="there-was-a-magical-time-in-2017-when-fastai-changed-the-deep-learning-world.-this-course-does-the-same-by-extending-very-applied-knowledge-to-llms-best-in-class-teachers-teach-you-their-knowledge-with-no-fluff" class="level3">
<h3 class="anchored" data-anchor-id="there-was-a-magical-time-in-2017-when-fastai-changed-the-deep-learning-world.-this-course-does-the-same-by-extending-very-applied-knowledge-to-llms-best-in-class-teachers-teach-you-their-knowledge-with-no-fluff">There was a magical time in 2017 when fastai changed the deep learning world. This course does the same by extending very applied knowledge to LLMs Best in class teachers teach you their knowledge with no fluff</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/laurian.jpeg" class="img-fluid"></p>
<section id="laurian-full-stack-computational-linguist" class="level2">
<h2 class="anchored" data-anchor-id="laurian-full-stack-computational-linguist"><em>Laurian, Full Stack Computational Linguist</em></h2>
<section id="this-course-was-legendary-still-is-and-the-community-on-discord-is-amazing.-ive-been-through-these-lessons-twice-and-i-have-to-do-it-again-as-there-are-so-many-nuances-you-will-get-once-you-actually-have-those-problems-on-your-own-deployment." class="level3">
<h3 class="anchored" data-anchor-id="this-course-was-legendary-still-is-and-the-community-on-discord-is-amazing.-ive-been-through-these-lessons-twice-and-i-have-to-do-it-again-as-there-are-so-many-nuances-you-will-get-once-you-actually-have-those-problems-on-your-own-deployment.">This course was legendary, still is, and the community on Discord is amazing. I’ve been through these lessons twice and I have to do it again as there are so many nuances you will get once you actually have those problems on your own deployment!</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/andre.png" class="img-fluid"></p>
<section id="andre-cto" class="level2">
<h2 class="anchored" data-anchor-id="andre-cto"><em>Andre, CTO</em></h2>
<section id="amazing-an-opinionated-view-of-llms-from-tools-to-fine-tuning.-excellent-speakers-giving-some-of-the-best-lectures-and-advice-out-there-a-lot-of-real-life-experiences-and-tips-you-cant-find-anywhere-on-the-web-packed-into-this-amazing-courseworkshopconference-thanks-dan-and-hamel-for-making-this-happen" class="level3">
<h3 class="anchored" data-anchor-id="amazing-an-opinionated-view-of-llms-from-tools-to-fine-tuning.-excellent-speakers-giving-some-of-the-best-lectures-and-advice-out-there-a-lot-of-real-life-experiences-and-tips-you-cant-find-anywhere-on-the-web-packed-into-this-amazing-courseworkshopconference-thanks-dan-and-hamel-for-making-this-happen">Amazing! An opinionated view of LLMs, from tools to fine-tuning. Excellent speakers, giving some of the best lectures and advice out there! A lot of real-life experiences and tips you can’t find anywhere on the web packed into this amazing course/workshop/conference! Thanks Dan and Hamel for making this happen!</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/marcus.png" class="img-fluid"></p>
<section id="marcus-software-engineer" class="level2">
<h2 class="anchored" data-anchor-id="marcus-software-engineer"><em>Marcus, Software Engineer</em></h2>
<section id="the-mastering-llms-conference-answered-several-key-questions-i-had-about-when-to-fine-tune-base-models-building-evaluation-suits-and-when-to-use-rag.-the-sessions-provided-a-valuable-overview-of-the-technical-challenges-and-considerations-involved-in-building-and-deploying-custom-llms." class="level3">
<h3 class="anchored" data-anchor-id="the-mastering-llms-conference-answered-several-key-questions-i-had-about-when-to-fine-tune-base-models-building-evaluation-suits-and-when-to-use-rag.-the-sessions-provided-a-valuable-overview-of-the-technical-challenges-and-considerations-involved-in-building-and-deploying-custom-llms.">The Mastering LLMs conference answered several key questions I had about when to fine-tune base models, building evaluation suits and when to use RAG. The sessions provided a valuable overview of the technical challenges and considerations involved in building and deploying custom LLMs.</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/ali.png" class="img-fluid"></p>
<section id="ali-principal-founder-scty" class="level2">
<h2 class="anchored" data-anchor-id="ali-principal-founder-scty"><em>Ali, Principal &amp; Founder, SCTY</em></h2>
<section id="the-course-that-became-a-conference-filled-with-a-lineup-of-renowned-practitioners-whose-expertise-and-contributions-to-the-field-was-only-exceeded-by-their-generosity-of-spirit." class="level3">
<h3 class="anchored" data-anchor-id="the-course-that-became-a-conference-filled-with-a-lineup-of-renowned-practitioners-whose-expertise-and-contributions-to-the-field-was-only-exceeded-by-their-generosity-of-spirit.">The course that became a conference, filled with a lineup of renowned practitioners whose expertise (and contributions to the field) was only exceeded by their generosity of spirit.</h3>
</section>
</section>
</div>
<div class="testimonial g-col-12 g-col-md-6">
<p><img src="https://hamel.dev/blog/posts/course/lukas.png" class="img-fluid"></p>
<section id="lukas-software-engineer" class="level2">
<h2 class="anchored" data-anchor-id="lukas-software-engineer"><em>Lukas, Software Engineer</em></h2>
<section id="the-sheer-amount-of-diverse-speakers-that-cover-the-same-topics-from-different-approaches-both-praising-andor-degrading-certain-workflows-makes-this-extremely-valuable.-especially-when-a-lot-of-information-online-is-produced-by-those-who-are-building-a-commercial-product-behind-naturally-is-biased-towards-a-fine-tune-a-rag-an-open-source-llm-an-open-ai-llm-etc.-it-is-rather-extra-ordinary-to-have-a-variety-of-opinions-packed-like-this.-thank-you" class="level3">
<h3 class="anchored" data-anchor-id="the-sheer-amount-of-diverse-speakers-that-cover-the-same-topics-from-different-approaches-both-praising-andor-degrading-certain-workflows-makes-this-extremely-valuable.-especially-when-a-lot-of-information-online-is-produced-by-those-who-are-building-a-commercial-product-behind-naturally-is-biased-towards-a-fine-tune-a-rag-an-open-source-llm-an-open-ai-llm-etc.-it-is-rather-extra-ordinary-to-have-a-variety-of-opinions-packed-like-this.-thank-you">The sheer amount of diverse speakers that cover the same topics from different approaches, both praising and/or degrading certain workflows makes this extremely valuable. Especially when a lot of information online, is produced by those, who are building a commercial product behind, naturally is biased towards a fine tune, a RAG, an open source LLM, an open ai LLM etc. It is rather extra ordinary to have a variety of opinions packed like this. Thank you!</h3>
</section>
</section>
</div>
</div>
<p><br></p>
<center>
<a href="https://parlance-labs.com/education" target="_blank">Course Website</a>
</center>
</div>
</section>
</section>
<section id="stay-connected" class="level2">
<h2 class="anchored" data-anchor-id="stay-connected">Stay Connected</h2>
<p>I’m continuously learning about LLMs, and enjoy sharing my findings and thoughts. If you’re interested in this journey, consider subscribing.</p>
<p>What to expect:</p>
<ul>
<li>Occasional emails with my latest insights on LLMs</li>
<li>Early access to new content</li>
<li>No spam, just honest thoughts and discoveries</li>
</ul>
<script async="" data-uid="6379a28bdb" src="https://hamel.ck.page/6379a28bdb/index.js"></script>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>https://maven.com/parlance-labs/fine-tuning. We had more than 2,000 students in our first cohort. The students who paid for the original course had early access to the material, office hours, generous compute credits, and a lively Discord community.↩︎</p></li>
<li id="fn2"><p>We find that instruction tuning a model to be a very useful educational experience even if you never intend to fine-tune, because it familiarizes you with topics such as (1) working with open weights models (2) generating synthetic data (3) managing prompts (4) fine-tuning (5) and generating predictions.↩︎</p></li>
<li id="fn3"><p>These testimonials are taken from https://maven.com/parlance-labs/fine-tuning.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>llms</category>
  <category>courses</category>
  <guid>https://hamel.dev/blog/posts/course/</guid>
  <pubDate>Mon, 29 Jul 2024 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/course/course.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>What We&#39;ve Learned From A Year of Building with LLMs</title>
  <link>https://hamel.dev/</link>
  <description>A practical guide to building successful LLM products.
</description>
  <category>LLMs</category>
  <guid>https://hamel.dev/</guid>
  <pubDate>Sat, 01 Jun 2024 07:00:00 GMT</pubDate>
</item>
<item>
  <title>Debugging AI With Adversarial Validation</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/drift/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>For years, I’ve relied on a straightforward method to identify sudden changes in model inputs or training data, known as “drift.” This method, Adversarial Validation<sup>1</sup>, is both simple and effective. The best part? <strong>It requires no complex tools or infrastructure.</strong></p>
<p>Examples where drift can cause bugs in your AI:</p>
<ol type="1">
<li>Your data for <a href="../../../blog/posts/evals/index.html">evaluations</a> are materially different from the inputs your model receives in production, causing your evaluations to be misleading.</li>
<li>Updates to prompts, functions, RAG, and similar elements aren’t incorporated into your fine-tuning or training data, leading to unexpected model behavior in production.</li>
</ol>
<p>No matter how careful you are, bugs can still slip through the cracks. A high-ROI activity is to routinely audit all your AI/ML projects for drift.</p>
<section id="how-it-works" class="level2">
<h2 class="anchored" data-anchor-id="how-it-works">How It Works</h2>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Uncool Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>This method is so simple that it might seem uncool. You aren’t going to impress any data scientists. Despite this, it’s too valuable to ignore.</p>
</div>
</div>
<p>This slide from <a href="https://www.youtube.com/watch?v=GHk5HMW4XMA">my talk on MLOps tools</a> explains the technique behind Adversarial Validation<sup>2</sup>:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/drift/images/drift_tfx.png" class="img-fluid figure-img"></p>
<figcaption>Slide</figcaption>
</figure>
</div>
<p>The process is as follows:</p>
<ol type="1">
<li>Collect two datasets to compare. For example:
<ul>
<li>Training data from two different fine-tuning runs</li>
<li>Training data vs.&nbsp;evaluation data</li>
<li>Training data vs.&nbsp;production data (organized into the same format)</li>
<li>Data from two different time-periods</li>
</ul></li>
<li>Create features from the dataset. A basic example that creates features from <a href="https://platform.openai.com/tokenizer">tokens</a> is illustrated <a href="https://github.com/hamelsmu/ft-drift/blob/main/ft_drift/model.py">here</a>.<sup>3</sup></li>
<li>Give dataset #1 a label of 0 and dataset #2 a label of 1.</li>
<li>Fit a binary classifier (random forest, logistic regression, etc) to discriminate between the two datasets.</li>
<li>If the classifier demonstrates sufficient predictive power (ex: AUC &gt;=0.60), we know there is drift.</li>
<li>If you used an interpretable model (like logistic regression, random forest, etc.), you can inspect feature importance metrics to understand the root cause of the drift. If you use a more complex model (like a neural network), you can use <a href="https://shap.readthedocs.io/en/latest/">SHAP values</a> or other methods to understand what is causing the drift. I recommend starting with a simple interpretable model.</li>
</ol>
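<p>To make the process above concrete, here is a minimal sketch of adversarial validation in Python. It is illustrative rather than the post’s own code: the token-count features and logistic regression are stand-ins for whatever features and classifier you choose.</p>

```python
# Minimal adversarial-validation sketch (illustrative; the feature and
# model choices are placeholders for your own).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def drift_auc(dataset_a, dataset_b):
    """Label dataset_a as 0 and dataset_b as 1, then measure how well a
    classifier can tell them apart. AUC near 0.5 = indistinguishable;
    AUC >= 0.60 suggests drift worth investigating."""
    texts = list(dataset_a) + list(dataset_b)
    y = np.array([0] * len(dataset_a) + [1] * len(dataset_b))
    X = CountVectorizer().fit_transform(texts)  # token-count features
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
```

<p>Because logistic regression is interpretable, you can then inspect the largest coefficients to see which tokens drive the separation between the two datasets.</p>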
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>If this process doesn’t detect drift, it doesn’t mean there isn’t drift. It just means that we couldn’t detect it with the model and features we used.</p>
</div>
</div>
</section>
<section id="minimal-example-ft_drift" class="level2">
<h2 class="anchored" data-anchor-id="minimal-example-ft_drift">Minimal Example: <code>ft_drift</code></h2>
<p>I work with lots of folks who are fine-tuning models using the OpenAI API. I’ve created a small CLI tool, <a href="https://github.com/hamelsmu/ft-drift/"><code>ft_drift</code></a>, that detects drift between two <a href="https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset">multi-turn chat formatted</a> jsonl files. Currently, <code>ft_drift</code> only detects drift in prompt templates, schemas and other token-based drift (as opposed to semantic drift). However, this is a good starting point to understand the general concept of adversarial validation. Here is a demo of this tool at work:</p>
<p><img src="https://hamel.dev/blog/posts/drift/images/drift_cli.gif" class="img-fluid"></p>
<p>The demo comes from a real-world example where an unintentional change in a prompt template caused unexpected behavior in a model. It shows the tool detecting a difference between two datasets, <code>file_a.jsonl</code> and <code>file_b.jsonl</code>. Afterward, a table of important tokens that account for the drift is shown, such as <code>END-UI-FORMAT</code>, <code>UI-FORMAT</code>, etc. We were able to apply the tool and quickly find the root cause of the issue. The modeling code is embarrassingly simple and located at <a href="https://github.com/hamelsmu/ft-drift/blob/main/ft_drift/model.py"><code>ft_drift/model.py</code></a>. The point is that you don’t need sophisticated techniques to get started. You can take this approach further by adding embeddings to your features to also detect semantic drift. Similarly, you could add hand-crafted features such as the number of conversation turns or the length of messages.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>I first learned of this technique in 2016 from <a href="https://fastml.com/adversarial-validation-part-two/">this blog post</a> by Zygmunt Zając. Throughout the years, I’ve seen this technique used in a variety of contexts, sometimes with different names.↩︎</p></li>
<li id="fn2"><p>This slide uses the word “skew” which is interchangeable with “drift” in this context.↩︎</p></li>
<li id="fn3"><p>For classic ML, if you are already training a model on this data, you likely have a feature engineering pipeline that you can reuse.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <guid>https://hamel.dev/blog/posts/drift/</guid>
  <pubDate>Fri, 12 Apr 2024 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/drift/images/drift_cover.png" medium="image" type="image/png" height="75" width="144"/>
</item>
<item>
  <title>Your AI Product Needs Evals</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/evals/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<section id="motivation" class="level2">
<h2 class="anchored" data-anchor-id="motivation">Motivation</h2>
<p>I started working with language models five years ago when I led the team that created <a href="https://github.com/github/CodeSearchNet">CodeSearchNet</a>, a precursor to GitHub Copilot. Since then, I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: <strong>a failure to create robust evaluation systems.</strong></p>
<p>I’m currently an independent consultant who helps companies build domain-specific AI products. I hope companies can save thousands of dollars in consulting fees by reading this post carefully. As much as I love making money, I hate seeing folks make the same mistake repeatedly.</p>
<p>This post outlines my thoughts on building evaluation systems for LLM-powered AI products.</p>
</section>
<section id="iterating-quickly-success" class="level1">
<h1>Iterating Quickly == Success</h1>
<p>Like software engineering, success with AI hinges on how fast you can iterate. You must have processes and tools for:</p>
<ol type="1">
<li>Evaluating quality (ex: tests).</li>
<li>Debugging issues (ex: logging &amp; inspecting data).</li>
<li>Changing the behavior or the system (prompt eng, fine-tuning, writing code)</li>
</ol>
<p><strong>Many people focus exclusively on #3 above, which prevents them from improving their LLM products beyond a demo.</strong><sup>1</sup> Doing all three activities well creates a virtuous cycle differentiating great from mediocre AI products (see the diagram below for a visualization of this cycle).</p>
<p>If you streamline your evaluation process, all other activities become easy. This is very similar to how tests in software engineering pay massive dividends in the long term despite requiring up-front investment.</p>
<p>To ground this post in a real-world situation, I’ll walk through a case study in which we built a system for rapid improvement. I’ll primarily focus on evaluation as that is the most critical component.</p>
</section>
<section id="case-study-lucy-a-real-estate-ai-assistant" class="level1">
<h1>Case Study: Lucy, A Real Estate AI Assistant</h1>
<p><a href="https://Rechat.com/">Rechat</a> is a SaaS application that allows real estate professionals to perform various tasks, such as managing contracts, searching for listings, building creative assets, managing appointments, and more. The thesis of Rechat is that you can do everything in one place rather than having to context switch between many different tools.</p>
<p><a href="https://Rechat.com/ai/">Rechat’s AI assistant, Lucy</a>, is a canonical AI product: a conversational interface that obviates the need to click, type, and navigate the software. During Lucy’s beginning stages, rapid progress was made with prompt engineering. However, as Lucy’s surface area expanded, the performance of the AI plateaued. Symptoms of this were:</p>
<ol type="1">
<li>Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole.</li>
<li>There was limited visibility into the AI system’s effectiveness across tasks beyond vibe checks.</li>
<li>Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples.</li>
</ol>
<section id="problem-how-to-systematically-improve-the-ai" class="level2">
<h2 class="anchored" data-anchor-id="problem-how-to-systematically-improve-the-ai">Problem: How To Systematically Improve The AI?</h2>
<p>To break through this plateau, we created a systematic approach to improving Lucy <strong>centered on evaluation.</strong> Our approach is illustrated by the diagram below.</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/diagram-cover.png" class="img-fluid"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>This diagram is a best-faith effort to illustrate my mental model for improving AI systems. In reality, the process is non-linear and can take on many different forms that may or may not look like this diagram.</p>
</div>
</div>
<p>I discuss the various components of this system in the context of evaluation below.</p>
</section>
</section>
<section id="the-types-of-evaluation" class="level1">
<h1>The Types Of Evaluation</h1>
<p>Rigorous and systematic evaluation is the most important part of the whole system. That is why “Eval and Curation” is highlighted in yellow at the center of the diagram. You should spend most of your time making your evaluation more robust and streamlined.</p>
<p>There are three levels of evaluation to consider:</p>
<ul>
<li>Level 1: Unit Tests</li>
<li>Level 2: Model &amp; Human Eval (this includes debugging)</li>
<li>Level 3: A/B testing</li>
</ul>
<p>The cost of Level 3 &gt; Level 2 &gt; Level 1. This dictates the cadence and manner you execute them. For example, I often run Level 1 evals on every code change, Level 2 on a set cadence and Level 3 only after significant product changes. It’s also helpful to conquer a good portion of your Level 1 tests before you move into model-based tests, as they require more work and time to execute.</p>
<p>There isn’t a strict formula as to when to introduce each level of testing. You want to balance getting user feedback quickly, managing user perception, and the goals of your AI product. This isn’t too dissimilar from the balancing act you must do for products more generally.</p>
<section id="level-1-unit-tests" class="level2">
<h2 class="anchored" data-anchor-id="level-1-unit-tests">Level 1: Unit Tests</h2>
<p>Unit tests for LLMs are assertions (like you would write in <a href="https://docs.pytest.org/en/8.0.x/">pytest</a>). Unlike typical unit tests, you want to organize these assertions for use in places beyond unit tests, such as data cleaning and automatic retries (using the assertion error to course-correct) during model inference. The important part is that these assertions should run fast and cheaply as you develop your application so that you can run them every time your code changes. If you have trouble thinking of assertions, you should critically examine your traces and failure modes. Also, do not shy away from using an LLM to help you brainstorm assertions!</p>
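<p>The “automatic retries” idea can be sketched in a few lines. This is a hedged illustration, not any particular production implementation: <code>llm_call</code> and the check functions are assumed names, and the failed assertion’s message is fed back into the prompt so the model can course-correct.</p>

```python
# Sketch: reuse eval assertions as inference-time guardrails with retries.
import re

def no_exposed_uuid(text):
    """Example assertion: the output must not leak a raw UUID."""
    pattern = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    assert not re.search(pattern, text, re.IGNORECASE), "Exposed UUID in output"

def generate_with_retries(prompt, llm_call, checks, max_tries=3):
    """Run every check on the model output; on failure, append the
    assertion error to the prompt and try again."""
    for _ in range(max_tries):
        output = llm_call(prompt)
        try:
            for check in checks:
                check(output)
            return output
        except AssertionError as err:
            prompt = f"{prompt}\n\nThe previous attempt failed: {err}. Please fix."
    raise RuntimeError("exhausted retries without passing all checks")
```

<p>The same assertion functions can also run inside pytest and over logged traces for data cleaning, which is the payoff of keeping them reusable.</p>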
<section id="step-1-write-scoped-tests" class="level3">
<h3 class="anchored" data-anchor-id="step-1-write-scoped-tests">Step 1: Write Scoped Tests</h3>
<p>The most effective way to think about unit tests is to break down the scope of your LLM into features and scenarios. For example, one feature of Lucy is the ability to find real estate listings, which we can break down into scenarios like so:</p>
<p><strong>Feature: Listing Finder</strong></p>
<p>This feature to be tested is a function call that responds to a user request to find a real estate listing. For example, “Please find listings with more than 3 bedrooms less than $2M in San Jose, CA”</p>
<p>The LLM converts this into a query that gets run against the CRM. The assertion then verifies that the expected number of results is returned. In our test suite, we have three user inputs that trigger each of the scenarios below, which then execute corresponding assertions (this is an oversimplified example for illustrative purposes):</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Scenario</th>
<th>Assertions</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Only one listing matches user query</td>
<td>len(listing_array) == 1</td>
</tr>
<tr class="even">
<td>Multiple listings match user query</td>
<td>len(listing_array) &gt; 1</td>
</tr>
<tr class="odd">
<td>No listings match user query</td>
<td>len(listing_array) == 0</td>
</tr>
</tbody>
</table>
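<p>The scenario table above can be sketched as parametrized pytest tests. Everything here is hypothetical: <code>find_listings</code> is a stub standing in for the real pipeline that sends the user input to the LLM and runs the resulting CRM query.</p>

```python
# Hypothetical sketch of the scenario table as scoped pytest assertions.
import pytest

def find_listings(query: str) -> list:
    """Stub standing in for the real pipeline: send `query` to the LLM,
    run the generated CRM query, and return the matching listings."""
    fake_crm = {"one": [{"id": 1}], "many": [{"id": 1}, {"id": 2}], "none": []}
    key = "one" if "Main St" in query else "many" if "bedrooms" in query else "none"
    return fake_crm[key]

@pytest.mark.parametrize("query,check", [
    ("Find the listing at 123 Main St", lambda n: n == 1),        # one match
    ("Listings with 3+ bedrooms in San Jose", lambda n: n > 1),   # many matches
    ("Find castles under $100 in Antarctica", lambda n: n == 0),  # no matches
])
def test_listing_finder(query, check):
    assert check(len(find_listings(query)))
```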
<p><br> There are also generic tests that aren’t specific to any one feature. For example, here is the code for one such generic test that ensures the UUID is not mentioned in the output:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> noExposedUUID <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> message <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">=&gt;</span> {</span>
<span id="cb1-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Remove all text within double curly braces</span></span>
<span id="cb1-3">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> sanitizedComment <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> message<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">comment</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replace</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\{\{</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">.</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*?\}\}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/g</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>)</span>
<span id="cb1-4"></span>
<span id="cb1-5">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Search for exposed UUIDs</span></span>
<span id="cb1-6">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> regexp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{8}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{4}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{4}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{4}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">-</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[0-9a-f]{12}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/ig</span></span>
<span id="cb1-7">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">const</span> matches <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">Array</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">from</span>(sanitizedComment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matchAll</span>(regexp))</span>
<span id="cb1-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">expect</span>(matches<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Exposed UUIDs'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">equal</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Exposed UUIDs found'</span>)</span>
<span id="cb1-9">}</span></code></pre></div></div>
<div class="acknowledgments">
<p>CRM results returned to the LLM contain fields that shouldn’t be surfaced to the user, such as the UUID associated with an entry. Our prompt instructs the LLM not to include UUIDs, and we use a simple regex to assert that its responses don’t contain any.</p>
</div>
<p><strong>Rechat has hundreds of these unit tests. We continuously update them based on new failures we observe in the data as users challenge the AI or the product evolves.</strong> These unit tests are crucial to getting feedback quickly when iterating on your AI system (prompt engineering, improving RAG, etc.). Many people eventually outgrow their unit tests and move on to other levels of evaluation as their product matures, but it is essential not to skip this step!</p>
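<p>The UUID assertion above can be ported to other stacks in a few lines. Below is a rough Python sketch of the same check, not Rechat’s actual harness; the regex and the template-placeholder sanitization mirror the JavaScript test shown above:</p>

```python
import re

# Matches any standard UUID, e.g. "123e4567-e89b-12d3-a456-426614174000".
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def assert_no_uuids(response: str) -> None:
    """Fail if the LLM response leaked any internal UUIDs."""
    # Strip unrendered template placeholders like {{contact.id}} first,
    # mirroring the sanitization step in the JavaScript test above.
    sanitized = re.sub(r"\{\{.*?\}\}", "", response)
    matches = UUID_RE.findall(sanitized)
    assert not matches, f"Exposed UUIDs found: {matches}"
```

<p>Assertions like this run in milliseconds, which is what makes it cheap to accumulate hundreds of them.</p>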
</section>
<section id="step-2-create-test-cases" class="level3">
<h3 class="anchored" data-anchor-id="step-2-create-test-cases">Step 2: Create Test Cases</h3>
<p>To test these assertions, you must generate test cases or inputs that trigger all the scenarios you wish to test. I often use an LLM to generate these inputs synthetically; for example, here is one such prompt Rechat uses to generate synthetic inputs for a feature that creates and retrieves contacts:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode md code-overflow-wrap code-with-copy"><code class="sourceCode markdown"><span id="cb2-1">Write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. The contact details can include name, phone, email, partner name, birthday, tags, company, address and job.</span>
<span id="cb2-2"></span>
<span id="cb2-3">For each of the instructions, you need to generate a second instruction which can be used to look up the created contact.</span>
<span id="cb2-4"></span>
<span id="cb2-5">The results should be a JSON code block of instruction pairs, like the following:</span>
<span id="cb2-6"></span>
<span id="cb2-7"></span>
<span id="cb2-8">[</span>
<span id="cb2-9">  ["Create a contact for John (johndoe@apple.com)", </span>
<span id="cb2-10">  "What's the email address of John Smith?"]</span>
<span id="cb2-11">]</span></code></pre></div></div>
<p>Using the above prompt, we generate test cases like below:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-overflow-wrap code-with-copy"><code class="sourceCode python"><span id="cb3-1">[ </span>
<span id="cb3-2">    [</span>
<span id="cb3-3">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Create a contact for John Smith (johndoe@apple.com) with phone number 123-456-7890 and address 123 Apple St.'</span>, </span>
<span id="cb3-4">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'What</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s the email address of John Smith?'</span></span>
<span id="cb3-5">    ],</span>
<span id="cb3-6">    [</span>
<span id="cb3-7">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Add Emily Johnson with phone 987-654-3210, email emilyj@email.com, and company ABC Inc.'</span>, </span>
<span id="cb3-8">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'What</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s the phone number for Emily Johnson?'</span></span>
<span id="cb3-9">    ],</span>
<span id="cb3-10">    [</span>
<span id="cb3-11">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Create a contact for Tom Williams with birthday 10/20/1985, company XYZ Ltd, and job title Manager.'</span>, </span>
<span id="cb3-12">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'What</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s Tom Williams</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> job title?'</span></span>
<span id="cb3-13">    ],</span>
<span id="cb3-14">    [</span>
<span id="cb3-15">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Add a contact for Susan Brown with partner name James Brown, and email susanb@email.com.'</span>, </span>
<span id="cb3-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'What</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s the partner name of Susan Brown?'</span></span>
<span id="cb3-17">    ],</span>
<span id="cb3-18">…</span>
<span id="cb3-19">]</span></code></pre></div></div>
<p>For each of these test cases, we execute the first user input to create the contact, then execute the second query to fetch it. If the CRM doesn’t return exactly one result, we know there was a problem either creating or fetching the contact. We can also run generic assertions, like the one that verifies UUIDs are absent from the response. You must constantly update these tests as you observe data through human evaluation and debugging. The key is to make them as challenging as possible while still representing real user interactions with the system.</p>
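<p>A minimal harness for these paired test cases might look like the sketch below. <code>run_assistant</code> and <code>search_crm</code> are hypothetical stand-ins for your agent entry point and CRM search API, not Rechat’s actual code:</p>

```python
def run_paired_case(create_instruction, lookup_instruction, run_assistant, search_crm):
    """Execute one [create, lookup] test pair and report whether it passed.

    `run_assistant` and `search_crm` are hypothetical stand-ins for the
    agent entry point and the CRM search API.
    """
    run_assistant(create_instruction)   # e.g. "Create a contact for John Smith ..."
    run_assistant(lookup_instruction)   # e.g. "What's the email address of John Smith?"
    results = search_crm(lookup_instruction)
    # Exactly one match means the contact was both created and found.
    if len(results) != 1:
        return False, f"expected exactly 1 CRM match, got {len(results)}"
    return True, "ok"
```

<p>Generic assertions (like the UUID check) can then be layered on top of the assistant’s responses inside the same loop.</p>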
<p>You don’t need to wait for production data to test your system. You can make educated guesses about how users will use your product and generate synthetic data. You can also let a small set of users use your product and let their usage refine your synthetic data generation strategy. One signal that you are writing good tests and assertions is when the model struggles to pass them: these failure modes become problems you can solve later with techniques like fine-tuning.</p>
<p>On a related note, unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.</p>
</section>
<section id="step-3-run-track-your-tests-regularly" class="level3">
<h3 class="anchored" data-anchor-id="step-3-run-track-your-tests-regularly">Step 3: Run &amp; Track Your Tests Regularly</h3>
<p>There are many ways to orchestrate Level 1 tests. Rechat has been leveraging CI infrastructure (e.g., GitHub Actions, GitLab Pipelines, etc.) to execute these tests. However, the tooling for this part of the workflow is nascent and evolving rapidly.</p>
<p>My advice is to orchestrate tests in whatever way involves the least friction in your tech stack. In addition to running tests, you need to track their results over time so you can see whether you are making progress. If you use CI, collect metrics along with versions of your tests/prompts outside your CI system for easy analysis and tracking.</p>
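<p>One low-friction way to track results outside CI is to append one row per eval run to a flat file your analytics tool can read. The schema below is illustrative, not prescriptive:</p>

```python
import csv
import datetime
import pathlib

def record_run(results: list[bool], prompt_version: str, path="eval_history.csv"):
    """Append one row per eval run so pass rates can be charted over time."""
    pass_rate = sum(results) / len(results)
    file = pathlib.Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:  # write the header once, on first run
            writer.writerow(["timestamp", "prompt_version", "n_cases", "pass_rate"])
        writer.writerow([
            datetime.datetime.now(datetime.timezone.utc).isoformat(),
            prompt_version,
            len(results),
            round(pass_rate, 3),
        ])
    return pass_rate
```

<p>Recording the prompt version alongside the pass rate is what lets you attribute a regression to a specific change later.</p>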
<p>I recommend starting simple and leveraging your existing analytics system to visualize your test results. For example, Rechat uses Metabase to track their LLM test results over time. Below is a screenshot of a dashboard Rechat built with <a href="https://www.metabase.com/">Metabase</a>:</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/metabase.png" class="img-fluid"></p>
<p>This screenshot shows the prevalence of a particular error (shown in yellow) in Lucy before (left) vs after (right) we addressed it.</p>
</section>
</section>
<section id="level-2-human-model-eval" class="level2">
<h2 class="anchored" data-anchor-id="level-2-human-model-eval">Level 2: Human &amp; Model Eval</h2>
<p>After you have built a solid foundation of Level 1 tests, you can move on to other forms of validation that cannot be tested by assertions alone. A prerequisite to performing human and model-based eval is to log your traces.</p>
<section id="logging-traces" class="level3">
<h3 class="anchored" data-anchor-id="logging-traces">Logging Traces</h3>
<p>A trace is a long-standing concept in software engineering: a log of a sequence of related events, such as a user session or a request’s flow through a distributed system. In other words, tracing is a logical grouping of logs. In the context of LLMs, a trace usually refers to a conversation with the model. For example, a user message, followed by an AI response, followed by another user message, constitutes one trace.</p>
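<p>To make the concept concrete, a trace can be as simple as a list of timestamped events sharing an id. This is a generic illustration, not the data model of any particular tracing product:</p>

```python
import uuid
import datetime

def new_trace(user_id: str) -> dict:
    """A trace groups all events of one conversation under a shared id."""
    return {"trace_id": str(uuid.uuid4()), "user_id": user_id, "events": []}

def log_event(trace: dict, role: str, content: str) -> None:
    trace["events"].append({
        "role": role,  # "user", "assistant", "tool", ...
        "content": content,
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

# A user message, an AI response, and a follow-up all share one trace:
trace = new_trace("agent-42")
log_event(trace, "user", "Create a contact for John Smith")
log_event(trace, "assistant", "Done — contact created.")
log_event(trace, "user", "What's John's email address?")
```
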
<p>There are a growing number of solutions for logging LLM traces.<sup>2</sup> Rechat uses <a href="https://www.langchain.com/langsmith">LangSmith</a>, which logs traces and allows you to view them in a human-readable way, with an interactive playground for iterating on prompts. Sometimes, logging your traces requires you to instrument your code. In this case, Rechat was using <a href="https://www.langchain.com/">LangChain</a>, which automatically logs trace events to LangSmith for you. Here is a screenshot of what this looks like:</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/langsmith.png" class="img-fluid"></p>
<p>I like LangSmith - it doesn’t require that you use LangChain and is intuitive and easy to use. Searching, filtering, and reading traces are essential features for whatever solution you pick. I’ve found that some tools do not implement these basic functions correctly!</p>
</section>
<section id="looking-at-your-traces" class="level3">
<h3 class="anchored" data-anchor-id="looking-at-your-traces">Looking At Your Traces</h3>
<p><strong>You must remove all friction from the process of looking at data.</strong> This means rendering your traces in domain-specific ways. I’ve often found that it’s <a href="https://hamel.dev/notes/llm/finetuning/04_data_cleaning.html">better to build my own data viewing &amp; labeling tool</a> so I can gather all the information I need onto one screen. In Lucy’s case, we needed to look at many sources of information (trace log, the CRM, etc) to understand what the AI did. This is precisely the type of friction that needs to be eliminated. In Rechat’s case, this meant adding information like:</p>
<ol type="1">
<li>What tool (feature) &amp; scenario was being evaluated.</li>
<li>Whether the trace resulted from a synthetic input or a real user input.</li>
<li>Filters to navigate between different tools and scenario combinations.</li>
<li>Links to the CRM and trace logging system for the current record.</li>
</ol>
<p>I’ve built different variations of this tool for each problem I’ve worked on. Sometimes, I even need to embed another application to see what the user interaction looks like. Below is a screenshot of the tool we built to evaluate Rechat’s traces:</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/langfree.png" class="img-fluid"></p>
<p>Another design choice specific to Lucy is that we noticed that many failures involved small mistakes in the final output of the LLM (format, content, etc). We decided to make the final output editable by a human so that we could curate &amp; fix data for fine-tuning.</p>
<p>These tools can be built with lightweight front-end frameworks like Gradio, Streamlit, Panel, or Shiny in less than a day. The tool shown above was built with Shiny for Python. Furthermore, there are tools like <a href="https://www.lilacml.com/">Lilac</a>, which uses AI to search and filter data semantically; this is incredibly handy for finding a set of similar data points while debugging an issue.</p>
<p>I often start by labeling examples as good or bad. I’ve found that assigning scores or more granular ratings is more onerous to manage than binary ratings. There are advanced techniques you can use to make human evaluation more efficient or accurate (e.g., <a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)">active learning</a>, <a href="https://supervisely.com/blog/labeling-consensus/">consensus voting</a>, etc.), but I recommend starting with something simple. Finally, like unit tests, you should organize and analyze your human-eval results to assess if you are progressing over time.</p>
<p>As discussed later, these labeled examples measure the quality of your system, validate automated evaluation, and curate high-quality synthetic data for fine-tuning.</p>
<section id="how-much-data-should-you-look-at" class="level4">
<h4 class="anchored" data-anchor-id="how-much-data-should-you-look-at">How much data should you look at?</h4>
<p>I often get asked how much data to examine. When starting, you should examine as much data as possible. I usually read traces generated from ALL test cases and user-generated traces at a minimum. <strong>You can never stop looking at data—no free lunch exists.</strong> However, you can sample your data more over time, lessening the burden.<sup>3</sup></p>
</section>
</section>
<section id="automated-evaluation-w-llms" class="level3">
<h3 class="anchored" data-anchor-id="automated-evaluation-w-llms">Automated Evaluation w/ LLMs</h3>
<p>Many vendors want to sell you tools that claim to eliminate the need for a human to look at the data. Despite those claims, having humans periodically evaluate at least a sample of traces remains a good idea. I often find that “correctness” is somewhat subjective, and you must align the model with a human.</p>
<p>You should track the correlation between model-based and human evaluation to decide how much you can rely on automatic evaluation. Furthermore, by collecting critiques from labelers explaining why they are making a decision, you can iterate on the evaluator model to align it with humans through prompt engineering or fine-tuning. Of the two, I tend to favor prompt engineering for evaluator alignment.</p>
<p>I love using low-tech solutions like Excel to iterate on aligning model-based eval with humans. For example, I sent my colleague Phillip the following spreadsheet every few days to grade for a different use-case involving a <a href="https://www.honeycomb.io/blog/introducing-query-assistant">natural language query generator</a>. This spreadsheet would contain the following information:</p>
<ol type="1">
<li><strong>model response</strong>: this is the prediction made by the LLM.</li>
<li><strong>model critique</strong>: this is a critique written by a (usually more powerful) LLM about your original LLM’s prediction.</li>
<li><strong>model outcome</strong>: this is a binary label the critique model assigns to the <code>model response</code> as being “good” or “bad.”</li>
</ol>
<p>Phillip then fills out his version of the same information - meaning his critique, outcome, and desired response for 25-50 examples at a time (these are the columns prefixed with “phillip_” below):</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/spreadsheet.png" class="img-fluid"></p>
<p>This information allowed me to iterate on the prompt of the critique model to make it sufficiently aligned with Phillip over time. This is also easy to track in a low-tech way in a spreadsheet:</p>
<p><img src="https://hamel.dev/blog/posts/evals/images/score.png" class="img-fluid"></p>
<p>This is a screenshot of a spreadsheet where we recorded our attempts to align model-based eval with a human evaluator.</p>
<p>General tips on model-based eval:</p>
<ul>
<li>Use the most powerful model you can afford. It often takes advanced reasoning capabilities to critique something well. You can often get away with a slower, more powerful model for critiquing outputs relative to what you use in production.</li>
<li>Model-based evaluation is a meta-problem within your larger problem. You must maintain a mini-evaluation system to track its quality. I have sometimes fine-tuned a model at this stage (but I try not to).</li>
<li>After bringing the model-based evaluator in line with the human, you must continue doing periodic exercises to monitor the model and human agreement.</li>
</ul>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Important Note on Using Agreement as a Metric
</div>
</div>
<div class="callout-body-container callout-body">
<p>In this example, we used agreement between the model and human evaluator because our dataset was roughly balanced (about 50% of instances were failures). However, using raw agreement is generally not recommended and can be misleading when classes are imbalanced. Instead, you should typically measure <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall">precision and recall</a> separately to get a more accurate picture of your judge’s alignment.</p>
</div>
</div>
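<p>To make this concrete, here is a small sketch that computes agreement alongside precision and recall from paired human/judge labels, treating “bad” (a caught failure) as the positive class, since catching failures is the judge’s job:</p>

```python
def judge_alignment(human: list[str], model: list[str]) -> dict:
    """Compare an LLM judge's good/bad labels against human labels.

    "bad" is the positive class: a true positive is a failure that both
    the human and the judge flagged.
    """
    assert len(human) == len(model)
    tp = sum(h == m == "bad" for h, m in zip(human, model))
    fp = sum(h == "good" and m == "bad" for h, m in zip(human, model))
    fn = sum(h == "bad" and m == "good" for h, m in zip(human, model))
    agreement = sum(h == m for h, m in zip(human, model)) / len(human)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many were real failures
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real failures, how many were flagged
    return {"agreement": agreement, "precision": precision, "recall": recall}
```

<p>With imbalanced data, a judge that labels everything “good” can score high agreement while having zero recall, which is exactly the failure mode the callout warns about.</p>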
<p>My favorite aspect about creating a good evaluator model is that its critiques can be used to curate high-quality synthetic data, which I will touch upon later.</p>
</section>
</section>
<section id="level-3-ab-testing" class="level2">
<h2 class="anchored" data-anchor-id="level-3-ab-testing">Level 3: A/B Testing</h2>
<p>Finally, it is always good to perform A/B tests to ensure your AI product is driving the user behaviors or outcomes you desire. A/B testing for LLM products isn’t too different from A/B testing for other types of products. If you want to learn more about A/B testing, I recommend reading the <a href="https://www.geteppo.com/blog">Eppo blog</a> (which was created by former colleagues of mine who are rock stars in A/B testing).</p>
<p>It’s okay to put this stage off until you are confident that your AI product is ready to show to real users. This level of evaluation is usually only appropriate for more mature products.</p>
</section>
<section id="evaluating-rag" class="level2">
<h2 class="anchored" data-anchor-id="evaluating-rag">Evaluating RAG</h2>
<p>Aside from evaluating your system as a whole, you can evaluate sub-components of your AI, like RAG. Evaluating RAG is beyond the scope of this post, but you can learn more about this subject <a href="https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/">in a post by Jason Liu</a>.</p>
</section>
</section>
<section id="eval-systems-unlock-superpowers-for-free" class="level1">
<h1>Eval Systems Unlock Superpowers For Free</h1>
<p>In addition to iterating fast, eval systems unlock the ability to fine-tune and debug, which can take your AI product to the next level.</p>
<section id="fine-tuning" class="level2">
<h2 class="anchored" data-anchor-id="fine-tuning">Fine-Tuning</h2>
<p>Rechat resolved many failure modes through fine-tuning that were not possible with prompt engineering alone. <strong>Fine-tuning is best for learning syntax, style, and rules, whereas techniques like RAG supply the model with context or up-to-date facts.</strong></p>
<p>99% of the labor involved with fine-tuning is assembling high-quality data that covers your AI product’s surface area. However, if you have a solid evaluation system like Rechat’s, you already have a robust data generation and curation engine! I will expand more on the process of fine-tuning in a future post.<sup>4</sup></p>
<section id="data-synthesis-curation" class="level3">
<h3 class="anchored" data-anchor-id="data-synthesis-curation">Data Synthesis &amp; Curation</h3>
<p>To illustrate why data curation and synthesis come nearly for free once you have an evaluation system, consider the case where you want to create additional fine-tuning data for the listing finder mentioned earlier. First, you can use LLMs to generate synthetic data with a prompt like this:</p>
<pre class="text code-overflow-wrap"><code>Imagine if Zillow was able to parse natural language. Come up with 50 different ways users would be able to search listings there. Use real names for cities and neighborhoods.

You can use the following parameters:

&lt;omitted for confidentiality&gt;

Output should be a JSON code block array. Example:

[
"Homes under $500k in New York"
]</code></pre>
<p>This is almost identical to the exercise of producing test cases! You can then use your Level 1 &amp; Level 2 tests to filter out undesirable data that fails assertions or that the critique model thinks is wrong. You can also use your existing human evaluation tools to inspect and curate traces for a fine-tuning dataset.</p>
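<p>The filtering step can be sketched as a simple pipeline. Here <code>assertions</code> are your Level 1 checks and <code>judge</code> is a hypothetical stand-in for a call to your Level 2 critique model:</p>

```python
def curate(candidates, assertions, judge):
    """Keep only synthetic examples that pass every Level 1 assertion
    and that the Level 2 critique model labels "good".

    `assertions` is a list of predicates; `judge` is a hypothetical
    stand-in for an LLM critique call returning "good" or "bad".
    """
    kept = []
    for example in candidates:
        if all(check(example) for check in assertions) and judge(example) == "good":
            kept.append(example)
    return kept
```

<p>Because both filters already exist in your eval system, the fine-tuning dataset is a byproduct rather than a separate project.</p>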
</section>
</section>
<section id="debugging" class="level2">
<h2 class="anchored" data-anchor-id="debugging">Debugging</h2>
<p>When you get a complaint or see an error related to your AI product, you should be able to debug this quickly. If you have a robust evaluation system, you already have:</p>
<ul>
<li>A database of traces that you can search and filter.</li>
<li>A set of mechanisms (assertions, tests, etc) that can help you flag errors and bad behaviors.</li>
<li>Log searching &amp; navigation tools that can help you find the root cause of the error. For example, the error could be RAG, a bug in the code, or a model performing poorly.</li>
<li>The ability to make changes in response to the error and quickly test its efficacy.</li>
</ul>
<p>In short, there is an incredibly large overlap between the infrastructure needed for evaluation and that for debugging.</p>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>Evaluation systems create a flywheel that allows you to iterate very quickly. Evaluation is almost always where people get stuck when building AI products. I hope this post gives you an intuition for how to build your own evaluation systems. Some key takeaways to keep in mind:</p>
<ul>
<li>Remove ALL friction from looking at data.</li>
<li>Keep it simple. Don’t buy fancy LLM tools. Use what you have first.</li>
<li>You are doing it wrong if you aren’t looking at lots of data.</li>
<li>Don’t rely on generic evaluation frameworks to measure the quality of your AI. Instead, create an evaluation system specific to your problem.</li>
<li>Write lots of tests and frequently update them.</li>
<li>LLMs can be used to unblock the creation of an eval system. Examples include using an LLM to:
<ul>
<li>Generate test cases and write assertions</li>
<li>Generate synthetic data</li>
<li>Critique and label data, etc.</li>
</ul></li>
<li>Re-use your eval infrastructure for debugging and fine-tuning.</li>
</ul>
<p>I’d love to hear from you if you found this post helpful or have any questions. My email is <code>hamel@parlance-labs.com</code>.</p>
<p><br></p>
<div class="acknowledgments">
<p><em>This article is an adaptation of <a href="https://www.youtube.com/watch?v=B_DMMlDuJB0">this conversation</a> I had with Emil Sedgh and Hugo Browne-Anderson on the <a href="https://vanishinggradients.fireside.fm/">Vanishing Gradients podcast</a>. Thanks to Jeremy Howard, Eugene Yan, Shreya Shankar, Jeremy Lewi, and Joseph Gleasure for reviewing this article.</em></p>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>This is not to suggest that people are lazy. Many don’t know how to set up eval systems and skip these steps.↩︎</p></li>
<li id="fn2"><p>Some examples include <a href="https://arize.com/">Arize</a>, <a href="https://humanloop.com/">Humanloop</a>, <a href="https://github.com/traceloop/openllmetry">OpenLLMetry</a> and <a href="https://www.honeyhive.ai/">HoneyHive</a>.↩︎</p></li>
<li id="fn3"><p>A reasonable heuristic is to keep reading logs until you feel like you aren’t learning anything new.↩︎</p></li>
<li id="fn4"><p>If you cannot wait, I’ll be teaching <a href="https://maven.com/parlance-labs/fine-tuning">this course</a> on fine-tuning soon.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <category>evals</category>
  <guid>https://hamel.dev/blog/posts/evals/</guid>
  <pubDate>Fri, 29 Mar 2024 07:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/evals/images/diagram-cover.png" medium="image" type="image/png" height="77" width="144"/>
</item>
<item>
  <title>Is Fine-Tuning Still Valuable?</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/fine_tuning_valuable.html</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p>Here is my personal opinion about the questions I posed in <a href="https://x.com/HamelHusain/status/1772426234032541962?s=20">this tweet</a>:</p>
<blockquote class="twitter-tweet blockquote">
<p lang="en" dir="ltr">
There are a growing number of voices expressing disillusionment with fine-tuning. <br><br>I'm curious about the sentiment more generally. (I am withholding sharing my opinion rn). <br><br>Tweets below are from <a href="https://twitter.com/mlpowered?ref_src=twsrc%5Etfw"><span class="citation" data-cites="mlpowered">@mlpowered</span></a> <a href="https://twitter.com/abacaj?ref_src=twsrc%5Etfw"><span class="citation" data-cites="abacaj">@abacaj</span></a> <a href="https://twitter.com/emollick?ref_src=twsrc%5Etfw"><span class="citation" data-cites="emollick">@emollick</span></a> <a href="https://t.co/cU0hCdubBU">pic.twitter.com/cU0hCdubBU</a>
</p>
— Hamel Husain (<span class="citation" data-cites="HamelHusain">@HamelHusain</span>) <a href="https://twitter.com/HamelHusain/status/1772426234032541962?ref_src=twsrc%5Etfw">March 26, 2024</a>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<hr>
<p>I think that fine-tuning is still very valuable in many situations. I’ve done some more digging, and I find that people who say fine-tuning isn’t useful are often working on products where it indeed isn’t likely to help:</p>
<ul>
<li>They are making developer tools - foundation models have been trained extensively on coding tasks.</li>
<li>They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for the most general cases.</li>
<li>They are building a personal assistant that isn’t scoped to any particular domain or use case, which puts them in essentially the same position as the folks building foundation models.</li>
</ul>
<p>Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.</p>
<p><strong>It’s impossible to fine-tune effectively without an eval system, which can lead to writing off fine-tuning if you haven’t completed this prerequisite.</strong> It’s also impossible to improve your product in the long term without a good eval system, fine-tuning or not.</p>
<p>You should do as much prompt engineering as possible before you fine-tune, but not for the reasons you might think! The reason for doing lots of prompt engineering is that it’s a great way to stress-test your eval system.</p>
<p>If you find that prompt-engineering works fine (and you are systematically evaluating your product) then it’s fine to stop there. I’m a big believer in using the simplest approach to solving a problem. I just don’t think you should write off fine-tuning yet.</p>
<section id="examples-where-ive-seen-fine-tuning-work-well" class="level3">
<h3 class="anchored" data-anchor-id="examples-where-ive-seen-fine-tuning-work-well">Examples where I’ve seen fine-tuning work well</h3>
<p>Generally speaking, fine-tuning works best to learn syntax, style and rules whereas techniques like RAG work best to supply the model with context or up-to-date facts.</p>
<p><em>These are some examples from companies I’ve worked with. Hopefully, we will be able to share more details soon.</em></p>
<ul>
<li><p><a href="https://www.honeycomb.io/blog/introducing-query-assistant">Honeycomb’s Natural Language Query Assistant</a> - previously, the “programming manual” for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.</p></li>
<li><p><a href="https://www.youtube.com/watch?v=B_DMMlDuJB0">ReChat’s Lucy</a> - this is an AI real estate assistant integrated into an existing Real Estate CRM system. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. <a href="https://www.youtube.com/watch?v=B_DMMlDuJB0">This talk</a> has more details.</p></li>
</ul>
<p>P.S. Fine-tuning is not limited to open or “small” models. Plenty of folks have been fine-tuning GPT-3.5, such as <a href="https://x.com/perplexity_ai/status/1695102998463009254?s=20">Perplexity.AI</a> and <a href="https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/">CaseText</a>, to name a few.</p>
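<p>To make the data side concrete, here is a minimal sketch of what chat-format fine-tuning training data can look like as JSONL. The record shape (<code>{"messages": [...]}</code>) follows the convention used by hosted fine-tuning APIs, and the query-DSL example below is entirely made up for illustration — it is not taken from Honeycomb, ReChat, or any other company mentioned above:</p>

```python
import json

# Illustrative training examples: each record is one conversation showing
# the model the exact syntax/style we want it to learn. The DSL here is
# hypothetical.
examples = [
    {"messages": [
        {"role": "system", "content": "Translate the user's question into our internal query DSL."},
        {"role": "user", "content": "slowest endpoints in the last hour"},
        {"role": "assistant", "content": "MAX(duration_ms) GROUP BY endpoint SINCE 1h"},
    ]},
]

# Write one JSON object per line (the JSONL format most fine-tuning
# endpoints expect for uploads).
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity-check: the file round-trips as one JSON object per line.
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # prints 1
```

<p>The hard part is not this file format — it’s curating examples that actually capture the syntax and style you want, which is exactly where the eval system discussed above earns its keep.</p>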


</section>

 ]]></description>
  <guid>https://hamel.dev/blog/posts/fine_tuning_valuable.html</guid>
  <pubDate>Wed, 27 Mar 2024 07:00:00 GMT</pubDate>
</item>
<item>
  <title>Fuck You, Show Me The Prompt.</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/prompt/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<section id="background" class="level2">
<h2 class="anchored" data-anchor-id="background">Background</h2>
<p>There are many libraries that aim to make the output of your LLMs better by <strong>re-writing or constructing the prompt for you</strong>. These libraries purport to make the output of your LLMs:</p>
<ul>
<li>safer <a href="https://github.com/guardrails-ai/guardrails">(ex: guardrails)</a></li>
<li>deterministic <a href="https://github.com/guidance-ai/guidance">(ex: guidance)</a></li>
<li>structured <a href="https://github.com/jxnl/instructor">(ex: instructor)</a></li>
<li>resilient <a href="https://www.langchain.com/">(ex: langchain)</a></li>
<li>… or even optimized for an arbitrary metric <a href="https://github.com/stanfordnlp/dspy">(ex: DSPy)</a>.</li>
</ul>
<p>A common theme among <em>some</em> of these tools is that they encourage users to disintermediate themselves from prompting.</p>
<blockquote class="blockquote">
<p><a href="https://github.com/stanfordnlp/dspy">DSPy</a>: “This is a new paradigm in which LMs and their prompts fade into the background …. you can compile your program again DSPy will create new effective prompts”</p>
</blockquote>
<blockquote class="blockquote">
<p><a href="https://github.com/guidance-ai/guidance">guidance</a> “guidance is a programming paradigm that offers superior control and efficiency compared to conventional prompting …”</p>
</blockquote>
<p>Even when tools don’t discourage prompting, I’ve often found it difficult to retrieve the final prompt(s) these tools send to the language model. <strong>The prompts these tools send to the LLM are a natural language description of what the tools are doing, and are the fastest way to understand how they work.</strong> Furthermore, some tools have <a href="https://github.com/stanfordnlp/dspy?tab=readme-ov-file#4-two-powerful-concepts-signatures--teleprompters">dense terminology</a> to describe internal constructs, which can further obfuscate what they are doing.</p>
<p>For reasons I’ll explain below, I think most people would benefit from the following mindset:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/prompt/slap_3.jpeg" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>In this blog post, I’ll show you how you can <strong>intercept API calls w/prompts for any tool, without having to fumble through docs or read source code.</strong> I’ll show you how to set up and operate <a href="https://mitmproxy.org/">mitmproxy</a> with examples from the LLM tools I previously mentioned.</p>
</section>
<section id="motivation-minimize-accidental-complexity" class="level2">
<h2 class="anchored" data-anchor-id="motivation-minimize-accidental-complexity">Motivation: Minimize accidental complexity</h2>
<p>Before adopting an abstraction, it’s important to consider the dangers of taking on <a href="https://dev.to/alexbunardzic/software-complexity-essential-accidental-and-incidental-3i4d">accidental complexity</a>. This danger is especially acute for LLM abstractions relative to programming abstractions. With LLM abstractions, we often force the user to regress towards writing code instead of conversing with the AI in natural language, which can run counter to the purpose of LLMs:</p>
<center>
<blockquote class="twitter-tweet blockquote">
<p lang="en" dir="ltr">
Programming abstraction -&gt; a human-like language you can use to translate your task into machine code<br><br>LLM abstraction -&gt; an unintelligible framework you can use to translate your task into human language
</p>
— Hamel Husain (<span class="citation" data-cites="HamelHusain">@HamelHusain</span>) <a href="https://twitter.com/HamelHusain/status/1754315254413361553">February 5, 2024</a>
</blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>
<p>While this is a cheeky comment, it’s worth keeping in mind when evaluating tools. There are two primary types of automation that tools provide:</p>
<ul>
<li><strong>Interleaving code and LLMs:</strong> Expressing this automation is often best done through code, since code must be run to carry out the task. Examples include routing, executing functions, retries, chaining, etc.</li>
<li><strong>Re-Writing and constructing prompts</strong>: Expressing your intent is often best done through natural language. However, there are exceptions! For example, it is convenient to express a function definition or schema from code instead of natural language.</li>
</ul>
<p>Many frameworks offer both types of automation. However, going too far with the second type can have negative consequences. Seeing the prompt allows you to decide:</p>
<ol type="1">
<li>Is this framework really necessary?</li>
<li>Should I just steal the final prompt (a string) and jettison the framework?</li>
<li>Can we write a better prompt than this (shorter, aligned with your intent, etc)?</li>
<li>Is this the best approach (does the number of API calls seem appropriate)?</li>
</ol>
<p>In my experience, seeing the prompts and API calls is essential to making informed decisions.</p>
</section>
<section id="intercepting-llm-api-calls" class="level2">
<h2 class="anchored" data-anchor-id="intercepting-llm-api-calls">Intercepting LLM API calls</h2>
<p>There are many possible ways to intercept LLM API calls, such as monkey patching source code or finding a user-facing option. I’ve found that those approaches take far too much time since the quality of source code and documentation can vary greatly. After all, I just want to see API calls without worrying about how the code works!</p>
<p>A framework-agnostic way to see API calls is to set up a proxy that logs your outgoing API requests. This is easy to do with <a href="https://mitmproxy.org/">mitmproxy</a>, a free, open-source HTTPS proxy.</p>
<section id="setting-up-mitmproxy" class="level3">
<h3 class="anchored" data-anchor-id="setting-up-mitmproxy">Setting Up mitmproxy</h3>
<p>This is an opinionated way to set up <code>mitmproxy</code> that’s beginner-friendly for our intended purposes:</p>
<ol type="1">
<li><p>Follow the installation instructions <a href="https://mitmproxy.org/">on the website</a></p></li>
<li><p>Start the interactive UI by running <code>mitmweb</code> in the terminal. Pay attention to the url of the interactive UI in the logs which will look something like this: <code>Web server listening at http://127.0.0.1:8081/</code></p></li>
<li><p>Next, you need to configure your device (i.e.&nbsp;your laptop) to route all traffic through <code>mitmproxy</code>, which listens on <code>http://localhost:8080</code>. Per the documentation:</p>
<blockquote class="blockquote">
<p>We recommend to simply search the web on how to configure an HTTP proxy for your system. Some operating system have a global settings, some browser have their own, other applications use environment variables, etc.</p>
</blockquote>
<p>In my case, a <a href="https://www.google.com/search?q=set+proxy+for+macos&amp;sca_esv=c51a80de1a7d45f0&amp;rlz=1C5CHFA_enUS1048US1049&amp;sxsrf=ACQVn0_ysjr6Kma2_lX8WbB06iPbDi5gUQ%3A1707764982232&amp;ei=9mzKZYXoDcfy0PEPpJqb2Ao&amp;ved=0ahUKEwiFu4CpwKaEAxVHOTQIHSTNBqsQ4dUDCBA&amp;uact=5&amp;oq=set+proxy+for+macos&amp;gs_lp=Egxnd3Mtd2l6LXNlcnAiE3NldCBwcm94eSBmb3IgbWFjb3MyBBAjGCcyBhAAGBYYHjIGEAAYFhgeMgYQABgWGB4yBhAAGBYYHjILEAAYgAQYigUYhgMyCxAAGIAEGIoFGIYDSMk-UMU7WMU7cAd4AZABAJgBVaABVaoBATG4AQPIAQD4AQHCAgoQABhHGNYEGLAD4gMEGAAgQYgGAZAGCA&amp;sclient=gws-wiz-serp">Google search for “set proxy for macos”</a> returned these results:</p>
<blockquote class="blockquote">
<p>choose Apple menu &gt; System Settings, click Network in the sidebar, click a network service on the right, click Details, then click Proxies.</p>
</blockquote>
<p>I then insert <code>localhost</code> and <code>8080</code> in the following places in the UI:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/prompt/mac.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div></li>
<li><p>Next, navigate to <a href="http://mitm.it">http://mitm.it</a> and it will give you instructions on how to install the mitmproxy Certificate Authority (CA), which you will need for intercepting HTTPS requests. (You can also do this manually <a href="https://docs.mitmproxy.org/stable/concepts-certificates/#quick-setup">here</a>.) Also, take note of the location of the CA file as we will reference it later.</p></li>
<li><p>You can test that everything works by browsing to a website like <a href="https://mitmproxy.org/">https://mitmproxy.org/</a>, and seeing the corresponding output in the mitmweb UI, which for me is located at <a href="http://127.0.0.1:8081/">http://127.0.0.1:8081/</a> (look at the logs in your terminal to get the URL).</p></li>
<li><p>Now that you’ve set everything up, you can disable the proxy that you previously enabled on your network. I do this on my Mac by toggling the proxy buttons in the screenshot shown above. This is because we want to scope the proxy to only the Python program, to eliminate unnecessary noise.</p></li>
</ol>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>Networking related software commonly allows you to proxy outgoing requests by setting environment variables. This is the approach we will use to scope our proxy to specific Python programs. However, I encourage you to play with other types of programs to see what you find after you are comfortable!</p>
</div>
</div>
</section>
<section id="environment-variables-for-python" class="level3">
<h3 class="anchored" data-anchor-id="environment-variables-for-python">Environment variables for Python</h3>
<p>We need to set the following environment variables so that the <code>requests</code> and <code>httpx</code> libraries will direct traffic to the proxy and reference the CA file for HTTPS traffic:</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>Make sure you set these environment variables before running any of the code snippets in this blog post.</p>
</div>
</div>
<div id="07c4617c-1d52-4c99-9bed-645efd73ba71" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The location of my CA File</span></span>
<span id="cb1-3">cert_file <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/Users/hamel/Downloads/mitmproxy-ca-cert.pem'</span> </span>
<span id="cb1-4">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'REQUESTS_CA_BUNDLE'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cert_file</span>
<span id="cb1-5">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'SSL_CERT_FILE'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cert_file</span>
<span id="cb1-6">os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'HTTPS_PROXY'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'http://127.0.0.1:8080'</span></span></code></pre></div></div>
</div>
<p>You can do a minimal test by running the following code:</p>
<div id="faf4a70a-bde5-47c0-b649-3179c233c189" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="cb2-2">requests.post(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://httpbin.org/post'</span>, </span>
<span id="cb2-3">              data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'key'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'value'</span>})</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<pre><code>&lt;Response [200]&gt;</code></pre>
</div>
</div>
<p>This will appear in the UI like so:</p>
<p><img src="https://hamel.dev/blog/posts/prompt/mitm_01.png" class="img-fluid"></p>
</section>
</section>
<section id="examples" class="level2">
<h2 class="anchored" data-anchor-id="examples">Examples</h2>
<p>Now for the fun part, let’s run through some examples of LLM libraries and intercept their API calls!</p>
<section id="guardrails" class="level3">
<h3 class="anchored" data-anchor-id="guardrails">Guardrails</h3>
<p>Guardrails allows you to specify structure and types, which it uses to validate and correct the outputs of large language models. This is a hello-world example from the <a href="https://github.com/guardrails-ai/guardrails"><code>guardrails-ai/guardrails</code> README</a>:</p>
<div id="c746dbe7-e96f-4ea0-89d5-ca9b0b0685e8" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel, Field</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> guardrails <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Guard</span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> openai</span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Pet(BaseModel):</span>
<span id="cb4-6">    pet_type: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Field(description<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Species of pet"</span>)</span>
<span id="cb4-7">    name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Field(description<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a unique pet name"</span>)</span>
<span id="cb4-8"></span>
<span id="cb4-9">prompt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    What kind of pet should I get and what should I name it?</span></span>
<span id="cb4-11"></span>
<span id="cb4-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">    $</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{gr.complete_json_suffix_v2}</span></span>
<span id="cb4-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-14">guard <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Guard.from_pydantic(output_class<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>Pet, prompt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>prompt)</span>
<span id="cb4-15"></span>
<span id="cb4-16">validated_output, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>rest <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> guard(</span>
<span id="cb4-17">    llm_api<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>openai.completions.create,</span>
<span id="cb4-18">    engine<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo-instruct"</span></span>
<span id="cb4-19">)</span>
<span id="cb4-20"></span>
<span id="cb4-21"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>validated_output<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>{
    "pet_type": "dog",
    "name": "Buddy</code></pre>
</div>
</div>
<p>What is happening here? How is this structured output and validation working? Looking at the mitmproxy UI, I can see that the above code resulted in two LLM API calls, the first one with this prompt:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode txt code-overflow-wrap code-with-copy"><code class="sourceCode default"><span id="cb6-1">What kind of pet should I get and what should I name it?</span>
<span id="cb6-2"></span>
<span id="cb6-3">    </span>
<span id="cb6-4">Given below is XML that describes the information to extract from this document and the tags to extract it into.</span>
<span id="cb6-5"></span>
<span id="cb6-6">&lt;output&gt;</span>
<span id="cb6-7">    &lt;string name="pet_type" description="Species of pet"/&gt;</span>
<span id="cb6-8">    &lt;string name="name" description="a unique pet name"/&gt;</span>
<span id="cb6-9">&lt;/output&gt;</span>
<span id="cb6-10"></span>
<span id="cb6-11"></span>
<span id="cb6-12">ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and specific types. Be correct and concise.</span>
<span id="cb6-13"></span>
<span id="cb6-14">Here are examples of simple (XML, JSON) pairs that show the expected behavior:</span>
<span id="cb6-15">- `&lt;string name='foo' format='two-words lower-case' /&gt;` =&gt; `{'foo': 'example one'}`</span>
<span id="cb6-16">- `&lt;list name='bar'&gt;&lt;string format='upper-case' /&gt;&lt;/list&gt;` =&gt; `{"bar": ['STRING ONE', 'STRING TWO', etc.]}`</span>
<span id="cb6-17">- `&lt;object name='baz'&gt;&lt;string name="foo" format="capitalize two-words" /&gt;&lt;integer name="index" format="1-indexed" /&gt;&lt;/object&gt;` =&gt; `{'baz': {'foo': 'Some String', 'index': 1}}`</span></code></pre></div></div>
<p><strong>Followed by another call with this prompt:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode txt code-overflow-wrap code-with-copy"><code class="sourceCode default"><span id="cb7-1">I was given the following response, which was not parseable as JSON.</span>
<span id="cb7-2"></span>
<span id="cb7-3">"{\n    \"pet_type\": \"dog\",\n    \"name\": \"Buddy"</span>
<span id="cb7-4"></span>
<span id="cb7-5">Help me correct this by making it valid JSON.</span>
<span id="cb7-6"></span>
<span id="cb7-7">Given below is XML that describes the information to extract from this document and the tags to extract it into.</span>
<span id="cb7-8"></span>
<span id="cb7-9">&lt;output&gt;</span>
<span id="cb7-10">    &lt;string name="pet_type" description="Species of pet"/&gt;</span>
<span id="cb7-11">    &lt;string name="name" description="a unique pet name"/&gt;</span>
<span id="cb7-12">&lt;/output&gt;</span>
<span id="cb7-13"></span>
<span id="cb7-14"></span>
<span id="cb7-15">ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and specific types. Be correct and concise. If you are unsure anywhere, enter `null`.</span></code></pre></div></div>
<p>Woof. That’s a whole lot of ceremony to get structured output! We learned that this library’s approach to structured output uses XML schemas (while others use function calling). It’s worth considering whether you can fashion a better or simpler approach now that the magic has been lifted. Either way, we now have insight into how the tool works without being dragged into unnecessary complexity, which is a win.</p>
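<p>If you do want to try a simpler approach, the core loop is small enough to write yourself: ask for JSON in the prompt, parse and validate locally, and re-prompt only on failure. The sketch below is illustrative — <code>get_structured_output</code> and its <code>call_llm</code> parameter are hypothetical names, and a stub stands in for the real API call:</p>

```python
import json

def get_structured_output(call_llm, prompt, required_keys, max_retries=1):
    """Ask an LLM for JSON and validate it locally, retrying on failure.

    `call_llm` is a placeholder for whatever function sends a prompt to
    your model and returns its text (a hypothetical interface).
    """
    text = call_llm(prompt)
    for _ in range(max_retries + 1):
        try:
            data = json.loads(text)
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            # Parseable but incomplete: ask only for the absent keys.
            text = call_llm(f"Add the missing keys {missing} to this JSON:\n{text}")
        except json.JSONDecodeError:
            # Unparseable: ask for a corrected version, like Guardrails does.
            text = call_llm(f"This was not valid JSON. Return a corrected version:\n{text}")
    raise ValueError("could not obtain valid JSON from the model")

# Demo with a stub standing in for the API: the first reply is the same
# truncated JSON seen above, the second is the corrected version.
replies = iter(['{"pet_type": "dog", "name": "Buddy',
                '{"pet_type": "dog", "name": "Buddy"}'])
result = get_structured_output(lambda p: next(replies),
                               "What pet should I get?", ["pet_type", "name"])
print(result)  # prints {'pet_type': 'dog', 'name': 'Buddy'}
```

<p>In practice you would swap the stub for your actual client call and tighten the validation (e.g., with a Pydantic model), but the point stands: the whole mechanism fits in a dozen lines you can read.</p>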
</section>
<section id="guidance" class="level3">
<h3 class="anchored" data-anchor-id="guidance">Guidance</h3>
<p>Guidance offers constrained generation and programming constructs for writing prompts. Let’s dive into a <a href="https://github.com/guidance-ai/guidance/blob/main/notebooks/tutorials/chat.ipynb">chat example from their tutorials</a>:</p>
<div id="cfe6882f-e22c-4666-9a66-6ce41262c6ea" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> guidance</span>
<span id="cb8-2">gpt35 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> guidance.models.OpenAI(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>)</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb8-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> guidance <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> gen, select, system, user, assistant</span>
<span id="cb8-6"></span>
<span id="cb8-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@guidance</span></span>
<span id="cb8-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> plan_for_goal(lm, goal: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>):</span>
<span id="cb8-9">    </span>
<span id="cb8-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># This is a helper function which we will use below</span></span>
<span id="cb8-11">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> parse_best(prosandcons, options):</span>
<span id="cb8-12">        best <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> re.search(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'Best=</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\d</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, prosandcons)</span>
<span id="cb8-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> best:</span>
<span id="cb8-14">            best <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>  re.search(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r'Best</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*?</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\d</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Best= option is 3'</span>)</span>
<span id="cb8-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> best:</span>
<span id="cb8-16">            best <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(best.group(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb8-17">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb8-18">            best <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb8-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> options[best]</span>
<span id="cb8-20"></span>
<span id="cb8-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Some general instruction to the model</span></span>
<span id="cb8-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> system():</span>
<span id="cb8-23">        lm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a helpful assistant."</span></span>
<span id="cb8-24"></span>
<span id="cb8-25">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Simulate a simple request from the user</span></span>
<span id="cb8-26">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Note that we switch to using 'lm2' here, because these are intermediate steps (so we don't want to overwrite the current lm object)</span></span>
<span id="cb8-27">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> user():</span>
<span id="cb8-28">        lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-29"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        I want to </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>goal<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-30"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Can you please generate one option for how to accomplish this?</span></span>
<span id="cb8-31"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Please make the option very short, at most one line."""</span></span>
<span id="cb8-32"></span>
<span id="cb8-33">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate several options. Note that this means several sequential generation requests</span></span>
<span id="cb8-34">    n_options <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb8-35">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> assistant():</span>
<span id="cb8-36">        options <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb8-37">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n_options):</span>
<span id="cb8-38">            options.append((lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gen(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'option'</span>, temperature<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>))[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"option"</span>])</span>
<span id="cb8-39"></span>
<span id="cb8-40">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Have the user request pros and cons</span></span>
<span id="cb8-41">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> user():</span>
<span id="cb8-42">        lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-43"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        I want to </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>goal<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-44"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Can you please comment on the pros and cons of each of the following options, and then pick the best option?</span></span>
<span id="cb8-45"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        ---</span></span>
<span id="cb8-46"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb8-47">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, opt <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(options):</span>
<span id="cb8-48">            lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Option </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>opt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb8-49">        lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-50"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        ---</span></span>
<span id="cb8-51"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Please discuss each option very briefly (one line for pros, one for cons), and end by saying Best=X, where X is the number of the best option."""</span></span>
<span id="cb8-52"></span>
<span id="cb8-53">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the pros and cons from the model</span></span>
<span id="cb8-54">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> assistant():</span>
<span id="cb8-55">        lm2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> gen(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'prosandcons'</span>, temperature<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>, stop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Best="</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Best="</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gen(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"best"</span>, regex<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[0-9]+"</span>) </span>
<span id="cb8-56"></span>
<span id="cb8-57">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The user now extracts the one selected as the best, and asks for a full plan</span></span>
<span id="cb8-58">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We switch back to 'lm' because this is the final result we want</span></span>
<span id="cb8-59">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> user():</span>
<span id="cb8-60">        lm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb8-61"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        I want to </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>goal<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-62"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Here is my plan: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>options[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(lm2[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"best"</span>])]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb8-63"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">        Please elaborate on this plan, and tell me how to best accomplish it."""</span></span>
<span id="cb8-64"></span>
<span id="cb8-65">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The plan is generated</span></span>
<span id="cb8-66">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> assistant():</span>
<span id="cb8-67">        lm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> gen(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'plan'</span>, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>)</span>
<span id="cb8-68"></span>
<span id="cb8-69">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> lm</span></code></pre></div></div>
</div>
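<p>Setting the guidance-specific syntax aside, the function above is ordinary branch-and-select prompting: fork a side conversation, sample several options, have the model critique them and name a winner, then continue the main conversation with only the winner. Here is a minimal plain-Python sketch of that same control flow. The <code>fake_generate</code> stub is hypothetical (not part of guidance or any API) and simply stands in for a real completion call so the skeleton is runnable:</p>

```python
# A plain-Python skeleton of the branch-and-select flow, with a stub model
# so the control flow is runnable offline. `fake_generate` is a placeholder
# for a real chat-completion call, not a real library function.
def fake_generate(messages):
    # Stand-in for the model: key off the last user message.
    last = messages[-1]["content"]
    if "one option" in last:
        return "Read during your commute."
    if "pros and cons" in last:
        return "Option 0: quick win, low effort. Best=0"
    return "Elaborated plan based on the chosen option."

def plan_for_goal(goal, n_options=3, generate=fake_generate):
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    # Side conversation (the role of lm2): sample several short options.
    ask = {"role": "user",
           "content": f"I want to {goal}. Give one option, at most one line."}
    options = [generate(history + [ask]) for _ in range(n_options)]

    # Side conversation: critique the options and name the best one.
    listing = "\n".join(f"Option {i}: {o}" for i, o in enumerate(options))
    critique = generate(history + [{
        "role": "user",
        "content": f"I want to {goal}. Discuss pros and cons of:\n"
                   f"{listing}\nEnd by saying Best=X."}])
    best = int(critique.rsplit("Best=", 1)[1].split()[0])

    # Main conversation (the role of lm): only the winning option and the
    # final elaboration survive; all intermediate turns are discarded.
    history.append({"role": "user",
                    "content": f"I want to {goal}\n"
                               f"Here is my plan: {options[best]}\n"
                               f"Please elaborate on this plan."})
    history.append({"role": "assistant", "content": generate(history)})
    return history
```

<p>The design point this makes explicit is the same one the comments in the guidance version make: the option-sampling and pros-and-cons turns happen on a forked copy of the conversation, so the final transcript the caller sees contains only the goal, the chosen plan, and its elaboration.</p>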
<div id="7cbd50af-7689-426c-8779-1b0d5e9bfe7a" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gpt35 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> plan_for_goal(goal<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"read more books"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<pre style="margin: 0px; padding: 0px; vertical-align: middle; padding-left: 8px; margin-left: -8px; border-radius: 0px; border-left: 1px solid rgba(127, 127, 127, 0.2); white-space: pre-wrap; font-family: ColfaxAI, Arial; font-size: 15px; line-height: 23px;"><div style="display: flex; border-bottom: 1px solid rgba(127, 127, 127, 0.2); align-items: center;"><div style="flex: 0 0 80px; opacity: 0.5;">system</div><div style="flex-grow: 1; padding: 5px; padding-top: 10px; padding-bottom: 10px; margin-top: 0px; white-space: pre-wrap; margin-bottom: 0px;">You are a helpful assistant.</div></div><div style="display: flex; border-bottom: 1px solid rgba(127, 127, 127, 0.2); align-items: center;"><div style="flex: 0 0 80px; opacity: 0.5;">user</div><div style="flex-grow: 1; padding: 5px; padding-top: 10px; padding-bottom: 10px; margin-top: 0px; white-space: pre-wrap; margin-bottom: 0px;">I want to read more books
Here is my plan: Set aside 30 minutes of dedicated reading time each day.
Please elaborate on this plan, and tell me how to best accomplish it.</div></div><div style="display: flex; border-bottom: 1px solid rgba(127, 127, 127, 0.2); align-items: center;"><div style="flex: 0 0 80px; opacity: 0.5;">assistant</div><div style="flex-grow: 1; padding: 5px; padding-top: 10px; padding-bottom: 10px; margin-top: 0px; white-space: pre-wrap; margin-bottom: 0px;"><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">Setting</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> aside</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> </span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">30</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> minutes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> dedicated</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> each</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> day</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> is</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> great</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> plan</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> read</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> more</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> books</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Here</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> are</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> some</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> tips</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> help</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> accomplish</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> this</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> goal</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">1</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Establish</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> routine</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Choose</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> specific</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> day</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> works</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> best</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> for</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> whether</span><span style="background-color: rgba(0.0, 
165.0, 0, 0.15); border-radius: 3px;" title="1.0"> it</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'s</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> morning</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> during</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> lunch</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> break</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> before</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> bed</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Cons</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">istency</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> is</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> key</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> 
forming</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> habit</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">2</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Create</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">-friendly</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> environment</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Find</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> quiet</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> comfortable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> spot</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> where</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> focus</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> on</span><span style="background-color: 
rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> without</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> distractions</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> It</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> could</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> cozy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> corner</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> home</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> park</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> bench</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" 
title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> local</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> library</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">3</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Mini</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">mi</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">ze</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> distractions</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Put</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> away</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> phone</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> turn</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> off</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> TV</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> avoid</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); 
border-radius: 3px;" title="1.0"> any</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> other</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> potential</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> interruptions</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> during</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> dedicated</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> This</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> will</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> help</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> stay</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> focused</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> fully</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> immer</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0">se</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> yourself</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">4</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Choose</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> books</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> interest</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Select</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> books</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> align</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> personal</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> interests</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> hobbies</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: 
rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> goals</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> When</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'re</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> genuinely</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> interested</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> subject</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> matter</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'ll</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> more</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> motivated</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0"> read</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> regularly</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">5</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Start</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> manageable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> goals</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> If</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'re</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> new</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> have</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> busy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> schedule</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 
0, 0.15); border-radius: 3px;" title="1.0"> start</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> smaller</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> commitment</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> such</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> as</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> </span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">15</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> minutes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> gradually</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> increase</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> it</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> </span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" 
title="1.0">30</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> minutes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> more</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> as</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> become</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> more</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> comfortable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">6</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Set</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> timer</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Use</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> timer</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> app</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> allows</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> track</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 
0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> This</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> will</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> help</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> stay</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> accountable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> ensure</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> that</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> dedicate</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> full</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> </span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">30</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> minutes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" 
title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">7</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Make</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> enjoyable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Create</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> cozy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> atmosphere</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> by</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> lighting</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> candle</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> sip</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">ping</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: 
rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> cup</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> tea</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> playing</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> soft</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> background</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> music</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Eng</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">aging</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> all</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> senses</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> enhance</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0"> experience</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">8</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Join</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> club</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> group</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Consider</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> joining</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> club</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> or</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> participating</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 
165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> group</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> connect</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> fellow</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> lovers</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> This</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> provide</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> additional</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> motivation</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> discussion</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> opportunities</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); 
border-radius: 3px;" title="1.0"> book</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> recommendations</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">9</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Keep</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> log</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Maintain</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> record</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> books</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'ve</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> read</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> along</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 
0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> thoughts</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reflections</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> This</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> help</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> track</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> progress</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> discover</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> patterns</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> in</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> preferences</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 
3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> serve</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> as</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> source</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> inspiration</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> for</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> future</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">10</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> flexible</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">:</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> While</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> it</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'s</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> important</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> have</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> dedicated</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> time</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> flexible</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 
165.0, 0, 0.15); border-radius: 3px;" title="1.0"> adaptable</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Life</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> can</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> sometimes</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> get</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> busy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> so</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> if</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> miss</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> day</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> don</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">'t</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> be</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> discouraged</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" 
title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Simply</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> pick</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> up</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> where</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> you</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> left</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> off</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> continue</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> with</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> routine</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.

</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">Remember</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">,</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> goal</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> is</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> to</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> enjoy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> the</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> process</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> and</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> make</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> it</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> a</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> regular</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> part</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> of</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> your</span><span style="background-color: rgba(0.0, 
165.0, 0, 0.15); border-radius: 3px;" title="1.0"> life</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">.</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> Happy</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0"> reading</span><span style="background-color: rgba(0.0, 165.0, 0, 0.15); border-radius: 3px;" title="1.0">!</span></div></div></pre>
</div>
</div>
<p>This looks pretty neat! But what is it doing, exactly? <strong>This makes a total of 7 calls to OpenAI</strong>, which I have collected in <a href="https://gist.github.com/hamelsmu/d0d75bf702e56987f35cb715f7da4d6a">this gist</a>. <strong>Five of the seven API calls are “internal” thoughts asking the LLM to generate ideas.</strong> Even though the temperature is set to 1.0, <strong>these “ideas” are mostly redundant.</strong> The penultimate call to OpenAI enumerates these “ideas”, which I’ve included below:</p>
<div id="a3a0664f-577d-4a0c-96e7-74527036a0be" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="cell-output cell-output-stdout">
<pre><code>I want to read more books
Can you please comment on the pros and cons of each of the following options, and then pick the best option?
---
Option 0: Set aside dedicated time each day for reading.
Option 1: Set aside 30 minutes of dedicated reading time each day.
Option 2: Set aside dedicated time each day for reading.
Option 3: Set aside dedicated time each day for reading.
Option 4: Join a book club.
---
Please discuss each option very briefly (one line for pros, one for cons), and end by saying Best=X, where X is the number of the best option.</code></pre>
</div>
</div>
<p>I know from experience that you are likely to get better results if you tell the language model to generate ideas in one shot. That way, the LLM can reference its previous ideas and produce more diverse suggestions. This is a good example of accidental complexity: it’s very tempting to take this design pattern and apply it blindly. This is less of a critique of this particular framework, since the code makes it clear that 5 independent calls will happen. Either way, it’s a good idea to check your work by inspecting API calls!</p>
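<p>To make the one-shot alternative concrete, here is a minimal sketch (the function name and exact wording are mine, not from any framework): a single prompt that asks for all the ideas at once, so the model can see, and avoid repeating, its earlier suggestions.</p>

```python
def build_single_shot_ideas_prompt(goal: str, n_ideas: int = 5) -> str:
    """Ask for all ideas in one completion so the model can avoid repeats."""
    idea_slots = "\n".join(f"Idea {i}:" for i in range(1, n_ideas + 1))
    return (
        f"{goal}\n\n"
        f"Brainstorm {n_ideas} distinct ways to achieve this. "
        "Each new idea must differ meaningfully from the ones before it.\n"
        f"{idea_slots}"
    )

# One API call instead of five independent "internal thought" calls.
prompt = build_single_shot_ideas_prompt("I want to read more books")
print(prompt)
```

Because the model generates each idea in the same context window as the previous ones, it can diversify on purpose rather than by luck of the temperature setting.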
</section>
<section id="langchain" class="level3">
<h3 class="anchored" data-anchor-id="langchain">Langchain</h3>
<p>Langchain is a multi-tool for all things LLM. Lots of people rely on Langchain when getting started with LLMs. The core LangChain library doesn’t generally hide prompts from you; however, there are experimental features that do. Let’s take a look at one of these features, called <a href="https://api.python.langchain.com/en/latest/smart_llm/langchain_experimental.smart_llm.base.SmartLLMChain.html">SmartLLMChain</a>:</p>
<div id="600b1e60-6c01-429e-834b-470730b5ea26" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> langchain.prompts <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> PromptTemplate</span>
<span id="cb11-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> langchain_experimental.smart_llm <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SmartLLMChain</span>
<span id="cb11-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> langchain_openai <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ChatOpenAI</span>
<span id="cb11-4"></span>
<span id="cb11-5">hard_question <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I have a 12 liter jug and a 6 liter jug.</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb11-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">I want to measure 6 liters. How do I do it?"</span></span>
<span id="cb11-7">prompt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PromptTemplate.from_template(hard_question)</span>
<span id="cb11-8">llm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ChatOpenAI(temperature<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, model_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>)</span></code></pre></div></div>
</div>
<div id="bdb002f1-5f1b-40a7-ad4f-c9052813ef1f" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1">chain <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SmartLLMChain(llm<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>llm, prompt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>prompt, </span>
<span id="cb12-2">                      n_ideas<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, </span>
<span id="cb12-3">                      verbose<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb12-4">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> chain.run({})</span></code></pre></div></div>
</div>
<div id="13be7600-e9eb-4ed2-91d9-a8182858ca83" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(result)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Idea 1: 1. Fill the 12 liter jug completely.
2. Pour the contents of the 12 liter jug into the 6 liter jug. This will leave you with 6 liters in the 12 liter jug.
3. Empty the 6 liter jug.
4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.
5. You now have 6 liters in the 6 liter jug.

Idea 2: 1. Fill the 12 liter jug completely.
2. Pour the contents of the 12 liter jug into the 6 liter jug. This will leave you with 6 liters in the 12 liter jug.
3. Empty the 6 liter jug.
4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.
5. You now have 6 liters in the 6 liter jug.

Improved Answer:
1. Fill the 12 liter jug completely.
2. Pour the contents of the 12 liter jug into the 6 liter jug until the 6 liter jug is full. This will leave you with 6 liters in the 12 liter jug and the 6 liter jug completely filled.
3. Empty the 6 liter jug.
4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.
5. You now have 6 liters in the 6 liter jug.

Full Answer:
To measure 6 liters using a 12 liter jug and a 6 liter jug, follow these steps:
1. Fill the 12 liter jug completely.
2. Pour the contents of the 12 liter jug into the 6 liter jug until the 6 liter jug is full. This will leave you with 6 liters in the 12 liter jug and the 6 liter jug completely filled.
3. Empty the 6 liter jug.
4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.
5. You now have 6 liters in the 6 liter jug.</code></pre>
</div>
</div>
<p>Neat! So what happened exactly? While this API emits logs that show you a lot of information (available on <a href="https://gist.github.com/hamelsmu/abfb14b0af4c70e8532f9d4e0ef3e54e">this gist</a>), the API request pattern is interesting:</p>
<ol type="1">
<li><p>Two <em>separate</em> API calls, one for each “idea”.</p></li>
<li><p>Another API call that incorporates the two ideas as context, with the prompt:</p>
<blockquote class="blockquote">
<p>You are a researcher tasked with investigating the 2 response options provided. List the flaws and faulty logic of each answer options. Let’w work this out in a step by step way to be sure we have all the errors:”</p>
</blockquote></li>
<li><p>A final API call that takes the critique from step 2 and generates an answer.</p></li>
</ol>
<p>It’s not clear that this approach is optimal. I am not sure this task should take 4 separate API calls. Perhaps the critique and the final answer could be generated in a single step? Furthermore, the prompt has a spelling error (<code>Let'w</code>) and focuses narrowly on identifying errors, which makes me skeptical that this prompt has been optimized or tested.</p>
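<p>As a sketch of that idea (the function name and wording here are hypothetical, not part of LangChain), the critique and the final answer could be requested in one call by asking the model to do both in sequence:</p>

```python
def build_critique_and_answer_prompt(question: str, ideas: list[str]) -> str:
    """Combine the critique step and the final-answer step into one request."""
    numbered = "\n".join(f"Option {i}: {idea}" for i, idea in enumerate(ideas))
    return (
        f"{question}\n\n{numbered}\n\n"
        "First, briefly note the flaws or faulty logic in each option. "
        "Then write a single corrected, step-by-step answer that avoids "
        "those flaws."
    )

prompt = build_critique_and_answer_prompt(
    "I have a 12 liter jug and a 6 liter jug. I want to measure 6 liters.",
    [
        "Fill the 12 liter jug and pour it into the 6 liter jug.",
        "Fill the 6 liter jug directly to the top.",
    ],
)
print(prompt)
```

Whether this performs as well as two separate calls is an empirical question — but it halves the number of round trips, and inspecting the actual API traffic (as above) is how you would find out.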
</section>
<section id="instructor" class="level3">
<h3 class="anchored" data-anchor-id="instructor">Instructor</h3>
<p><a href="https://github.com/jxnl/instructor">Instructor</a> is a framework for structured outputs.</p>
<section id="structred-data-extraction-with-pydantic" class="level4">
<h4 class="anchored" data-anchor-id="structred-data-extraction-with-pydantic">Structured data extraction with Pydantic</h4>
<p>Here is a basic example from the project’s <a href="https://github.com/jxnl/instructor">README</a> that allows you to extract structured data by using Pydantic to define your schema.</p>
<div id="8469f9f6-6998-4de7-8551-c073929e385c" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> instructor</span>
<span id="cb15-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> openai <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> OpenAI</span>
<span id="cb15-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pydantic <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BaseModel</span>
<span id="cb15-4"></span>
<span id="cb15-5">client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> instructor.patch(OpenAI())</span>
<span id="cb15-6"></span>
<span id="cb15-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> UserDetail(BaseModel):</span>
<span id="cb15-8">    name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span></span>
<span id="cb15-9">    age: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span></span>
<span id="cb15-10"></span>
<span id="cb15-11">user <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.chat.completions.create(</span>
<span id="cb15-12">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>,</span>
<span id="cb15-13">    response_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>UserDetail,</span>
<span id="cb15-14">    messages<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"role"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"content"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Extract Jason is 25 years old"</span>}])</span></code></pre></div></div>
</div>
<p>We can see how this works by inspecting the API call logged to mitmproxy:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode json code-overflow-wrap code-with-copy"><code class="sourceCode json"><span id="cb16-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-2">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"function_call"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-3">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UserDetail"</span></span>
<span id="cb16-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb16-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"functions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb16-6">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-7">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Correctly extracted `UserDetail` with all the required parameters with correct types"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-8">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UserDetail"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-9">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"parameters"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-10">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"properties"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-11">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"age"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-12">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-13">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"integer"</span></span>
<span id="cb16-14">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb16-15">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-16">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-17">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span></span>
<span id="cb16-18">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb16-19">                <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb16-20">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"required"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb16-21">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-22">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span></span>
<span id="cb16-23">                <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-24">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"object"</span></span>
<span id="cb16-25">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb16-26">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb16-27">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-28">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"messages"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb16-29">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb16-30">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Extract Jason is 25 years old"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-31">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span></span>
<span id="cb16-32">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb16-33">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb16-34">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span></span>
<span id="cb16-35"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>This is great. For structured output, <strong>it does exactly what I want, and it correctly uses the OpenAI API the way I would use it</strong> if I were writing this manually (by defining a function schema). I would consider this specific API a zero-cost abstraction: it does exactly what I expect it to, with a minimal surface area.</p>
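<p>For comparison, here is roughly the request you would assemble by hand without instructor. This is a sketch: the schema below is transcribed from the captured payload above, and no network call is made here.</p>

```python
# The function schema instructor derives from the Pydantic model, written
# out manually. Sending this dict as the JSON body of a chat-completions
# request reproduces the captured call above.
request_body = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "user", "content": "Extract Jason is 25 years old"}
    ],
    "functions": [
        {
            "name": "UserDetail",
            "description": (
                "Correctly extracted `UserDetail` with all the required "
                "parameters with correct types"
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "age": {"title": "Age", "type": "integer"},
                    "name": {"title": "Name", "type": "string"},
                },
                "required": ["age", "name"],
            },
        }
    ],
    # Forcing this function guarantees the model returns structured output.
    "function_call": {"name": "UserDetail"},
}
```

The abstraction earns its keep precisely because the generated request matches what you would write yourself — nothing extra is smuggled into the prompt.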
</section>
<section id="validation" class="level4">
<h4 class="anchored" data-anchor-id="validation">Validation</h4>
<p>However, instructor has other APIs that are more aggressive and write prompts for you. For example, consider this <a href="https://python.useinstructor.com/tutorials/4-validation/">validation example</a>. Running through that example raises questions similar to those from the exploration of Langchain’s SmartLLMChain above. In this example, you will observe 3 LLM API calls to get the right answer, with the final payload looking like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb17-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-2">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"function_call"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-3">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Validator"</span></span>
<span id="cb17-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb17-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"functions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb17-6">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-7">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Validate if an attribute is correct and if not,</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">return a new value with an error message"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-8">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"name"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Validator"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-9">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"parameters"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-10">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"properties"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-11">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"fixed_value"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-12">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"anyOf"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb17-13">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-14">                                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span></span>
<span id="cb17-15">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-16">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-17">                                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"null"</span></span>
<span id="cb17-18">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-19">                        <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-20">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">null</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-21">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"If the attribute is not valid, suggest a new value for the attribute"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-22">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fixed Value"</span></span>
<span id="cb17-23">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb17-24">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"is_valid"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-25">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-26">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Whether the attribute is valid based on the requirements"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-27">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Is Valid"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-28">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"boolean"</span></span>
<span id="cb17-29">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb17-30">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"reason"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-31">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"anyOf"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb17-32">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-33">                                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"string"</span></span>
<span id="cb17-34">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-35">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-36">                                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"null"</span></span>
<span id="cb17-37">                            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-38">                        <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-39">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"default"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">null</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-40">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"description"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The error message if the attribute is not valid, otherwise None"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-41">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"title"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Reason"</span></span>
<span id="cb17-42">                    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-43">                <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb17-44">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"required"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-45">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"type"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"object"</span></span>
<span id="cb17-46">            <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-47">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-48">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-49">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"messages"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb17-50">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-51">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You are a world class validation model. Capable to determine if the following value is valid for the statement, if it is not, explain why and suggest a new value."</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-52">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"system"</span></span>
<span id="cb17-53">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-54">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb17-55">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"content"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Does `According to some perspectives, the meaning of life is to find purpose, happiness, and fulfillment. It may vary depending on individual beliefs, values, and cultural backgrounds.` follow the rules: don't say objectionable things"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-56">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"role"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"user"</span></span>
<span id="cb17-57">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb17-58">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-59">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb17-60">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"temperature"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb17-61"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>Concretely, I’m curious whether these steps could be collapsed into two LLM calls instead of three. I also wonder whether generic validation functions (as supplied in the above payload) are the right way to critique output. I don’t know the answer, but this is an interesting design pattern worth poking at.</p>
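<p>One hypothetical way to collapse extraction and validation into a single call is to make the response model carry its own critique. This is only a sketch, not instructor’s actual API surface: the <code>CritiquedAnswer</code> name and the <code>answer</code> field are mine, while <code>is_valid</code> and <code>reason</code> mirror the fields in the payload above.</p>

```python
from typing import Optional
from pydantic import BaseModel, Field

class CritiquedAnswer(BaseModel):
    # Hypothetical combined model: the answer plus its self-critique,
    # extracted in one completion instead of a separate validator call.
    # is_valid / reason mirror the fields in the payload shown above.
    answer: str
    is_valid: bool = Field(
        default=True,
        description="Whether the answer is valid based on the requirements")
    reason: Optional[str] = Field(
        default=None,
        description="The error message if the answer is not valid, otherwise None")
```

<p>With instructor you would pass something like this as the <code>response_model</code>, so one completion returns both the answer and its critique. Whether a model can critique itself as reliably in a single pass is exactly the open question.</p>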
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>As far as LLM frameworks go, I really like this one. The core functionality of defining schemas with Pydantic is very convenient. The code is also very readable and easy to understand. Despite this, I still found it helpful to intercept instructor’s API calls to get another perspective.</p>
<p>There is a way to set a logging level in instructor to see the raw API calls; however, I prefer a framework-agnostic approach :)</p>
</div>
</div>
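<p>The framework-agnostic approach can be as small as a mitmproxy addon. The sketch below is hypothetical (the file name and host filter are mine): run your program with its HTTPS traffic routed through the proxy (e.g. <code>HTTPS_PROXY=http://localhost:8080</code>, with mitmproxy’s CA certificate trusted), and every OpenAI request body gets printed, whatever framework produced it.</p>

```python
# log_llm.py -- run with: mitmproxy -s log_llm.py
# then point your program at the proxy, e.g. HTTPS_PROXY=http://localhost:8080

OPENAI_HOST = "api.openai.com"  # assumption: the default OpenAI endpoint

def is_llm_call(host):
    # Only inspect traffic headed for the OpenAI API
    return OPENAI_HOST in host

def request(flow):
    # mitmproxy invokes this hook for every intercepted request
    if is_llm_call(flow.request.pretty_host):
        print(flow.request.get_text())  # raw JSON payload, prompt included
```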
</section>
</section>
<section id="dspy" class="level3">
<h3 class="anchored" data-anchor-id="dspy">DSPy</h3>
<p><a href="https://github.com/stanfordnlp/dspy">DSPy</a> is a framework that optimizes your prompts against any arbitrary metric. DSPy has a fairly steep learning curve, partly because it introduces many technical terms specific to the framework, like compilers and teleprompters. However, we can quickly peel back the complexity by looking at the API calls it makes!</p>
<p>Let’s run the <a href="https://dspy-docs.vercel.app/docs/quick-start/minimal-example">minimal working example</a>:</p>
<div id="a3e9870b-741b-4b48-8dd4-c7b619fe8693" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb18-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> dspy</span>
<span id="cb18-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> dspy.datasets.gsm8k <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GSM8K, gsm8k_metric</span>
<span id="cb18-4">start_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb18-5"></span>
<span id="cb18-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set up the LM</span></span>
<span id="cb18-7">turbo <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dspy.OpenAI(model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'gpt-3.5-turbo-instruct'</span>, max_tokens<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">250</span>)</span>
<span id="cb18-8">dspy.settings.configure(lm<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>turbo)</span>
<span id="cb18-9"></span>
<span id="cb18-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load math questions from the GSM8K dataset</span></span>
<span id="cb18-11">gms8k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GSM8K()</span>
<span id="cb18-12">trainset, devset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gms8k.train, gms8k.dev</span></code></pre></div></div>
</div>
<div id="fba9a01d-da36-43ba-a2be-2c6801f9154e" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> CoT(dspy.Module):</span>
<span id="cb19-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb19-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>().<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>()</span>
<span id="cb19-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.prog <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dspy.ChainOfThought(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"question -&gt; answer"</span>)</span>
<span id="cb19-5">    </span>
<span id="cb19-6">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> forward(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, question):</span>
<span id="cb19-7">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.prog(question<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>question)</span></code></pre></div></div>
</div>
<div id="95f7feaf-0ad0-4691-9488-a6214abc2bcd" class="cell" data-quarto-private-1="{&quot;key&quot;:&quot;vscode&quot;,&quot;value&quot;:{&quot;languageId&quot;:&quot;python&quot;}}">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> dspy.teleprompt <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BootstrapFewShotWithRandomSearch</span>
<span id="cb20-2"></span>
<span id="cb20-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set up the optimizer: we want to "bootstrap" (i.e., self-generate) 8-shot examples of our CoT program.</span></span>
<span id="cb20-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The optimizer will repeat this 10 times (plus some initial attempts) before selecting its best attempt on the devset.</span></span>
<span id="cb20-5">config <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(max_bootstrapped_demos<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, max_labeled_demos<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, num_candidate_programs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, num_threads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb20-6"></span>
<span id="cb20-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Optimize! Use the `gms8k_metric` here. In general, the metric is going to tell the optimizer how well it's doing.</span></span>
<span id="cb20-8">teleprompter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> BootstrapFewShotWithRandomSearch(metric<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>gsm8k_metric, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>config)</span>
<span id="cb20-9">optimized_cot <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> teleprompter.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">compile</span>(CoT(), trainset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>trainset, valset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>devset)</span></code></pre></div></div>
</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>This was not very minimal
</div>
</div>
<div class="callout-body-container callout-body">
<p>Despite this being the official <a href="https://dspy-docs.vercel.app/docs/quick-start/minimal-example">quick-start/minimal working</a> example, this code took <strong>more than 30 minutes to run and made hundreds of calls to OpenAI!</strong> That cost non-trivial time (and money), especially as an entry point for someone just taking a first look at the library. There was no warning that this would happen.</p>
</div>
</div>
<p>DSPy made hundreds of API calls because it was iteratively sampling examples for a few-shot prompt and selecting the best ones according to the <code>gsm8k_metric</code> on a validation set. I was able to quickly understand this by scanning through the API requests logged to mitmproxy.</p>
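<p>To see why the call count explodes, here is a rough, hypothetical sketch of what a random-search few-shot optimizer does. This is my simplification, not DSPy’s actual implementation; <code>predict</code> stands in for a chain-of-thought LLM call.</p>

```python
import random

def random_search_fewshot(trainset, devset, metric, predict,
                          n_candidates=10, k=8):
    # Hypothetical simplification of BootstrapFewShotWithRandomSearch:
    # sample candidate demo sets, score each on the dev set, keep the best.
    # LLM calls scale as n_candidates * len(devset) -- hence "hundreds".
    best_score, best_demos = -1.0, None
    for _ in range(n_candidates):
        demos = random.sample(trainset, k)             # candidate few-shot examples
        preds = [predict(demos, ex) for ex in devset]  # one LLM call per dev example
        score = sum(metric(ex, p) for ex, p in zip(devset, preds)) / len(devset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_demos
```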
<p>DSPy offers an <code>inspect_history</code> method which allows you to see the last <code>n</code> prompts and their completions:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">turbo.inspect_history(n<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div></div>
<p>I verified that these prompts matched the last few API calls shown in mitmproxy. Overall, I would be tempted to keep the prompt and jettison the library. That said, I am curious to see how this library evolves.</p>
</section>
</section>
<section id="my-personal-experience" class="level2">
<h2 class="anchored" data-anchor-id="my-personal-experience">My Personal Experience</h2>
<p>Do I hate LLM libraries? No! I think many of the libraries in this blog post could be helpful if used thoughtfully in the right situations. However, I’ve witnessed too many people fall into the trap of using these libraries without understanding what they are doing.</p>
<p>One thing I focus on as an independent consultant is to make sure my clients don’t take on accidental complexity. It’s very tempting to adopt additional tools given all the excitement around LLMs. Looking at prompts is one way to mitigate that temptation.</p>
<p>I’m wary of frameworks that distance the human too far from LLMs. By whispering <em>“Fuck you, show me the prompt!”</em> when using these tools, you are empowered to decide for yourself.<sup>1</sup></p>
<p><br> <br></p>
<div class="acknowledgments">
<p><em>Acknowledgments: Thanks to <a href="https://twitter.com/jeremyphoward">Jeremy Howard</a> and <a href="https://twitter.com/bclavie">Ben Clavie</a> for thoughtfully reviewing this post.</em></p>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>You don’t have to whisper. Saying it out loud is fine too - let others know!↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>llms</category>
  <category>ml</category>
  <guid>https://hamel.dev/blog/posts/prompt/</guid>
  <pubDate>Wed, 14 Feb 2024 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/prompt/slap_3.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>How To Debug Axolotl</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/axolotl/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<p><img src="https://hamel.dev/blog/posts/axolotl/images/debug_axolotl.png" class="img-fluid"></p>
<section id="motivation" class="level2">
<h2 class="anchored" data-anchor-id="motivation">Motivation</h2>
<p><a href="https://github.com/OpenAccess-AI-Collective/axolotl">Axolotl</a> is a great project for fine-tuning LLMs. I started contributing to the project, and I found that it was difficult to debug. I wanted to share some tips and tricks I learned along the way, along with configuration files for debugging with VSCode. Moreover, I think being able to debug axolotl empowers developers who encounter bugs or want to understand how the code works. I hope this document helps you get started.</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>This content is now part of the Axolotl docs!
</div>
</div>
<div class="callout-body-container callout-body">
<p><a href="https://github.com/OpenAccess-AI-Collective/axolotl/pull/1089">I contributed</a> this blog post’s contents as documentation for the axolotl project. <strong>You can find this content <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/debugging.md">in the axolotl repo here</a></strong>.</p>
</div>
</div>
</section>
<section id="general-tips" class="level2">
<h2 class="anchored" data-anchor-id="general-tips">General Tips</h2>
<p>While debugging, it’s helpful to simplify your test scenario as much as possible. Here are some tips for doing so:</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>All of these tips are incorporated into the example configuration for debugging with VSCode below.</p>
</div>
</div>
<ol type="1">
<li><p><strong>Make sure you are using the latest version of axolotl</strong>: This project changes often and bugs get fixed fast. Check your git branch and make sure you have pulled the latest changes from <code>main</code>.</p></li>
<li><p><strong>Eliminate Concurrency</strong>: Restrict the number of processes to 1 for both training and data preprocessing:</p>
<ul>
<li>Set <code>CUDA_VISIBLE_DEVICES</code> to a single GPU, ex: <code>export CUDA_VISIBLE_DEVICES=0</code>.</li>
<li>Set <code>dataset_processes: 1</code> in your axolotl config or run the training command with <code>--dataset_processes=1</code>.</li>
</ul></li>
<li><p><strong>Use a small dataset</strong>: Construct or use a small dataset from the HF Hub. When using a small dataset, you will often have to set <code>sample_packing: False</code> and <code>eval_sample_packing: False</code> to avoid errors. If you are in a pinch and don’t have time to construct a small dataset but want to use one from the HF Hub, you can shard the data. This will still tokenize the entire dataset but will only use a fraction of it for training. For example, to shard the dataset into 20 pieces, add the following to your axolotl config:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dataset</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb1-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    ...</span></span>
<span id="cb1-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">shards</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span></span></code></pre></div></div></li>
<li><p><strong>Use a small model</strong>: A good example of a small model is <a href="https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0">TinyLlama/TinyLlama-1.1B-Chat-v1.0</a>.</p></li>
<li><p><strong>Minimize iteration time</strong>: Make sure the training loop finishes as fast as possible, with these settings.</p>
<ul>
<li><code>micro_batch_size: 1</code></li>
<li><code>max_steps: 1</code></li>
<li><code>val_set_size: 0</code></li>
</ul></li>
<li><p><strong>Clear Caches:</strong> Axolotl caches certain steps and so does the underlying HuggingFace trainer. You may want to clear some of these caches when debugging.</p>
<ul>
<li>Data preprocessing: When debugging data preprocessing, which includes prompt template formation, you may want to delete the directory set in <code>dataset_prepared_path:</code> in your axolotl config. If you didn’t set this value, the default is <code>last_run_prepared</code>.</li>
<li>HF Hub: If you are debugging data preprocessing, you should clear the relevant <a href="https://huggingface.co/docs/datasets/cache">HuggingFace datasets cache</a> by deleting the appropriate <code>~/.cache/huggingface/datasets/...</code> folder(s).</li>
<li><strong>The recommended approach is to redirect all outputs and caches to a temporary folder and delete selected subfolders before each run. This is demonstrated in the example configuration below.</strong></li>
</ul></li>
</ol>
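<p>Putting the tips above together, a minimal debugging config might look like the sketch below. Treat it as a starting point only: key names and defaults can change between axolotl versions, so verify each option against the version you have installed.</p>

```yaml
# Sketch of a fast-iteration debug config (verify keys against your axolotl version)
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0   # small model

datasets:
  - path: philschmid/guanaco-sharegpt-style      # example dataset on the HF Hub
    type: sharegpt
    shards: 20             # only if you must use a large HF Hub dataset

dataset_processes: 1       # eliminate concurrency in preprocessing
sample_packing: false      # often required with tiny datasets
eval_sample_packing: false

micro_batch_size: 1        # minimize iteration time
max_steps: 1
val_set_size: 0

dataset_prepared_path: /tmp/axolotl-debug/last_run_prepared  # easy to clear between runs
```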
</section>
<section id="debugging-with-vscode" class="level2">
<h2 class="anchored" data-anchor-id="debugging-with-vscode">Debugging with VSCode</h2>
<section id="background" class="level3">
<h3 class="anchored" data-anchor-id="background">Background</h3>
<p>The below example shows how to configure VSCode to debug data preprocessing of the <code>sharegpt</code> format. This is the format used when you have the following in your axolotl config:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">datasets</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb2-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">path</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> &lt;path to your sharegpt formatted dataset&gt;</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> # example on HF Hub: philschmid/guanaco-sharegpt-style</span></span>
<span id="cb2-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">type</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> sharegpt</span></span></code></pre></div></div>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you are already familiar with advanced VSCode debugging, you can skip the below explanation and look at the files <a href="../.vscode/launch.json">.vscode/launch.json</a> and <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json">.vscode/tasks.json</a> for an example configuration.</p>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you prefer to watch a video rather than read, you can skip to the video tutorial below (though doing both is recommended).</p>
</div>
</div>
</section>
<section id="setup" class="level3">
<h3 class="anchored" data-anchor-id="setup">Setup</h3>
<p>Make sure you have an <a href="https://setuptools.pypa.io/en/latest/userguide/development_mode.html">editable install</a> of Axolotl, which ensures that changes you make to the code are reflected at runtime. Run the following commands from the root of this project:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip3</span> install packaging</span>
<span id="cb3-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip3</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-e</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.[flash-attn,deepspeed]'</span></span></code></pre></div></div>
<section id="remote-hosts" class="level4">
<h4 class="anchored" data-anchor-id="remote-hosts">Remote Hosts</h4>
<p>If you are developing on a remote host, you can easily use VSCode to debug remotely. To do so, follow this <a href="https://code.visualstudio.com/docs/remote/ssh">Remote - SSH guide</a>. You can also see the video below on Docker and Remote SSH debugging.</p>
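<p>For reference, VSCode’s Remote - SSH extension picks up hosts from your SSH config. Below is a minimal sketch of an <code>~/.ssh/config</code> entry; the host alias, address, user, and key path are all placeholders you would replace with your own values:</p>

```
# ~/.ssh/config — example entry (all values below are placeholders)
Host axolotl-dev
    HostName 203.0.113.10
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519
```

<p>Once an entry like this exists, the host shows up in VSCode’s Remote Explorer and you can open your Axolotl checkout on it directly.</p>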
</section>
</section>
<section id="configuration" class="level3">
<h3 class="anchored" data-anchor-id="configuration">Configuration</h3>
<p>The easiest way to get started is to modify the <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/launch.json">.vscode/launch.json</a> file in the axolotl GitHub repo. This is just an example configuration, so you may need to modify or copy it to suit your needs.</p>
<p>For example, to mimic the command <code>cd devtools &amp;&amp; CUDA_VISIBLE_DEVICES=0 accelerate launch -m axolotl.cli.train dev_sharegpt.yml</code>, you would use the configuration below<sup>1</sup>. Note that we add flags that override the axolotl config to incorporate the tips above (see the comments). We also set the working directory to <code>devtools</code> and point the <code>HF_HOME</code> environment variable at a temporary folder, parts of which are deleted before each run: clearing the HF dataset cache ensures the data preprocessing code runs from scratch.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode js code-with-copy"><code class="sourceCode javascript"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/launch.json</span></span>
<span id="cb4-2">{</span>
<span id="cb4-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"version"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0.2.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"configurations"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> [</span>
<span id="cb4-5">        {</span>
<span id="cb4-6">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Debug axolotl prompt - sharegpt"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-7">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"python"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-8">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"module"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"accelerate.commands.launch"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-9">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"request"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"launch"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-10">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"args"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> [</span>
<span id="cb4-11">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-m"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"axolotl.cli.train"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dev_sharegpt.yml"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-12">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The flags below simplify debugging by overriding the axolotl config </span></span>
<span id="cb4-13">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// with the debugging tips above.  Modify as needed.</span></span>
<span id="cb4-14">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--dataset_processes=1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// limits data preprocessing to one process</span></span>
<span id="cb4-15">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--max_steps=1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>              <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// limits training to just one step</span></span>
<span id="cb4-16">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--batch_size=1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>             <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// minimizes batch size</span></span>
<span id="cb4-17">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--micro_batch_size=1"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// minimizes batch size</span></span>
<span id="cb4-18">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--val_set_size=0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// disables validation</span></span>
<span id="cb4-19">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--sample_packing=False"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// disables sample packing which is necessary for small datasets</span></span>
<span id="cb4-20">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--eval_sample_packing=False"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// disables sample packing on eval set</span></span>
<span id="cb4-21">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--dataset_prepared_path=temp_debug/axolotl_outputs/data"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// send data outputs to a temp folder</span></span>
<span id="cb4-22">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--output_dir=temp_debug/axolotl_outputs/model"</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// send model outputs to a temp folder</span></span>
<span id="cb4-23">                ]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb4-24">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"console"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"integratedTerminal"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// show output in the integrated terminal</span></span>
<span id="cb4-25">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cwd"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"${workspaceFolder}/devtools"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// set working directory to devtools from the root of the project</span></span>
<span id="cb4-26">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"justMyCode"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>                   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// step through only axolotl code</span></span>
<span id="cb4-27">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"env"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CUDA_VISIBLE_DEVICES"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Since we aren't doing distributed training, we need to limit to one GPU</span></span>
<span id="cb4-28">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"HF_HOME"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"${workspaceFolder}/devtools/temp_debug/.hf-cache"</span>}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// send HF cache to a temp folder</span></span>
<span id="cb4-29">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"preLaunchTask"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cleanup-for-dataprep"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// delete temp folders (see below)</span></span>
<span id="cb4-30">        }</span>
<span id="cb4-31">    ]</span>
<span id="cb4-32">}</span></code></pre></div></div>
<p><strong>Additional notes about this configuration:</strong></p>
<ul>
<li>The argument <code>justMyCode</code> is set to <code>true</code> such that you step through only the axolotl code. If you want to step into dependencies, set this to <code>false</code>.</li>
<li>The <code>preLaunchTask</code>: <code>cleanup-for-dataprep</code> is defined in <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json">.vscode/tasks.json</a> and is used to delete the following folders before debugging, which is essential to ensure that the data pre-processing code is run from scratch:
<ul>
<li><code>./devtools/temp_debug/axolotl_outputs</code></li>
<li><code>./devtools/temp_debug/.hf-cache/datasets</code></li>
</ul></li>
</ul>
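<p>If you want to perform the same cleanup outside of VSCode (for example, from a plain terminal before a manual run), the two deletions reduce to a short shell sketch, assuming you run it from the root of the project:</p>

```shell
# Manually delete the debug outputs and the temporary HF dataset cache,
# mirroring what the cleanup-for-dataprep preLaunchTask does.
# rm -rf is a no-op if the folders do not exist yet.
rm -rf devtools/temp_debug/axolotl_outputs
rm -rf devtools/temp_debug/.hf-cache/datasets
```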
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>You may not want to delete these folders. For example, if you are debugging model training instead of data pre-processing, you may NOT want to delete the cache or output folders. You may also need to add additional tasks to the <code>tasks.json</code> file depending on your use case.</p>
</div>
</div>
<p>Below is the <a href="https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json">.vscode/tasks.json</a> file that defines the <code>cleanup-for-dataprep</code> task. This task runs before each debugging session when you use the above configuration. Note the two tasks that delete the two folders mentioned above. The third task, <code>cleanup-for-dataprep</code>, is a composite task that combines the two. A composite task is necessary because VSCode does not allow you to specify multiple tasks in the <code>preLaunchTask</code> argument of the <code>launch.json</code> file.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode js code-with-copy"><code class="sourceCode javascript"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json</span></span>
<span id="cb5-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// this file is used by launch.json</span></span>
<span id="cb5-3">{</span>
<span id="cb5-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"version"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2.0.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tasks"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> [</span>
<span id="cb5-6">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// this task changes into the devtools directory and deletes the temp_debug/axolotl_outputs folder</span></span>
<span id="cb5-7">      {</span>
<span id="cb5-8">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"delete-outputs"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-9">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shell"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-10">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"command"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rm -rf temp_debug/axolotl_outputs"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-11">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"options"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>{ <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cwd"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"${workspaceFolder}/devtools"</span>}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-12">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"problemMatcher"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> []</span>
<span id="cb5-13">      }<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-14">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// this task changes into the devtools directory and deletes the `temp_debug/.hf-cache/datasets` folder</span></span>
<span id="cb5-15">      {</span>
<span id="cb5-16">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"delete-temp-hf-dataset-cache"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-17">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shell"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-18">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"command"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rm -rf temp_debug/.hf-cache/datasets"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-19">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"options"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>{ <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cwd"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"${workspaceFolder}/devtools"</span>}<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-20">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"problemMatcher"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> []</span>
<span id="cb5-21">      }<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-22">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// this task combines the two tasks above</span></span>
<span id="cb5-23">      {</span>
<span id="cb5-24">       <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"label"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cleanup-for-dataprep"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-25">       <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dependsOn"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"delete-outputs"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"delete-temp-hf-dataset-cache"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb5-26">      }</span>
<span id="cb5-27">    ]</span>
<span id="cb5-28">}</span></code></pre></div></div>
</section>
<section id="customizing-your-debugger" class="level3">
<h3 class="anchored" data-anchor-id="customizing-your-debugger">Customizing your debugger</h3>
<p>Your debugging use case may differ from the example above. The easiest thing to do is to put your own axolotl config in the <code>devtools</code> folder and modify the <code>launch.json</code> file to use your config. You may also want to modify the <code>preLaunchTask</code> to delete different folders or not delete anything at all.</p>
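<p>As a minimal sketch, a custom entry in <code>launch.json</code> only needs to swap in your own config file; the name <code>my_config.yml</code> below is a placeholder for whatever file you put in <code>devtools</code>:</p>

```js
// A pared-down launch configuration (my_config.yml is a placeholder).
// Add back the flag overrides, env, and preLaunchTask from the example
// above as your use case requires.
{
    "name": "Debug axolotl - my config",
    "type": "python",
    "module": "accelerate.commands.launch",
    "request": "launch",
    "args": ["-m", "axolotl.cli.train", "my_config.yml"],
    "cwd": "${workspaceFolder}/devtools"
}
```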
</section>
<section id="video-tutorial" class="level3">
<h3 class="anchored" data-anchor-id="video-tutorial">Video Tutorial</h3>
<p>The following video tutorial walks through the above configuration and demonstrates how to debug with VSCode:</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/xUUB11yeMmc" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
</section>
<section id="debugging-with-docker" class="level2">
<h2 class="anchored" data-anchor-id="debugging-with-docker">Debugging With Docker</h2>
<p>Using the <a href="https://hub.docker.com/r/winglian/axolotl/tags">official Axolotl Docker images</a> is a popular way to run Axolotl, and a great environment for debugging. Attaching VSCode to a container takes a few extra steps.</p>
<section id="setup-1" class="level3">
<h3 class="anchored" data-anchor-id="setup-1">Setup</h3>
<p>On the host that is running Axolotl (for example, a remote host), clone the axolotl repo and change your current directory to the root:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> clone https://github.com/OpenAccess-AI-Collective/axolotl</span>
<span id="cb6-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> axolotl</span></code></pre></div></div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you already have axolotl cloned on your host, make sure you have the latest changes and change into the root of the project.</p>
</div>
</div>
<p>Next, run the desired docker image and mount the current directory. Below is a docker command you can run to do this:<sup>2</sup></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">docker</span> run <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--privileged</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--gpus</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'"all"'</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--shm-size</span> 10g <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--rm</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-it</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--name</span> axolotl <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ipc</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>host <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ulimit</span> memlock=-1 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--ulimit</span> stack=67108864 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--mount</span> type=bind,src=<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">${PWD}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,target=/workspace/axolotl <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-v</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">${HOME}</span>/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu118-2.0.1</span></code></pre></div></div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>To understand which containers are available, see the <a href="../README.md#docker">Docker section of the README</a> and the <a href="https://hub.docker.com/r/winglian/axolotl/tags">DockerHub repo</a>. For details of how the Docker containers are built, see axolotl’s <a href="../.github/workflows/main.yml">Docker CI builds</a>.</p>
</div>
</div>
<p>You will now be in the container. Next, perform an editable install of Axolotl:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip3</span> install packaging</span>
<span id="cb8-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip3</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-e</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.[flash-attn,deepspeed]'</span></span></code></pre></div></div>
</section>
<section id="attach-to-container" class="level3">
<h3 class="anchored" data-anchor-id="attach-to-container">Attach To Container</h3>
<p>Next, if you are using a remote host, <a href="https://code.visualstudio.com/docs/remote/ssh">Remote into this host with VSCode</a>. If you are using a local host, you can skip this step.</p>
<p>Next, select <code>Dev Containers: Attach to Running Container...</code> using the command palette (<code>CMD + SHIFT + P</code>) in VSCode. You will be prompted to select a container to attach to. Select the container you just created. You will now be in the container with a working directory that is at the root of the project. Any changes you make to the code will be reflected both in the container and on the host.</p>
<p>Now you are ready to debug as described above (see Debugging with VSCode).</p>
</section>
<section id="video---attaching-to-docker-on-remote-host" class="level3">
<h3 class="anchored" data-anchor-id="video---attaching-to-docker-on-remote-host">Video - Attaching To Docker On Remote Host</h3>
<p>Here is a short video that demonstrates how to attach to a Docker container on a remote host:</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/0AuoR7QnHR0" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>The config actually mimics the command <code>CUDA_VISIBLE_DEVICES=0 python -m accelerate.commands.launch -m axolotl.cli.train devtools/dev_sharegpt.yml</code>, which is equivalent.↩︎</p></li>
<li id="fn2"><p>Many of these flags are NVIDIA-recommended best practices for the nvidia-container-toolkit. You can read more about them <a href="https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html">here</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>LLMs</category>
  <category>fine-tuning</category>
  <category>axolotl</category>
  <guid>https://hamel.dev/blog/posts/axolotl/</guid>
  <pubDate>Thu, 11 Jan 2024 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/axolotl/images/debug_axolotl_small.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Dokku: my favorite personal serverless platform</title>
  <dc:creator>Hamel Husain</dc:creator>
  <link>https://hamel.dev/blog/posts/dokku/</link>
  <description><![CDATA[ 

<!-- Content inserted at the beginning of body tag -->
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-PKGWQMKL" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
  <!-- End Google Tag Manager (noscript) -->




<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://hamel.dev/blog/posts/dokku/images/serverless.png" class="img-fluid figure-img"></p>
<figcaption>With Dokku, you can turn a VPS into a powerful serverless platform</figcaption>
</figure>
</div>
<section id="what-is-dokku" class="level2">
<h2 class="anchored" data-anchor-id="what-is-dokku">What is Dokku?</h2>
<p><a href="https://dokku.com/">Dokku</a> is an open-source Platform as a Service (PaaS) that runs on a single server of your choice. <strong>It’s like <a href="https://www.heroku.com/">Heroku</a>, but you own it.</strong> It is a great way to get the benefits of Heroku without the costs (Heroku can get quite expensive!). I need to deploy many applications for my <a href="https://parlance-labs.com/">LLM consulting work</a>. Having a cost-effective, easy-to-use serverless platform is essential for me.</p>
<p><strong>I run a Dokku server on a $7/month VPS on <a href="https://us.ovhcloud.com/">OVHcloud</a></strong> for non-GPU workloads. These applications include things like <a href="https://nbsanity.com/">nbsanity</a> and <a href="https://langfree.parlance-labs.com/tutorials/shiny.html#run-the-shiny-app-locally">data cleaning tools for LLMs</a>.</p>
<p>Some of the features I love about Dokku:</p>
<ul>
<li>Easy to use (like Heroku).</li>
<li>Automatic SSL certificate management via <a href="https://letsencrypt.org/">Let’s Encrypt</a>.</li>
<li>Basic Auth support so I can password-protect sites.</li>
<li>Scale up and down with a single command.</li>
<li>Flexibility to handle any application (Node, Python, etc.), including apps defined by a custom Docker container.</li>
<li>Lots of <a href="https://dokku.com/docs/community/plugins/?h=plugins#official-plugins">official plugins</a> that do almost anything I want.</li>
<li>Easily deploy with git commands.</li>
</ul>
</section>
<section id="minimal-dokku-examples" class="level1">
<h1>Minimal Dokku Examples</h1>
<p>Make sure you <a href="https://dokku.com/docs/getting-started/installation/">install Dokku</a> on your VPS. As I mentioned, I use <a href="https://us.ovhcloud.com/">OVH</a>.</p>
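<p>For reference, the install is a single bootstrap script run on the VPS. A hedged sketch; <code>&lt;version&gt;</code> is a placeholder for the current tagged release listed in the install docs:</p>

```shell
# On the VPS: download and run the Dokku bootstrap installer.
# Replace <version> with the latest tagged release from the install docs.
wget -NP . https://dokku.com/install/<version>/bootstrap.sh
sudo DOKKU_TAG=<version> bash bootstrap.sh
```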
<section id="deploying-apps-as-a-docker-container" class="level2">
<h2 class="anchored" data-anchor-id="deploying-apps-as-a-docker-container">Deploying Apps as A Docker Container</h2>
<p>An easy way to deploy applications is with a Docker container.</p>
<p>To deploy a Docker container, I put a Dockerfile in the root of my git repo like this:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>Dockerfile</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" data-filename="Dockerfile" style="background: #f1f3f5;"><pre class="sourceCode dockerfile code-with-copy"><code class="sourceCode dockerfile"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">FROM</span> python:3.10</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">COPY</span> . /app</span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">WORKDIR</span> /app</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install the local package</span></span>
<span id="cb1-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">RUN</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install .</span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
# This directory contains app.py">
font-style: inherit;"># This directory contains main.py, a FastAPI app</span></span>
<span id="cb1-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">WORKDIR</span> /app/</span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">ENTRYPOINT</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./entrypoint.sh"</span>]</span></code></pre></div></div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>The <code>entrypoint.sh</code> script allows me to easily run the app locally or in a Docker container. It looks like this:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>entrypoint.sh</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" data-filename="entrypoint.sh" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#!/bin/bash</span></span>
<span id="cb2-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">exec</span> uvicorn main:app <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--port</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$PORT</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--host</span> 0.0.0.0</span></code></pre></div></div>
</div>
</div>
</div>
<p>On the Dokku host, create the app:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">dokku</span> apps:create myapp</span></code></pre></div></div>
<p><strong>Locally</strong>, set up access to the Dokku host and name it <code>dokku</code> in your <code>~/.ssh/config</code> file. For example, here is mine:</p>
<pre><code>Host dokku
  HostName &lt;The external IP address of your Dokku host&gt;
  User ubuntu
  IdentityFile /Users/hamel/.ssh/dokku</code></pre>
<p>Locally, add the Dokku host as a remote and push to it:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> remote add dokku dokku@dokku:myapp</span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> push dokku main</span></code></pre></div></div>
<p>That’s it: your app should be running on the Dokku host! Your local logs will print the URL where your application is served, which by default will be <code>myapp.yourdomain.com</code>. You can also scale it up or down with the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#scale to two workers</span></span>
<span id="cb6-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">dokku</span> ps:scale myapp web=2</span></code></pre></div></div>
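dokku</span> ps:scale myapp web=2">
<p>A few other day-to-day commands I reach for, run on the Dokku host (<code>myapp</code> and the values are placeholders):</p>

```shell
# Tail the app's logs:
dokku logs myapp -t

# Set environment variables (restarts the app):
dokku config:set myapp API_KEY=xyz

# Point a custom domain at the app:
dokku domains:add myapp myapp.example.com

# Restart without redeploying:
dokku ps:restart myapp
```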
<p>We are just scratching the surface. For more details, see the <a href="https://dokku.com/docs/">Dokku docs</a>.</p>
</section>
<section id="static-sites" class="level2">
<h2 class="anchored" data-anchor-id="static-sites">Static Sites</h2>
<p>GitHub Pages is annoying in that you can’t easily deploy private static sites without paying for an expensive Enterprise account. With Dokku, you can easily deploy a static site from a private GitHub Repo and password-protect it.</p>
<p>We will assume that you have a static site in a git repo in a folder named <code>_site</code>.</p>
<p><strong>On the Dokku host</strong>, create an app named <code>mysite</code> and set the <code>NGINX_ROOT</code> environment variable to <code>_site</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">dokku</span> apps:create mysite</span>
<span id="cb7-2"><span class="ex" style="color: null;
background-color: null;
dokku</span> config:set static-site">
font-style: inherit;">dokku</span> config:set mysite NGINX_ROOT=_site</span></code></pre></div></div>
<p>Also on the Dokku host, install <a href="https://github.com/dokku/dokku-http-auth">basic auth</a> and <a href="https://github.com/dokku/dokku-http-auth/issues/15#issuecomment-1637058437">set permissions</a> so the plugin can work properly.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># do setup for the auth plugin that we will use later</span></span>
<span id="cb8-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> dokku plugin:install https://github.com/dokku/dokku-http-auth.git</span>
<span id="cb8-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> chmod +x /home/dokku</span></code></pre></div></div>
<p>Then execute the following commands from the root of the git repo that contains your static site:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-8" style="background: #f1f3f5;"><pre class="sourceCode bash code-annotation-code code-with-copy code-annotated"><code class="sourceCode bash"><a class="code-annotation-anchor" data-target-cell="annotated-cell-8" data-target-annotation="1" onclick="event.preventDefault();">1</a><span id="annotated-cell-8-1" class="code-annotation-target"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">touch</span> .static</span>
<a class="code-annotation-anchor" data-target-cell="annotated-cell-8" data-target-annotation="2" onclick="event.preventDefault();">2</a><span id="annotated-cell-8-2" class="code-annotation-target"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> BUILDPACK_URL=https://github.com/dokku/buildpack-nginx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> .env</span>
<a class="code-annotation-anchor" data-target-cell="annotated-cell-8" data-target-annotation="3" onclick="event.preventDefault();">3</a><span id="annotated-cell-8-3" class="code-annotation-target"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> remote add dokku dokku@dokku:mysite</span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-grid">
<dt data-target-cell="annotated-cell-8" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-8" data-code-lines="1" data-code-annotation="1">tells <code>dokku</code> that this is a static site</span>
</dd>
<dt data-target-cell="annotated-cell-8" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-8" data-code-lines="2" data-code-annotation="2">tells <code>dokku</code> to use the nginx buildpack for static sites (it will usually automatically detect this, but if you have a project with code and a static site, you need to tell it to use the nginx buildpack so it doesn’t get confused).</span>
</dd>
<dt data-target-cell="annotated-cell-8" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-8" data-code-lines="3" data-code-annotation="3">add the <code>dokku</code> host as a remote. For this to work, make sure <code>dokku</code> is a hostname in your <code>~/.ssh/config</code> file as described in the previous section.</span>
</dd>
</dl>
<p>Finally, deploy your application:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> push dokku main</span></code></pre></div></div>
<p>You can now add auth by running the following command on the Dokku host:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb10-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">dokku</span> http-auth:enable mysite <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>username<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>password<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>You can add multiple usernames/passwords and even filter specific IPs. See <a href="https://github.com/dokku/dokku-http-auth">the docs</a>.</p>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>SSL / HTTPS
</div>
</div>
<div class="callout-body-container callout-body">
<p>It’s often desirable to have HTTPS for your site. Dokku makes this easy with the <a href="https://github.com/dokku/dokku-letsencrypt">Let’s Encrypt Plugin</a>, which will even auto-renew for you. I don’t use this, because I’m letting <a href="https://developers.cloudflare.com/dns/manage-dns-records/reference/proxied-dns-records/">Cloudflare handle this with its proxy</a>.</p>
<p>If you are using Cloudflare this way, activating this plugin will mess things up (don’t worry, it’s easy to disable). Honestly, I think it’s easier to let Cloudflare handle SSL if you are already doing so.</p>
</div>
</div>
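<p>If you do want the plugin to handle SSL instead of Cloudflare, the setup looks roughly like this (a sketch; command names may vary across plugin versions, so check the plugin README):</p>

```shell
# On the Dokku host: install the plugin, then enable certificates for the app.
sudo dokku plugin:install https://github.com/dokku/dokku-letsencrypt.git
dokku letsencrypt:set mysite email you@example.com
dokku letsencrypt:enable mysite

# Add a cron entry so certificates auto-renew:
dokku letsencrypt:cron-job --add
```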
</section>
</section>
<section id="deploying-with-github-actions" class="level1">
<h1>Deploying With GitHub Actions</h1>
<p>You can automatically deploy Dokku apps with GitHub Actions, which is helpful if you don’t want to fiddle with pushing to the Dokku host. Here is an example GitHub Action workflow that does this:</p>
<div class="code-with-filename">
<div class="code-with-filename-file">
<pre><strong>deploy-dokku.yml</strong></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" data-filename="deploy-dokku.yml" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> CI</span></span>
<span id="cb11-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">workflow_dispatch</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">push</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">branches</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">main</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">]</span></span>
<span id="cb11-6"></span>
<span id="cb11-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">concurrency</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> # Cancel previous jobs to avoid deploy locks on dokku</span></span>
<span id="cb11-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> ${{ github.ref }}</span></span>
<span id="cb11-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cancel-in-progress</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">true</span></span>
<span id="cb11-10"></span>
<span id="cb11-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">jobs</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">deploy-dokku</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-13"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runs-on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> ubuntu-latest</span></span>
<span id="cb11-14"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">steps</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-15"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Checkout code</span></span>
<span id="cb11-16"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">uses</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> actions/checkout@v2</span></span>
<span id="cb11-17"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb11-18"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">          </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fetch-depth</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb11-19"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span></span>
<span id="cb11-20"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Install SSH key</span></span>
<span id="cb11-21"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">        run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">|</span></span>
<span id="cb11-22">          echo "${{ secrets.DOKKU_SSH_PRIVATE_KEY }}" &gt; private_key.pem</span>
<span id="cb11-23">          chmod 600 private_key.pem</span>
<span id="cb11-24"></span>
<span id="cb11-25"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Add remote and push</span></span>
<span id="cb11-26"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">        run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">|</span></span>
<span id="cb11-27">          git remote add dokku dokku@rechat.co:llm-eval</span>
<span id="cb11-28">          GIT_SSH_COMMAND="ssh -i private_key.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" git push dokku main -f</span></code></pre></div></div>
</div>
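<p>For this workflow to authenticate, the repo needs a private key stored as the <code>DOKKU_SSH_PRIVATE_KEY</code> secret and the Dokku host needs the matching public key. A minimal sketch (the key filename and the <code>github-actions</code> key name are arbitrary choices):</p>

```shell
# Generate a dedicated, passphrase-less deploy key for CI:
ssh-keygen -t ed25519 -f dokku-ci -N "" -C "github-actions-deploy"

# dokku-ci      -> paste into the DOKKU_SSH_PRIVATE_KEY repository secret
# dokku-ci.pub  -> register on the Dokku host, e.g.:
#   dokku ssh-keys:add github-actions < dokku-ci.pub
```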
</section>
<section id="miscellaneous-tips" class="level1">
<h1>Miscellaneous Tips</h1>
<p>These are things I often forget, so I’m writing them down here. For these examples, assume my app is named <code>llm-eval</code> and my host is <code>rechat.co</code>.</p>
<section id="run-commands-remotely" class="level2">
<h2 class="anchored" data-anchor-id="run-commands-remotely">Run commands remotely</h2>
<p>You don’t have to ssh into the Dokku host just to execute commands. You can execute them remotely via the <code>dokku</code> user like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://dokku.com/docs/deployment/application-management/</span></span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> dokku@rechat.co apps:list</span></code></pre></div></div>
</section>
<section id="docker-cache" class="level2">
<h2 class="anchored" data-anchor-id="docker-cache">Docker cache</h2>
<p>This is how you can <a href="https://dokku.com/docs/advanced-usage/repository-management/">invalidate the docker cache</a> for a fresh build:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> dokku@rechat.co repo:purge-cache llm-eval</span></code></pre></div></div>
</section>
<section id="rebuild-without-pushing" class="level2">
<h2 class="anchored" data-anchor-id="rebuild-without-pushing">Rebuild without pushing</h2>
<p>Sometimes you want to rebuild without pushing. There are <a href="https://dokku.com/docs/processes/process-management/">many ways to do this</a>, but one way is like this:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb14-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> dokku@rechat.co ps:rebuild llm-eval</span></code></pre></div></div>
</section>
</section>
<section id="why-did-i-write-this" class="level1">
<h1>Why Did I Write This?</h1>
<p>I found myself digging up these details every time I deployed a new app, so I finally wrote them down. I hope you find them useful, too!</p>


</section>

 ]]></description>
  <category>infra</category>
<category>serverless</category>
  <guid>https://hamel.dev/blog/posts/dokku/</guid>
  <pubDate>Tue, 09 Jan 2024 08:00:00 GMT</pubDate>
  <media:content url="https://hamel.dev/blog/posts/dokku/images/serverless.png" medium="image" type="image/png" height="144" width="144"/>
</item>
</channel>
</rss>
