Multi-Turn Chat Evals
These are notes from my open office hours on LLM Evals, where I troubleshoot real issues companies are having with their evals. Each session is 20 minutes.
I spoke with Max from Windmill (see video below) about a common challenge many teams face: evaluating multi-turn chat conversations. Their team had built an AI assistant named Windy that helps managers collect peer feedback and track team focus areas through Slack conversations. While they had success improving prompts manually at first, they reached a point where they needed more robust evaluation approaches.
Watch The Office Hours
Here is the video of the full discussion (20 minutes):
(Sorry about the video quality. I’ll try to fix that in the future.)
This conversation highlighted several key patterns I’ve seen teams struggle with when evaluating conversational AI. Let’s break down how to get unstuck.
Common Pitfalls
When teams first approach evaluating multi-turn conversations, they often:
- Try to evaluate everything at once
- Get overwhelmed by the subjective nature of conversations
- Jump straight to automated solutions before understanding what “good” looks like
- Struggle to define clear success criteria
As Max put it during our discussion: “You get into that whack-a-mole game where you fix one thing and then other stuff gets worse.”
Start with Error Analysis
The counterintuitive first step is not to build an evaluation system, but to do error analysis. In Max’s case, they had already started this process by analyzing cases where users dismissed the chat interaction with “I don’t have any feedback” responses.
This is exactly the right approach. Before building complex evaluation frameworks, you need to:
- Collect real examples of conversations
- Categorize what’s going wrong
- Look for patterns in the failures
As I mentioned during our discussion: “There’s no linear workflow through this. If you find something that’s obviously really wrong all the time, just go fix it. Don’t get nerd sniped by evals in the extreme sense.”
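To make the "categorize and look for patterns" step concrete, here is a minimal sketch of tallying failure categories across manually reviewed conversations. The record format, review notes, and category names are placeholders I made up for illustration, not Windmill's actual data model.

```python
from collections import Counter

# Hypothetical review notes: each entry is a conversation id plus a free-form
# note written while manually reviewing where the conversation went wrong.
reviewed = [
    {"id": "c-101", "note": "asked for feedback at the wrong time"},
    {"id": "c-102", "note": "user dismissed: 'I don't have any feedback'"},
    {"id": "c-103", "note": "assistant gave up after the first dismissal"},
    {"id": "c-104", "note": "user dismissed: 'I don't have any feedback'"},
]

def categorize(note: str) -> str:
    """Map a free-form review note to a coarse failure category.
    These categories are illustrative; yours should emerge from your own data."""
    if "dismissed" in note:
        return "user_dismissal"
    if "wrong time" in note:
        return "bad_timing"
    return "other"

# Count how often each failure category shows up in the reviewed sample.
counts = Counter(categorize(r["note"]) for r in reviewed)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

Even a rough tally like this tells you where to spend your next round of prompt fixes.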
Focus on Binary Decisions
One of the key insights from our conversation was the importance of making binary decisions rather than using complex scoring systems. When evaluating conversations, you want to ask: “Did this conversation achieve its intended outcome?”
Do Not Do This:
- “Rate this conversation on a scale of 1-5 for clarity”
- “Score these 12 different aspects of the interaction”
- “Evaluate the conversation across multiple dimensions”
The reason is simple: Binary decisions force you to be clear about what success looks like. They also make it easier to:
- Identify clear patterns in failures
- Make actionable improvements
- Agree on what constitutes a good interaction
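To make the contrast concrete, here is a minimal sketch of a binary judge prompt, written as a plain template string. The success definition and the `build_judge_prompt` helper are assumptions for illustration, not the prompt Max's team uses; the point is that the judge must answer PASS or FAIL and explain why, rather than scoring several dimensions.

```python
# A hedged sketch of a binary judge prompt. The goal definition below is an
# assumption about what "success" means for a feedback-collection assistant.
JUDGE_PROMPT = """You are reviewing a conversation between Windy, a peer-feedback
assistant, and a user.

A conversation PASSES only if the user ends up sharing concrete, usable feedback,
or a clear reason is captured for why feedback isn't available right now.
Otherwise it FAILS.

Conversation:
{transcript}

Answer with exactly one word on the first line: PASS or FAIL.
Then explain your reasoning in 2-3 sentences.
"""

def build_judge_prompt(transcript: str) -> str:
    """Fill the template with a single conversation transcript."""
    return JUDGE_PROMPT.format(transcript=transcript)
```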
Building Your Evaluation Process
Based on the challenges Max’s team faced, here’s a step-by-step process for getting started with multi-turn chat evaluations:
1. Collect Example Conversations
   - Gather real user interactions
   - Include the full context and metadata
   - Sample across different user types and scenarios
2. Identify Clear Success Criteria
   - What is the intended outcome of each conversation?
   - What makes a conversation successful from the user’s perspective?
   - What are the minimum requirements for a “pass”?
3. Perform Error Analysis
   - Review conversations with domain experts
   - Make binary pass/fail decisions
   - Write detailed explanations for each decision
4. Fix Obvious Issues First
   - Address clear patterns of failure
   - Implement simple fixes before building complex evaluation systems
   - Validate improvements with more manual review
5. Then Consider Automation
   - Build LLM-based evaluation only for well-understood patterns
   - Prioritize errors uncovered through error analysis
   - Validate judge agreement with domain experts so you can trust the results (see the sketch after this list)
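The last step hinges on checking the judge against human labels before you trust it. Here is a minimal sketch of that agreement check; the `labeled_examples`, the `llm_judge` callable, and the `judge_agreement` helper are hypothetical names, and the actual model call is left to whatever provider you use.

```python
from typing import Callable

# Hypothetical human-labeled set: transcripts plus the binary decision a
# domain expert made during manual error analysis.
labeled_examples = [
    {"transcript": "Windy: ... / User: I don't have any feedback on that. / ...",
     "human_label": "FAIL"},
    {"transcript": "Windy: ... / User: Sarah was great at unblocking the doc. / ...",
     "human_label": "PASS"},
]

def judge_agreement(examples, llm_judge: Callable[[str], str]) -> float:
    """Fraction of conversations where the LLM judge's PASS/FAIL matches the
    human label. `llm_judge` is whatever function wraps your model call."""
    matches = sum(
        int(llm_judge(ex["transcript"]) == ex["human_label"]) for ex in examples
    )
    return matches / len(examples)

# Usage (assuming you've written an `llm_judge` that returns "PASS" or "FAIL"):
# agreement = judge_agreement(labeled_examples, llm_judge)
# If agreement is low, refine the judge prompt or your pass/fail criteria
# before using the judge to score conversations at scale.
```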
A Real Example
Let’s look at a simplified example from Max’s context. Here’s how you might evaluate a peer feedback conversation:
Windy: "I noticed you worked with Sarah on the Q4 planning doc. How was that collaboration?"
User: "I don't have any feedback on that."
Windy: "No problem! Let me know if you have feedback later."
Pass/Fail Decision: Fail
Reasoning: The conversation failed because:
1. The timing or context was wrong (the user didn’t have meaningful feedback to share)
2. The assistant didn’t attempt to understand why feedback wasn’t available
3. No value was created from the interaction
This is much more actionable than scoring various aspects of the conversation on a 1-5 scale.
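One lightweight way to document such decisions is a structured record per conversation, so the transcript, the binary label, and the reasoning travel together. The field names below are an assumption for illustration, not a schema from Max's team.

```python
# A hedged sketch of how a single reviewed conversation might be recorded.
eval_record = {
    "conversation_id": "example-001",
    "transcript": [
        {"role": "assistant", "content": "I noticed you worked with Sarah on the "
                                         "Q4 planning doc. How was that collaboration?"},
        {"role": "user", "content": "I don't have any feedback on that."},
        {"role": "assistant", "content": "No problem! Let me know if you have feedback later."},
    ],
    "decision": "fail",  # binary pass/fail, not a 1-5 score
    "reasoning": (
        "Timing/context was wrong, the assistant didn't probe why feedback "
        "wasn't available, and no value was created from the interaction."
    ),
}
```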
When to Scale Up
Only after you have:
1. Clear patterns of what makes conversations successful
2. Documented examples of good and bad interactions
3. Specific criteria for pass/fail decisions
should you consider building more sophisticated evaluation systems. As Max’s team discovered, trying to automate evaluations before these fundamentals are in place leads to confusion and wasted effort.
Key Takeaways
- Start with manual error analysis before building automated evaluations
- Use binary pass/fail decisions to force clarity about success criteria
- Fix obvious issues before building complex evaluation systems
- Document your reasoning about why conversations succeed or fail
- Build automation only after you have clear patterns and criteria
Remember, the goal isn’t to build a perfect evaluation system. The goal is to consistently improve the quality of your AI. Looking carefully at your data is an essential prerequisite to setting up an evaluation system.
Resources
These resources can help you learn more about evaluating conversational AI:
- Your AI Product Needs Evals: A broader overview of evaluation approaches for AI products
- Creating a LLM-as-Judge That Drives Business Results: Detailed guidance on building LLM-based evaluation systems
- Who Validates the Validators?: Research on aligning LLM evaluations with human preferences
If you’d like to find out about future office hours, you can subscribe to my newsletter.