Tame Complexity By Scoping LLM Evals

LLMs, evals

Office hours discussion on managing evaluation complexity through proper scoping

Published December 29, 2024

Note

These are notes from my open office hours on LLM Evals, where I troubleshoot real issues companies are having with their evals. Each session is 20 minutes.

I spoke with Maggie from Sunday, a lawn and garden startup that offers personalized lawn care subscriptions. They’ve built a chatbot that helps customers with everything from product recommendations to subscription questions. Their experience highlighted a common challenge: how do you effectively evaluate an LLM application with broad scope?

Watch The Discussion

For those interested in the full context, here’s our complete 20-minute conversation:

The Challenge: Broad Surface Area

The team had already made significant progress with their evaluation approach:

  • Built a Python Shiny app for manual evaluation
  • Implemented an LLM-as-judge system
  • Achieved 80% alignment between human and LLM judgments
  • Categorized conversations into roughly 40 distinct topics
  • Created detailed critiques for different response types

However, they were struggling with consistency in their evaluations. “One in five are wrong,” Maggie noted. “… I worry about just letting that run in an automated way.”
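For concreteness, an alignment number like that 80% comes from comparing human and LLM-judge verdicts on the same sample of conversations. A minimal sketch in Python, with hypothetical pass/fail labels standing in for their actual annotations:

```python
# Minimal sketch: agreement between human and LLM-judge verdicts.
# The records and field names here are hypothetical, not Sunday's schema.
labeled = [
    {"conversation_id": 1, "human": "pass", "judge": "pass"},
    {"conversation_id": 2, "human": "fail", "judge": "pass"},
    {"conversation_id": 3, "human": "pass", "judge": "pass"},
    {"conversation_id": 4, "human": "pass", "judge": "pass"},
    {"conversation_id": 5, "human": "fail", "judge": "fail"},
]

agreement = sum(r["human"] == r["judge"] for r in labeled) / len(labeled)
print(f"Judge-human agreement: {agreement:.0%}")  # 80% on this toy sample
```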

Topic Distribution and Seasonal Patterns

A deeper look at their usage patterns revealed some important insights:

  1. 5-6 topics drove most conversations in any given season
  2. Topics shifted seasonally (e.g., fall questions about frost vs. spring questions about timing)
  3. While specific concerns changed with seasons, the underlying topics remained similar
  4. Example top topics for fall included:
    • Seeding timing
    • Renewal dates
    • Next year’s plan
    • Subscription questions
    • Weed management

Rather than trying to perfect evaluation across all 40 topics, we discussed several approaches:

1. Focus on High-Traffic Topics

Instead of trying to excel at everything, focus evaluation efforts on the 5-6 topics that drive most conversations. This doesn’t mean abandoning other topics, but rather acknowledging that some areas will be more polished than others.
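A quick way to find those high-traffic topics is to tally topic labels over recent conversations and keep the handful that dominate. The snippet below is a hypothetical sketch, not Sunday's actual pipeline:

```python
from collections import Counter

# Hypothetical conversation log, each entry already tagged with a topic.
conversations = [
    {"id": 1, "topic": "seeding timing"},
    {"id": 2, "topic": "renewal dates"},
    {"id": 3, "topic": "seeding timing"},
    {"id": 4, "topic": "weed management"},
    {"id": 5, "topic": "seeding timing"},
]

# Count conversations per topic and keep the few that drive most traffic.
topic_counts = Counter(c["topic"] for c in conversations)
top_topics = [topic for topic, _ in topic_counts.most_common(6)]
print(top_topics)  # focus evaluation effort here first
```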

2. Segment Evaluation by Topic Type

Some topics showed better alignment between human and LLM judgments than others. For example, verification questions performed well because the knowledge base contained clear information to answer them, while shipping questions were more problematic due to complex data formatting.
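Breaking the same alignment metric down by topic makes those weak segments visible. Another hedged sketch, with hypothetical topics and labels:

```python
from collections import defaultdict

# Hypothetical labeled examples, each tagged with a topic.
labeled = [
    {"topic": "verification", "human": "pass", "judge": "pass"},
    {"topic": "verification", "human": "pass", "judge": "pass"},
    {"topic": "shipping", "human": "fail", "judge": "pass"},
    {"topic": "shipping", "human": "pass", "judge": "pass"},
]

# Group agreement flags by topic, then report per-topic alignment.
by_topic = defaultdict(list)
for row in labeled:
    by_topic[row["topic"]].append(row["human"] == row["judge"])

for topic, matches in sorted(by_topic.items()):
    rate = sum(matches) / len(matches)
    print(f"{topic}: {rate:.0%} agreement over {len(matches)} examples")
```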

3. Consider Synthetic Data for Seasonal Patterns

For seasonal variations, Maggie realized they could generate synthetic data: “They’re going to ask different questions in the fall about seeding, timing, it being too late, or what about frost? But they’re going to ask the same… they’re still going to ask about seeding timing in the spring. It might just be, is it too early or is it too hot?”
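One way to act on that insight is to prompt an LLM for seasonal variants of the same underlying topic. The sketch below uses the OpenAI Python client; the prompt wording and model choice are illustrative assumptions, not Sunday's setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def seasonal_questions(topic: str, season: str, n: int = 5) -> str:
    """Generate synthetic customer questions for a topic in a given season."""
    prompt = (
        f"Write {n} realistic questions a lawn-care subscriber might ask "
        f"about '{topic}' in the {season}. One question per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Same underlying topic, different seasonal flavor.
print(seasonal_questions("seeding timing", "fall"))
print(seasonal_questions("seeding timing", "spring"))
```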

On Automation vs. Manual Review

One key question was about scaling evaluations: “If there’s thousands of conversations happening, can’t possibly read them all… How do you ensure quality?”

The reality is that you can’t completely automate away the need to look at data. However, you can be strategic:

  • Sample more heavily from areas with low judge alignment
  • Use specialized tests for specific failure modes
  • Leverage user feedback (they saw an 8-10% feedback rate)
  • Focus manual review on the most important topics
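For example, sampling for manual review can be weighted toward topics where the judge is least reliable. A minimal sketch with made-up alignment numbers:

```python
import random

# Hypothetical per-topic judge-human alignment; lower means the judge is
# less reliable there, so those areas deserve more human eyes.
alignment = {"verification": 0.95, "seeding timing": 0.85, "shipping": 0.60}

# Hypothetical recent conversations tagged with topics.
conversations = [
    {"id": 1, "topic": "shipping"},
    {"id": 2, "topic": "verification"},
    {"id": 3, "topic": "seeding timing"},
    {"id": 4, "topic": "shipping"},
]

# Weight each conversation by judge disagreement so weak areas are oversampled.
weights = [1.0 - alignment[c["topic"]] for c in conversations]
review_batch = random.choices(conversations, weights=weights, k=3)
print([c["id"] for c in review_batch])
```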

Key Takeaways

  1. Start narrow and expand: Perfect your approach on a few key topics before trying to handle everything
  2. Don’t expect perfect automation: Manual review will always play some role
  3. Be realistic about evaluation metrics: 80% alignment might be better than it sounds
  4. Consider your real requirements: Not every topic needs the same level of polish

This pragmatic approach allows teams to make real progress while acknowledging the inherent challenges of building broad-scope LLM applications.