Why AI evaluations (evals) are essential for product managers
With new models rolling out constantly and more features powered by AI, it has become a real question how to evaluate a new model's performance for the task at hand. I stumbled on a good YouTube video featuring an experienced product manager who breaks down how to approach evals. Here is a condensed summary; I still recommend checking out the original video, linked at the end of the post.
Condensed Summary
This discussion highlights why AI evaluations (evals) are essential for product managers building AI-driven features. It covers four eval types—code-based, human, LLM-as-judge, and user—and walks through a real-world example: creating a customer support chatbot for “On” running shoes. Key steps include drafting prompts with policy and product context, building a human-labeled golden dataset with rubrics, iteratively refining prompts, automating evals with LLMs, measuring match rates, and planning a scaled rollout with A/B tests and user feedback.
Detailed Bullet Points
📊 Types of AI Evaluations
• code-based eval: pass/fail checks via scripts (e.g., blocking competitor names); see the sketch after this list
• human eval: subject matter experts or PMs grade responses (thumbs up/down)
• LLM-as-judge eval: have an LLM label outputs at scale like a human judge
• user eval: gather downstream customer feedback as a business metric
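The code-based type is the easiest one to automate. Below is a minimal sketch of such a pass/fail check, assuming a hypothetical list of competitor brands and plain substring matching (the brand list and helper name are illustrative, not from the video):

```python
# Minimal sketch of a code-based eval: a pass/fail check that flags
# responses mentioning competitor brands. The brand list and helper
# are hypothetical examples, not from the video.
COMPETITOR_BRANDS = ["nike", "adidas", "hoka", "asics", "brooks"]

def passes_competitor_check(response: str) -> bool:
    """Return False if the bot response mentions a competitor brand."""
    lowered = response.lower()
    return not any(brand in lowered for brand in COMPETITOR_BRANDS)

if __name__ == "__main__":
    sample = "You could also try Nike Pegasus for road running."
    print(passes_competitor_check(sample))  # False -> this example fails the eval
```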
🔧 Practical Example: “On” Running Shoes Agent
• goal: build a customer support bot for On running shoes on their website
• platform: use Anthropic Workbench to generate and refine the system prompt
• context inputs: user question, product information (shoe names, specs), return policy
🛠 Prompt Development & Iteration
• initial prompt: describe role, include variables for question, product info, policy (a rough code sketch follows this list)
• use generate feature to auto-format placeholders and example dialogues
• test with a real question: “How do I return my On Cloud shoes after two months?”
• critique output: tone too formal, missing next-step details, may lack clarity on contacting support
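For reference, here is a rough sketch of what that initial prompt and a test call could look like outside the Workbench, using the Anthropic Python SDK. The product info, return policy text, prompt wording, and model name are illustrative placeholders, not the video's exact content:

```python
# Rough sketch of the initial system prompt with placeholder variables,
# sent via the Anthropic Python SDK. Product info, policy text, and model
# name are illustrative placeholders.
import anthropic

PRODUCT_INFO = "Cloud 5: lightweight road shoe ... Cloudmonster: max-cushion trainer ..."
RETURN_POLICY = "Unworn shoes may be returned within 30 days of delivery."

system_prompt = f"""You are a friendly customer support agent for On running shoes.
Answer using only the product information and return policy below.
If a question falls outside the policy, direct the customer to human support.

<product_info>{PRODUCT_INFO}</product_info>
<return_policy>{RETURN_POLICY}</return_policy>"""

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whichever model you are evaluating
    max_tokens=500,
    system=system_prompt,
    messages=[{"role": "user",
               "content": "How do I return my On Cloud shoes after two months?"}],
)
print(response.content[0].text)
```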
📄 Creating a Human-Labeled Golden Dataset
• define evaluation rubric columns (product knowledge, policy compliance, tone)
• set criteria for good/average/bad in each dimension
• populate spreadsheet with ~5–10 example interactions to start (a minimal format sketch follows this list)
• debate labels among team members to align definitions and edge cases
• document notes alongside bad examples (e.g., LLM math errors on time windows)
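A lightweight way to keep such a golden dataset is a spreadsheet or CSV with one column per rubric dimension plus a notes column. A hypothetical sketch of the format (column names, labels, and the example row are invented for illustration):

```python
# Hypothetical sketch of a human-labeled golden dataset. Column names,
# labels, and notes are illustrative; the rubric dimensions mirror the
# ones above (product knowledge, policy compliance, tone).
import csv

GOLDEN_ROWS = [
    {
        "question": "How do I return my On Cloud shoes after two months?",
        "bot_response": "Our policy allows returns within 30 days, so ...",
        "product_knowledge": "good",
        "policy_compliance": "good",
        "tone": "average",
        "notes": "Correct on the 30-day window, but the wording feels stiff.",
    },
    # ... add ~5-10 rows to start, then expand toward ~100 before production
]

with open("golden_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=GOLDEN_ROWS[0].keys())
    writer.writeheader()
    writer.writerows(GOLDEN_ROWS)
```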
🔄 Iterative Evaluation Process
• refine prompts based on rubric insights (e.g., inject friendliness, concise tone)
• use few-shot examples or system-level instructions to adjust style and rules (sketched in code after this list)
• test revised prompts on initial examples, re-grade against rubric
• repeat iteration loops until examples meet quality thresholds
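One common way to adjust style during these loops is to append tone instructions and a few-shot example of the desired voice to the system prompt before re-grading. A hedged sketch (the example dialogue and helper function are invented for illustration):

```python
# Hedged sketch of injecting tone via a system-level instruction plus a
# few-shot example. The dialogue and helper are invented for illustration.
STYLE_EXAMPLE = """
<example>
Customer: Can I swap my Cloudmonster for a different size?
Agent: Absolutely! As long as they're unworn and within 30 days, just start an
exchange from your order page and we'll take care of the rest. Anything else I
can help with?
</example>
"""

def refine_system_prompt(base_prompt: str) -> str:
    """Append tone instructions and a few-shot example, then re-run the evals."""
    return (
        base_prompt
        + "\nKeep responses warm and concise, and always end with a clear next step."
        + STYLE_EXAMPLE
    )

if __name__ == "__main__":
    print(refine_system_prompt("You are a support agent for On running shoes."))
```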
🤖 Automating with LLM-as-Judge & Match Rate
• upload golden dataset into an evaluation tool (e.g., Arize platform)
• configure eval prompts that mirror rubric criteria for automated labeling
• add explanation requests so the LLM justifies each rating
• run multiple examples and models (Claude, GPT) to compare outputs
• calculate match rate: percentage of LLM labels matching human labels (computed in the sketch after this list)
• identify dimensions with low match rates (e.g., tone agreed only 1/5 times)
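Match rate itself is simple agreement: per rubric dimension, the share of examples where the LLM judge's label equals the human label. A minimal sketch, assuming both label sets are lists of dicts keyed by dimension (the input format and sample labels are assumptions for illustration):

```python
# Minimal sketch of per-dimension match rate: the share of examples where
# the LLM judge's label equals the human label. Input format is assumed.
def match_rate(human_labels: list[dict], llm_labels: list[dict], dimension: str) -> float:
    matches = sum(
        1 for h, l in zip(human_labels, llm_labels) if h[dimension] == l[dimension]
    )
    return matches / len(human_labels)

human = [{"tone": "good"}, {"tone": "bad"}, {"tone": "good"}, {"tone": "average"}, {"tone": "good"}]
llm   = [{"tone": "good"}, {"tone": "good"}, {"tone": "bad"}, {"tone": "good"}, {"tone": "average"}]

print(f"tone match rate: {match_rate(human, llm, 'tone'):.0%}")  # 1/5 -> 20%
```

A low match rate on a dimension like tone usually means the rubric wording or the judge prompt needs tightening before the automated labels can be trusted.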
🚀 Scaling & Deployment Strategy
• internal testing phase: start with ~10 labeled examples for rapid iteration
• pre-production eval: expand to ~100 examples to boost statistical confidence
• A/B testing: route a small % of live traffic (1–10%) through the bot; see the bucketing sketch after this list
• dogfood: have employees use the bot to spot issues before wider launch
• user eval considerations: collect thumbs-up/down but interpret noise carefully—tie-break with deeper review
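For the A/B phase, a common pattern is to route a small, stable slice of traffic to the bot by hashing the user ID, so the same customer always gets the same experience. A hedged sketch (the 5% figure and the bucketing scheme are illustrative, within the 1–10% range mentioned above):

```python
# Hedged sketch of deterministic A/B bucketing for the rollout phase: hash
# the user ID into 100 buckets and send a small, stable percentage to the
# bot. The 5% figure and hashing scheme are illustrative, not from the video.
import hashlib

ROLLOUT_PERCENT = 5  # somewhere in the 1-10% range discussed above

def routed_to_bot(user_id: str) -> bool:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT

print(routed_to_bot("customer-12345"))  # same user always lands in the same bucket
```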
📝 Conclusions
• AI evals are foundational for delivering reliable AI products and mitigating hallucinations
• Four eval types cover technical, human, automated, and real-user feedback
• A structured loop—prompt draft, human labeling, rubric debate, automated eval, prompt refinement—drives quality
• Matching LLM-as-judge labels with human judgments ensures scalable, trustworthy assessments
• Phased rollout (internal → A/B → full) and continuous feedback help maintain performance and user satisfaction
Link to the video:
https://youtu.be/TL527yTpxlk?si=zR3CStdtZWQ5xsiZ