Why AI evaluations (evals) are essential for product managers
With new models rolling out constantly and more features powered by AI, it has become a real question how to evaluate a new model's performance for the task at hand. I stumbled on a good YouTube video featuring an experienced product manager who breaks down how to approach evals. Here is a condensed summary; I still recommend checking out the original video, linked at the end of the post.
Condensed Summary
This discussion highlights why AI evaluations (evals) are essential for product managers building AI-driven features. It covers four eval types—code-based, human, LLM-as-judge, and user—and walks through a real-world example: creating a customer support chatbot for “On” running shoes. Key steps include drafting prompts with policy and product context, building a human-labeled golden dataset with rubrics, iteratively refining prompts, automating evals with LLMs, measuring match rates, and planning a scaled rollout with A/B tests and user feedback.
Detailed Bullet Points
📊 Types of AI Evaluations
• code-based eval: pass/fail checks via scripts (e.g., blocking competitor names); see the sketch after this list
• human eval: subject matter experts or PMs grade responses (thumbs up/down)
• LLM-as-judge eval: have an LLM label outputs at scale like a human judge
• user eval: gather downstream customer feedback as a business metric
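The code-based type is the easiest one to automate. Below is a minimal sketch of such a pass/fail check, assuming a hypothetical list of competitor brands and plain substring matching (the brand list and helper name are illustrative, not from the video):

```python
# Minimal sketch of a code-based eval: a pass/fail check that flags
# responses mentioning competitor brands. The brand list and helper
# are hypothetical examples, not from the video.
COMPETITOR_BRANDS = ["nike", "adidas", "hoka", "asics", "brooks"]

def passes_competitor_check(response: str) -> bool:
    """Return False if the bot response mentions a competitor brand."""
    lowered = response.lower()
    return not any(brand in lowered for brand in COMPETITOR_BRANDS)

if __name__ == "__main__":
    sample = "You could also try Nike Pegasus for road running."
    print(passes_competitor_check(sample))  # False -> this example fails the eval
```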
🔧 Practical Example: “On” Running Shoes Agent
• goal: build a customer support bot for On running shoes on their website
• platform: use Anthropic Workbench to generate and refine the system prompt
• context inputs: user question, product information (shoe names, specs), return policy
🛠 Prompt Development & Iteration
• initial prompt: describe role, include variables for question, product info, policy (a rough code sketch follows this list)
• use generate feature to auto-format placeholders and example dialogues
• test with a real question: “How do I return my On Cloud shoes after two months?”
• critique output: tone too formal, missing next-step details, may lack clarity on contacting support
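For reference, here is a rough sketch of what that initial prompt and a test call could look like outside the Workbench, using the Anthropic Python SDK. The product info, return policy text, prompt wording, and model name are illustrative placeholders, not the video's exact content:

```python
# Rough sketch of the initial system prompt with placeholder variables,
# sent via the Anthropic Python SDK. Product info, policy text, and model
# name are illustrative placeholders.
import anthropic

PRODUCT_INFO = "Cloud 5: lightweight road shoe ... Cloudmonster: max-cushion trainer ..."
RETURN_POLICY = "Unworn shoes may be returned within 30 days of delivery."

system_prompt = f"""You are a friendly customer support agent for On running shoes.
Answer using only the product information and return policy below.
If a question falls outside the policy, direct the customer to human support.

<product_info>{PRODUCT_INFO}</product_info>
<return_policy>{RETURN_POLICY}</return_policy>"""

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whichever model you are evaluating
    max_tokens=500,
    system=system_prompt,
    messages=[{"role": "user",
               "content": "How do I return my On Cloud shoes after two months?"}],
)
print(response.content[0].text)
```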
📄 Creating a Human-Labeled Golden Dataset
• define evaluation rubric columns (product knowledge, policy compliance, tone)
• set criteria for good/average/bad in each dimension
• populate spreadsheet with ~5–10 example interactions to start (a minimal format sketch follows this list)
• debate labels among team members to align definitions and edge cases
• document notes alongside bad examples (e.g., LLM math errors on time windows)
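A lightweight way to keep such a golden dataset is a spreadsheet or CSV with one column per rubric dimension plus a notes column. A hypothetical sketch of the format (column names, labels, and the example row are invented for illustration):

```python
# Hypothetical sketch of a human-labeled golden dataset. Column names,
# labels, and notes are illustrative; the rubric dimensions mirror the
# ones above (product knowledge, policy compliance, tone).
import csv

GOLDEN_ROWS = [
    {
        "question": "How do I return my On Cloud shoes after two months?",
        "bot_response": "Our policy allows returns within 30 days, so ...",
        "product_knowledge": "good",
        "policy_compliance": "good",
        "tone": "average",
        "notes": "Correct on the 30-day window, but the wording feels stiff.",
    },
    # ... add ~5-10 rows to start, then expand toward ~100 before production
]

with open("golden_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=GOLDEN_ROWS[0].keys())
    writer.writeheader()
    writer.writerows(GOLDEN_ROWS)
```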
🔄 Iterative Evaluation Process
• refine prompts based on rubric insights (e.g., inject friendliness, concise tone)
• use few-shot examples or system-level instructions to adjust style and rules (sketched in code after this list)
• test revised prompts on initial examples, re-grade against rubric
• repeat iteration loops until examples meet quality thresholds
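One common way to adjust style during these loops is to append tone instructions and a few-shot example of the desired voice to the system prompt before re-grading. A hedged sketch (the example dialogue and helper function are invented for illustration):

```python
# Hedged sketch of injecting tone via a system-level instruction plus a
# few-shot example. The dialogue and helper are invented for illustration.
STYLE_EXAMPLE = """
<example>
Customer: Can I swap my Cloudmonster for a different size?
Agent: Absolutely! As long as they're unworn and within 30 days, just start an
exchange from your order page and we'll take care of the rest. Anything else I
can help with?
</example>
"""

def refine_system_prompt(base_prompt: str) -> str:
    """Append tone instructions and a few-shot example, then re-run the evals."""
    return (
        base_prompt
        + "\nKeep responses warm and concise, and always end with a clear next step."
        + STYLE_EXAMPLE
    )

if __name__ == "__main__":
    print(refine_system_prompt("You are a support agent for On running shoes."))
```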
🤖 Automating with LLM-as-Judge & Match Rate
• upload golden dataset into an evaluation tool (e.g., Arize platform)
• configure eval prompts that mirror rubric criteria for automated labeling
• add explanation requests so the LLM justifies each rating
• run multiple examples and models (Claude, GPT) to compare outputs
• calculate match rate: percentage of LLM labels matching human labels (computed in the sketch after this list)
• identify dimensions with low match rates (e.g., tone agreed only 1/5 times)
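Match rate itself is simple agreement: per rubric dimension, the share of examples where the LLM judge's label equals the human label. A minimal sketch, assuming both label sets are lists of dicts keyed by dimension (the input format and sample labels are assumptions for illustration):

```python
# Minimal sketch of per-dimension match rate: the share of examples where
# the LLM judge's label equals the human label. Input format is assumed.
def match_rate(human_labels: list[dict], llm_labels: list[dict], dimension: str) -> float:
    matches = sum(
        1 for h, l in zip(human_labels, llm_labels) if h[dimension] == l[dimension]
    )
    return matches / len(human_labels)

human = [{"tone": "good"}, {"tone": "bad"}, {"tone": "good"}, {"tone": "average"}, {"tone": "good"}]
llm   = [{"tone": "good"}, {"tone": "good"}, {"tone": "bad"}, {"tone": "good"}, {"tone": "average"}]

print(f"tone match rate: {match_rate(human, llm, 'tone'):.0%}")  # 1/5 -> 20%
```

A low match rate on a dimension like tone usually means the rubric wording or the judge prompt needs tightening before the automated labels can be trusted.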
🚀 Scaling & Deployment Strategy
• internal testing phase: start with ~10 labeled examples for rapid iteration
• pre-production eval: expand to ~100 examples to boost statistical confidence
• A/B testing: route a small % of live traffic (1–10%) through the bot; see the bucketing sketch after this list
• dogfood: have employees use the bot to spot issues before wider launch
• user eval considerations: collect thumbs-up/down but interpret noise carefully—tie-break with deeper review
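For the A/B phase, a common pattern is to route a small, stable slice of traffic to the bot by hashing the user ID, so the same customer always gets the same experience. A hedged sketch (the 5% figure and the bucketing scheme are illustrative, within the 1–10% range mentioned above):

```python
# Hedged sketch of deterministic A/B bucketing for the rollout phase: hash
# the user ID into 100 buckets and send a small, stable percentage to the
# bot. The 5% figure and hashing scheme are illustrative, not from the video.
import hashlib

ROLLOUT_PERCENT = 5  # somewhere in the 1-10% range discussed above

def routed_to_bot(user_id: str) -> bool:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT

print(routed_to_bot("customer-12345"))  # same user always lands in the same bucket
```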
📝 Conclusions
• AI evals are foundational for delivering reliable AI products and mitigating hallucinations
• Four eval types cover technical, human, automated, and real-user feedback
• A structured loop—prompt draft, human labeling, rubric debate, automated eval, prompt refinement—drives quality
• Matching LLM-as-judge labels with human judgments ensures scalable, trustworthy assessments
• Phased rollout (internal → A/B → full) and continuous feedback help maintain performance and user satisfaction
Link to the video:
https://youtu.be/TL527yTpxlk?si=zR3CStdtZWQ5xsiZ