Evaluating AI Products: How to Find The Right Metrics
Generic metrics, such as “hallucination” or “toxicity,” often miss domain-specific issues. Here's what the best AI teams track instead and how they leverage error analysis.
Hey, Paweł here. Welcome to the premium edition of The Product Compass Newsletter.
With 114,360+ PMs from companies like Meta, Amazon, Google, and Apple, this newsletter is the #1 most practical resource to learn and grow as an AI PM.
Here’s what you might have recently missed:
AI Agent Architectures: The Ultimate Guide With n8n Examples
Beyond Vibe Coding: No-Code B2C SaaS Template With Stripe Payments
Consider subscribing and upgrading your account for the full experience:
85% of AI initiatives fail (Gartner, 2024). I've been researching the topic for 3+ months, trying to understand how we can prevent that.
Experts like Hamel Husain, Shreya Shankar, Andrew Ng, and teams from Google’s PAIR emphasize that the teams that succeed obsess over analyzing, measuring, and improving in quick cycles.
While the experimentation mindset is core to product management, when working with generative AI it becomes even more critical than in traditional software:
But there is a problem.
Metrics promoted by eval vendors, like "hallucination," “helpfulness,” or "toxicity," are ineffective and too often miss domain-specific issues.
It turns out that the AI teams that succeed take a completely different approach. Rather than starting from the top (“let’s think about the metrics”), they:
Look at actual data (LLM traces)
Identify failure modes
Let app-specific metrics emerge bottom-up
And as we discussed in WTF is an AI Product Manager, evaluating AI is one of the AI PM’s core responsibilities.
So, in this issue, we discuss:
How to Perform Error Analysis
How to Turn Failure Modes Into App-Specific AI Metrics
🔒 How to Evaluate the Evaluators: TPR/Recall, TNR, Precision, F1-Score
🔒 LLM Prompts That Support Error Analysis
Let’s dive in.
1. How to Perform Error Analysis
Error analysis is the highest-ROI activity in AI product development.
The process is straightforward: look at the data, label LLM traces, and classify the errors into failure modes. Repeat until no significant new failure modes emerge:
In case you're wondering, LLM traces are just full records of an LLM pipeline execution: the user query, reasoning, tool calls, and the output.
For example (simplified):
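A hypothetical, simplified trace for a finance chatbot could look like this (the field names and values are illustrative, not a specific logging schema):

```python
# A simplified, hypothetical LLM trace for a finance chatbot
trace = {
    "trace_id": "t_0042",
    "user_query": "How much did I spend on subscriptions last month?",
    "reasoning": "Need last month's transactions filtered by category='subscriptions'.",
    "tool_calls": [
        {
            "tool": "query_transactions",
            "args": {"category": "subscriptions", "period": "2025-05"},
            "result": {"total": 64.97, "currency": "USD", "count": 5},
        }
    ],
    "output": "You spent $64.97 across 5 subscriptions last month.",
}
```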
As a rule of thumb, before going further, you need ~100 high-quality, diverse traces coded with failure modes. Those can be real data, synthetic data, or a mix of both.
Now, let’s discuss four steps of the Error Analysis Cycle in detail.
Step 1: (Optional) Generate Synthetic Traces
If you have data from production, that’s great. But often, when starting AI product development, there is no data you can rely on.
Here comes synthetic data generation.
Warning: Don't generate synthetic data without hypotheses about where AI might fail. You can build intuition by using the product. Involve domain experts, especially in complex domains.
Very complex domains aside, as an AI PM you should become a domain expert too. Crucially, this is not an engineering task.
Next:
Prerequisite: Start by defining at least 3 dimensions that represent where the app is likely to fail (your hypotheses)
Generate Tuples: You need 10-20 random combinations of those dimensions
Human Review: Remove duplicates and unrealistic combinations
Generate Queries: Generate a natural language query for each tuple
Human Review: Discard awkward or unrealistic queries
An example for a finance chatbot:
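The dimensions and values below are hypothetical, but they show how tuple generation might work in code:

```python
import itertools
import random

# Hypothetical failure-hypothesis dimensions for a finance chatbot
dimensions = {
    "persona": ["new user", "power user", "frustrated user"],
    "intent": ["spending summary", "budget advice", "dispute a charge"],
    "complexity": ["single account", "multiple accounts", "ambiguous date range"],
}

# Generate 10-20 random combinations (tuples) of those dimensions
random.seed(42)
all_tuples = list(itertools.product(*dimensions.values()))
tuples = random.sample(all_tuples, k=15)

# A tuple like ("frustrated user", "dispute a charge", "ambiguous date range")
# could later become a natural-language query such as:
# "I was charged twice for something a while back and nobody is helping me. Fix it."
for t in tuples[:3]:
    print(dict(zip(dimensions.keys(), t)))
```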
Finally, we run these synthetic queries through our LLM pipeline to generate traces.
In Chapter 4, I’ve shared LLM prompts to:
Prompt 1: Generate Synthetic Tuples
Prompt 2: Generate Synthetic Queries
Before we continue, I recommend "AI Evals For Engineers & PMs." It's a cohort-based course by Hamel Husain and Shreya Shankar.
I'm participating in the first edition alongside ~700 other students and getting a ton out of it so far. They go deep into AI evals without unnecessary jargon and engage in community discussions. Our homework assignments are challenging, but achievable.
The next, and final, live cohort starts on July 21. A special $800 discount for my community:
Step 2: Read and Open Code Traces
The next step uses the Open Coding technique from qualitative research.
For every LLM trace, write brief, descriptive notes capturing problems, surprises, and incorrect behaviors. At this stage, the data is messy and unstructured.
For example:
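The notes below are a hypothetical illustration of what open codes might look like:

```python
# Hypothetical open-coding notes: one short, messy observation per trace
open_codes = {
    "t_0042": "Total is correct, but the answer ignores which subscriptions the user asked about.",
    "t_0043": "Tool call used the wrong date range (calendar month vs. last 30 days).",
    "t_0044": "Confident answer with no tool call at all; the numbers look invented.",
    "t_0045": "Correct, but the tone is oddly formal for a frustrated user.",
}
```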
Step 3: Axial Coding, Refine Failure Modes
Next, we want to identify patterns. Cluster similar notes and let failure modes (error categories) naturally emerge.
For example:
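Clustering notes like the ones above could produce a small, hypothetical taxonomy such as:

```python
# Hypothetical failure modes emerging from clustered open codes (axial coding)
failure_modes = {
    "incomplete_answer": "Response ignores part of the user's question.",
    "wrong_date_range": "Tool call queries a different period than the user asked about.",
    "fabricated_numbers": "Figures are stated without a supporting tool call or source.",
}
```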
As Shreya Shankar and Hamel Husain note in their upcoming book, Application-Centric AI Evals for Engineers and Technical PMs:
“Axial coding requires careful judgment. When in doubt, consult a domain expert. The goal is to define a small, coherent, non-overlapping set of binary failure types, each easy to recognize and apply consistently during trace annotation.”
You can automate the process with an LLM. In that case, always review its output.
In Chapter 4, I’ve shared an LLM prompt that will help you Refine Failure Taxonomy.
Step 4: Re-Code Traces With Failure Modes
Go back and re-code the LLM traces with the new failure modes, then quantify how often each failure mode occurs. For example:
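The sketch below covers both steps, assuming the hypothetical failure modes from the previous section:

```python
from collections import Counter

# Hypothetical re-coded traces: one binary flag per failure mode
labels = [
    {"trace_id": "t_0042", "incomplete_answer": True,  "wrong_date_range": False, "fabricated_numbers": False},
    {"trace_id": "t_0043", "incomplete_answer": False, "wrong_date_range": True,  "fabricated_numbers": False},
    {"trace_id": "t_0044", "incomplete_answer": False, "wrong_date_range": False, "fabricated_numbers": True},
    {"trace_id": "t_0045", "incomplete_answer": True,  "wrong_date_range": False, "fabricated_numbers": False},
]

# Quantify: how many traces exhibit each failure mode
counts = Counter()
for row in labels:
    for mode, failed in row.items():
        if mode != "trace_id" and failed:
            counts[mode] += 1

total = len(labels)
for mode, n in counts.most_common():
    print(f"{mode}: {n}/{total} traces ({n / total:.0%})")
```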
With each new iteration and more traces, you'll refine definitions and merge or split categories.
Repeat the process (Steps 1-4) until no new failure modes emerge and the re-coding stops changing. We call this state theoretical saturation.
2. How to Turn Failure Modes Into App-Specific AI Metrics
Once you perform initial error analysis, you can define automated evaluators.
A good practice is for each evaluator to tackle a single failure mode and produce a single metric.
Step 1: Start by Analyzing the Failure Type
There are two failure types we need to consider:
Specification Failure:
Condition: Your instructions were unclear or incomplete.
Action: Fix the prompt first. Don't build an evaluator yet.
Generalization Failure:
Condition: LLM fails to apply clear, precise instructions correctly.
Action: These are prime candidates for automated evaluators.
For example:
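A hypothetical illustration for the finance chatbot: if the prompt never says which currency amounts should be reported in, mixed-currency answers are a specification failure, so you fix the prompt first. If the prompt clearly requires stating the time period used and the model still omits it in a noticeable share of traces, that's a generalization failure and a prime candidate for an automated evaluator.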
Step 2: Consider Two Types of App-Specific Evaluators
There are two types of automatic evaluators to consider:
Code-Based Evals
They are based on logic AI engineers write (e.g., a Python script)
They evaluate objective, rule-based checks, such as valid XML or SQL syntax and regex matches
They are fast, cheap, objective, and deterministic (see the sketch below)
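As an illustration, a code-based evaluator for the hypothetical "fabricated_numbers" failure mode above might check that every dollar amount in the answer appears in a tool result. A minimal sketch, not a complete implementation:

```python
import re

def numbers_are_grounded(answer: str, tool_results: list[dict]) -> bool:
    """Code-based check: every dollar amount in the answer must come from a tool result."""
    amounts = re.findall(r"\$(\d+(?:\.\d{2})?)", answer)
    grounded = {f"{v:.2f}" for r in tool_results for v in r.values() if isinstance(v, (int, float))}
    return all(f"{float(a):.2f}" in grounded for a in amounts)

# Passes: 64.97 appears in the tool result
assert numbers_are_grounded(
    "You spent $64.97 across 5 subscriptions last month.",
    [{"total": 64.97, "currency": "USD", "count": 5}],
)
```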
LLM-as-Judge Evals
This type of evaluator uses another LLM as a judge
They are perfect for complex or subjective checks
Each judge should target a single, narrow failure mode
Start with binary checks (pass/fail); this radically simplifies the setup and makes it far easier to reach agreement with human experts (see the sketch below)
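A minimal LLM-as-Judge sketch for one narrow failure mode; the prompt wording and the call_llm helper are hypothetical placeholders for whatever model client you use:

```python
JUDGE_PROMPT = """You are evaluating a finance chatbot's answer for ONE failure mode:
"wrong_date_range": the answer covers a different time period than the user asked about.

User query:
{query}

Chatbot answer:
{answer}

Respond with exactly one word: PASS if the time period matches the user's request, FAIL otherwise."""


def judge_wrong_date_range(query: str, answer: str, call_llm) -> bool:
    """Binary LLM-as-Judge check for a single failure mode. `call_llm` is any
    function that takes a prompt string and returns the model's text response."""
    verdict = call_llm(JUDGE_PROMPT.format(query=query, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```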
Importantly, LLM-as-Judge evals can involve significant cost. What to track and what to ignore is a product decision, not just an engineering one. Use common sense (number of errors, impact) and consider the tradeoffs.
Each automatic evaluator targets a single failure mode. And that's how you get the app-specific AI metrics you were looking for.
But that’s not all.
How do we know we can trust our Judges?
🔒 3. How to Evaluate the Evaluators: TPR/Recall, TNR, Precision, F1-Score