Evaluating AI Products: How to Find The Right Metrics
Generic metrics, such as “hallucination” or “toxicity,” often miss domain-specific issues. Here's what the best AI teams track instead and how they leverage error analysis.
Hey, Paweł here. Welcome to the premium edition of The Product Compass Newsletter.
With 114,360+ PMs from companies like Meta, Amazon, Google, and Apple, this newsletter is the #1 most practical resource to learn and grow as an AI PM.
Here’s what you might have recently missed:
AI Agent Architectures: The Ultimate Guide With n8n Examples
Beyond Vibe Coding: No-Code B2C SaaS Template With Stripe Payments
Consider subscribing and upgrading your account for the full experience:
85% of AI initiatives fail (Gartner, 2024). I've been researching the topic for 3+ months, trying to understand how we can prevent that.
Experts like Hamel Husain, Shreya Shankar, Andrew Ng, and teams from Google’s PAIR emphasize that the teams that succeed obsess over analyzing, measuring, and improving in quick cycles.
While the experimentation mindset is core to product management, when working with generative AI it becomes even more critical than in traditional software:
But there is a problem.
Metrics promoted by eval vendors, like "hallucination," “helpfulness,” or "toxicity," are ineffective and too often miss domain-specific issues.
It turns out that the AI teams that succeed take a completely different approach. Rather than starting from the top (“let’s think about the metrics”), they:
Look at actual data (LLM traces)
Identify failure modes
Let app-specific metrics emerge bottom-up
And as we discussed in WTF is an AI Product Manager, evaluating AI is one of the AI PM’s core responsibilities.
So, in this issue, we discuss:
How to Perform Error Analysis
How to Turn Failure Modes Into App-Specific AI Metrics
🔒 How to Evaluate the Evaluators: TPR/Recall, TNR, Precision, F1-Score
🔒 LLM Prompts That Support Error Analysis
Let’s dive in.
1. How to Perform Error Analysis
Error analysis is the highest-ROI activity in AI product development.
The process is straightforward: look at the data, label LLM traces, and classify the errors into failure modes. Repeat until no significant new failure modes emerge:
In case you're wondering, LLM traces are just full records of an LLM pipeline execution: the user query, reasoning, tool calls, and the output.
For example (simplified):
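A hypothetical, simplified trace for a finance chatbot could look like this (the field names and values are illustrative, not a specific logging schema):

```python
# A simplified, hypothetical LLM trace for a finance chatbot
trace = {
    "trace_id": "t_0042",
    "user_query": "How much did I spend on subscriptions last month?",
    "reasoning": "Need last month's transactions filtered by category='subscriptions'.",
    "tool_calls": [
        {
            "tool": "query_transactions",
            "args": {"category": "subscriptions", "period": "2025-05"},
            "result": {"total": 64.97, "currency": "USD", "count": 5},
        }
    ],
    "output": "You spent $64.97 across 5 subscriptions last month.",
}
```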
As a rule of thumb, before going further, you need ~100 high-quality, diverse traces coded with failure modes. Those can be real data, synthetic data, or a mix of both.
Now, let’s discuss four steps of the Error Analysis Cycle in detail.
Step 1: (Optional) Generate Synthetic Traces
If you have data from production, that’s great. But often, when starting AI product development, there is no data you can rely on.
Here comes synthetic data generation.
Warning: Don't generate synthetic data without hypotheses about where AI might fail. You can build intuition by using the product. Involve domain experts, especially in complex domains.
Very complex domains aside, as an AI PM you should become a domain expert too. Crucially, this is not an engineering task.
Next:
Prerequisite: Start by defining at least 3 dimensions that represent where the app is likely to fail (your hypotheses)
Generate Tuples: You need 10-20 random combinations of those dimensions
Human Review: Remove duplicates and unrealistic combinations
Generate Queries: Generate a natural language query for each tuple
Human Review: Discard awkward or unrealistic queries
An example for a finance chatbot:
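The dimensions and values below are hypothetical, but they show how tuple generation might work in code:

```python
import itertools
import random

# Hypothetical failure-hypothesis dimensions for a finance chatbot
dimensions = {
    "persona": ["new user", "power user", "frustrated user"],
    "intent": ["spending summary", "budget advice", "dispute a charge"],
    "complexity": ["single account", "multiple accounts", "ambiguous date range"],
}

# Generate 10-20 random combinations (tuples) of those dimensions
random.seed(42)
all_tuples = list(itertools.product(*dimensions.values()))
tuples = random.sample(all_tuples, k=15)

# A tuple like ("frustrated user", "dispute a charge", "ambiguous date range")
# could later become a natural-language query such as:
# "I was charged twice for something a while back and nobody is helping me. Fix it."
for t in tuples[:3]:
    print(dict(zip(dimensions.keys(), t)))
```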
Finally, we run these synthetic queries through our LLM pipeline to generate traces.
In Chapter 4, I’ve shared LLM prompts to:
Prompt 1: Generate Synthetic Tuples
Prompt 2: Generate Synthetic Queries
Before we continue, I recommend "AI Evals For Engineers & PMs." It's a cohort-based course by Hamel Husain and Shreya Shankar.
I'm participating in the first edition alongside ~700 other students and getting a ton out of it so far. They go deep into AI evals without unnecessary jargon and engage in community discussions. Our homework assignments are challenging, but achievable.
The next, and final, live cohort starts on July 21. A special $800 discount for my community:
Step 2: Read and Open Code Traces
The next step uses the Open Coding technique from qualitative research.
For every LLM trace, write brief, descriptive notes capturing problems, surprises, and incorrect behaviors. At this stage, the data is messy and unstructured.
For example:
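The notes below are a hypothetical illustration of what open codes might look like:

```python
# Hypothetical open-coding notes: one short, messy observation per trace
open_codes = {
    "t_0042": "Total is correct, but the answer ignores which subscriptions the user asked about.",
    "t_0043": "Tool call used the wrong date range (calendar month vs. last 30 days).",
    "t_0044": "Confident answer with no tool call at all; the numbers look invented.",
    "t_0045": "Correct, but the tone is oddly formal for a frustrated user.",
}
```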
Step 3: Axial Coding, Refine Failure Modes
Next, we want to identify patterns. Cluster similar notes and let failure modes (error categories) naturally emerge.
For example:
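Clustering notes like the ones above could produce a small, hypothetical taxonomy such as:

```python
# Hypothetical failure modes emerging from clustered open codes (axial coding)
failure_modes = {
    "incomplete_answer": "Response ignores part of the user's question.",
    "wrong_date_range": "Tool call queries a different period than the user asked about.",
    "fabricated_numbers": "Figures are stated without a supporting tool call or source.",
}
```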
As Shreya Shankar and Hamel Husain note in their upcoming book, Application-Centric AI Evals for Engineers and Technical PMs:
“Axial coding requires careful judgment. When in doubt, consult a domain expert. The goal is to define a small, coherent, non-overlapping set of binary failure types, each easy to recognize and apply consistently during trace annotation.”
You can automate the process with an LLM. In that case, always review its output.
In Chapter 4, I’ve shared an LLM prompt that will help you Refine Failure Taxonomy.
Step 4: Re-Code Traces With Failure Modes
Go back and re-code the LLM traces with the new failure modes, then quantify how often each failure mode occurs. For example:
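The sketch below covers both steps, assuming the hypothetical failure modes from the previous section:

```python
from collections import Counter

# Hypothetical re-coded traces: one binary flag per failure mode
labels = [
    {"trace_id": "t_0042", "incomplete_answer": True,  "wrong_date_range": False, "fabricated_numbers": False},
    {"trace_id": "t_0043", "incomplete_answer": False, "wrong_date_range": True,  "fabricated_numbers": False},
    {"trace_id": "t_0044", "incomplete_answer": False, "wrong_date_range": False, "fabricated_numbers": True},
    {"trace_id": "t_0045", "incomplete_answer": True,  "wrong_date_range": False, "fabricated_numbers": False},
]

# Quantify: how many traces exhibit each failure mode
counts = Counter()
for row in labels:
    for mode, failed in row.items():
        if mode != "trace_id" and failed:
            counts[mode] += 1

total = len(labels)
for mode, n in counts.most_common():
    print(f"{mode}: {n}/{total} traces ({n / total:.0%})")
```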
With each new iteration and more traces, you'll refine definitions and merge or split categories.
Repeat the process (Steps 1-4) until no new failure modes emerge and the re-coding stops changing. We call this state theoretical saturation.
2. How to Turn Failure Modes Into App-Specific AI Metrics
Once you perform initial error analysis, you can define automated evaluators.
A good practice is for each evaluator to tackle a single failure mode and produce a single metric.
Step 1: Start by Analyzing the Failure Type
There are two failure types we need to consider:
Specification Failure:
Condition: Your instructions were unclear or incomplete.
Action: Fix the prompt first. Don't build an evaluator yet.
Generalization Failure:
Condition: LLM fails to apply clear, precise instructions correctly.
Action: These are prime candidates for automated evaluators.
For example:
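A hypothetical illustration for the finance chatbot: if the prompt never says which currency amounts should be reported in, mixed-currency answers are a specification failure, so you fix the prompt first. If the prompt clearly requires stating the time period used and the model still omits it in a noticeable share of traces, that's a generalization failure and a prime candidate for an automated evaluator.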
Step 2: Consider Two Types of App-Specific Evaluators
There are two types of automatic evaluators to consider:
Code-Based Evals
They are based on logic AI engineers write (e.g., a Python script)
They evaluate objective, rule-based checks, such as valid XML or SQL syntax and regex matches
They are fast, cheap, objective, and deterministic (see the sketch below)
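As an illustration, a code-based evaluator for the hypothetical "fabricated_numbers" failure mode above might check that every dollar amount in the answer appears in a tool result. A minimal sketch, not a complete implementation:

```python
import re

def numbers_are_grounded(answer: str, tool_results: list[dict]) -> bool:
    """Code-based check: every dollar amount in the answer must come from a tool result."""
    amounts = re.findall(r"\$(\d+(?:\.\d{2})?)", answer)
    grounded = {f"{v:.2f}" for r in tool_results for v in r.values() if isinstance(v, (int, float))}
    return all(f"{float(a):.2f}" in grounded for a in amounts)

# Passes: 64.97 appears in the tool result
assert numbers_are_grounded(
    "You spent $64.97 across 5 subscriptions last month.",
    [{"total": 64.97, "currency": "USD", "count": 5}],
)
```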
LLM-as-Judge Evals
This type of evaluator uses another LLM as a judge
They are perfect for complex or subjective checks
Each judge should target a single, narrow failure mode
Start with binary checks (pass/fail); this radically simplifies the setup and makes it far easier to reach agreement with human experts (see the sketch below)
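A minimal LLM-as-Judge sketch for one narrow failure mode; the prompt wording and the call_llm helper are hypothetical placeholders for whatever model client you use:

```python
JUDGE_PROMPT = """You are evaluating a finance chatbot's answer for ONE failure mode:
"wrong_date_range": the answer covers a different time period than the user asked about.

User query:
{query}

Chatbot answer:
{answer}

Respond with exactly one word: PASS if the time period matches the user's request, FAIL otherwise."""


def judge_wrong_date_range(query: str, answer: str, call_llm) -> bool:
    """Binary LLM-as-Judge check for a single failure mode. `call_llm` is any
    function that takes a prompt string and returns the model's text response."""
    verdict = call_llm(JUDGE_PROMPT.format(query=query, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```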
Importantly, LLM-as-Judge evals can involve significant cost. What to track and what to ignore is a product decision, not just an engineering one. Use common sense (number of errors, impact) and consider the tradeoffs.
Each automatic evaluator targets a single failure mode. And that's how you get the app-specific AI metrics you were looking for.
But that’s not all.
How do we know we can trust our Judges?
🔒 3. How to Evaluate the Evaluators: TPR/Recall, TNR, Precision, F1-Score