The Product Compass

Mastering AI Evals: A Complete Guide for PMs

AI Evals are emerging as the top skill for AI PMs. Best practices from 30+ companies and ready-to-use AI Eval templates.

Paweł Huryn and Hamel Husain
Apr 26, 2025

Hey, Paweł here. Welcome to the free archived edition of The Product Compass Newsletter.

With 107,800+ PMs from companies like Meta, Amazon, Google, and Apple, this newsletter is the #1 source for learning and growth as an AI PM.


Recently, subscribers kept asking me about AI Evals. Many say it's the most critical element of any AI initiative. Evals are emerging as the top skill for AI PMs.

As Garry Tan, CEO of Y Combinator, said:

"Evals are emerging as the real moat for AI startups"

Today’s guest is Hamel Husain, a recognized ML expert with 20 years of experience. He’s worked with companies like Airbnb and GitHub, where he led early LLM research used by OpenAI for code understanding.

Hamel has also led and contributed to numerous popular open-source machine-learning tools.

Currently, he works as an independent consultant helping companies improve AI products through evals. Hamel is widely recognized for his expertise on evals and has unique perspectives on the topic.

In today’s issue, we discuss:

  1. Why Do We Need AI Evals

  2. AI Evals Flywheel: Virtuous Cycle

  3. Three Levels of AI Evaluation

  4. AI Eval Metrics: Bottom-Up vs. Top-Down Analysis

  5. Three Free Superpowers Eval Systems Unlock

  6. AI Eval Templates to Download

  7. Conclusion


Before we proceed, I’d like to recommend the cohort-based course AI Evals For Engineers & PMs.

I participated in the first cohort together with 700+ AI engineers and PMs. I have no doubt that evals are something PMs should think seriously about. I agree with Teresa Torres:

[Image: LinkedIn post by Teresa Torres. Source: LinkedIn]

Missed it? The next cohort: July 21 - Aug 16, 2025

A special $945 discount is available for our community.


1. Why Do We Need AI Evals

Hey, Hamel here. I started working with language models five years ago when I led the team that created CodeSearchNet, a precursor to GitHub Copilot.

Since then, I’ve seen many successful and unsuccessful approaches to building LLM products. I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.

Here’s a common scene from my consulting work:

[Image: Why Do We Need AI Evals]

This scene has played out dozens of times over the last two years. Teams invest weeks building complex AI systems, but can’t tell me if their changes are helping or hurting.

This isn’t surprising. With new tools and frameworks emerging weekly, it’s natural to focus on tangible things we can control: which vector database to use, which LLM provider to choose, or which agent framework to adopt.

But after helping 30+ companies build AI products, I’ve discovered the teams who succeed barely talk about tools at all. Instead, they obsess over measurement and iteration.

In the next section, I walk through a case study from one of my clients in which evals dramatically improved the AI product.


2. AI Evals Flywheel: Virtuous Cycle

Like software engineering, success with AI hinges on how fast you can iterate.

You must have processes and tools for:

  1. Evaluating quality (e.g., tests)

  2. Debugging issues (e.g., logging & inspecting data)

  3. Changing the product’s behavior (e.g., prompt engineering, fine-tuning, coding)

Doing all three activities well creates a virtuous cycle that differentiates great AI products from mediocre ones.

If you streamline your evaluation process, all other activities become easy.

This is very similar to how tests in software engineering pay massive dividends in the long term despite requiring up-front investment.
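To make the comparison with software tests concrete, here is a minimal sketch of what "evaluating quality with tests" could look like. It is an illustration under stated assumptions, not the system described in this post: `fake_assistant`, `EvalCase`, `run_evals`, and the two example prompts are all hypothetical names, and the canned responses stand in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def fake_assistant(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned text."""
    canned = {
        "Reply to the customer by name: Ada": "Hello Ada, thanks for reaching out.",
        "Draft a follow-up email": "Hi {first_name}, just following up.",
    }
    return canned.get(prompt, "")


def run_evals(assistant: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case once and return a mapping of case name -> passed."""
    return {case.name: case.check(assistant(case.prompt)) for case in cases}


CASES = [
    # The reply should address the customer by name.
    EvalCase("uses_customer_name", "Reply to the customer by name: Ada",
             lambda out: "Ada" in out),
    # The draft should not leak an unfilled template placeholder.
    EvalCase("no_unfilled_placeholder", "Draft a follow-up email",
             lambda out: "{" not in out),
]

results = run_evals(fake_assistant, CASES)
pass_rate = sum(results.values()) / len(results)

for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
print(f"pass rate: {pass_rate:.0%}")
```

Checks like these are cheap to write and rerun after every prompt or code change, which is what makes fast iteration possible. In a real product, the canned function would be replaced by the actual assistant and the cases would come from observed failures.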

To ground this post in a real-world situation, I’ll walk through a case study in which we built a system for rapid improvement.

Case Study: Lucy, A Real Estate AI Assistant

Rechat is a SaaS application that allows real estate professionals to perform various tasks, such as managing contracts, searching for listings, building creative assets, managing appointments, and more in one place.

Rechat’s AI assistant, Lucy, is a canonical AI product: a conversational interface that obviates the need to click, type, and navigate the software.

In Lucy’s early stages, prompt engineering produced rapid progress. But as Lucy’s surface area expanded, its performance plateaued:

  • Addressing one failure mode led to the emergence of others

  • There was limited visibility into the AI system’s effectiveness across tasks

  • Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples

We faced a problem: how could we systematically improve the AI?

To break through this plateau, we created a systematic approach to improving Lucy focused on evaluation:

[Diagram: AI Evals Flywheel: Virtuous Cycle]
This diagram is a best-faith effort to illustrate my mental model for improving AI systems. In reality, the process might be different.

Rigorous and systematic evaluation is the most important part of the whole system. You should spend most of your time making your evaluation more robust and streamlined.
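One way to get visibility across tasks, and to notice when fixing one failure mode introduces another, is to track pass rates per task and compare them between versions. The sketch below is a hypothetical illustration, not Rechat's implementation: the task names (`listing_search`, `scheduling`), the sample results, and the helpers `pass_rate_by_task` and `find_regressions` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_task(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (task, passed) pairs into a pass rate per task."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task, passed in records:
        totals[task] += 1
        passes[task] += int(passed)
    return {task: passes[task] / totals[task] for task in totals}


def find_regressions(before: Dict[str, float], after: Dict[str, float],
                     tolerance: float = 0.0) -> Dict[str, Tuple[float, float]]:
    """Return tasks whose pass rate dropped by more than `tolerance`."""
    return {
        task: (before[task], after[task])
        for task in before
        if task in after and before[task] - after[task] > tolerance
    }


# Hypothetical eval results for two prompt versions (made-up data).
old_prompt = [("listing_search", True), ("listing_search", False),
              ("scheduling", True), ("scheduling", True)]
new_prompt = [("listing_search", True), ("listing_search", True),
              ("scheduling", True), ("scheduling", False)]

before = pass_rate_by_task(old_prompt)
after = pass_rate_by_task(new_prompt)
regressions = find_regressions(before, after)

for task, (old, new) in regressions.items():
    print(f"REGRESSION {task}: {old:.0%} -> {new:.0%}")
```

In this made-up data, the new prompt improves `listing_search` but hurts `scheduling`. A single aggregate pass rate would hide that trade-off, since it is 75% for both versions.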

I refer to the components of this system in the next section.


3. Three Levels of AI Evaluation

A guest post by Hamel Husain
I am a machine learning engineer with over 20 years of experience. More about me @ https://hamel.dev
© 2025 Paweł Huryn