The Product Compass

AI Product Management

The Ultimate Guide to Fine-Tuning for PMs

Your playbook: when to fine-tune, when to stick with RAG, and how to practice the exact methods top AI teams use (SFT, DPO, ORPO). Step-by-step. No coding.

Paweł Huryn
Sep 27, 2025

Fine-tuning is one of the most powerful techniques product teams can use to adapt a model (e.g., an LLM) to their needs.

But it’s easy to confuse it with RAG or context engineering.

What’s the difference? How do you decide which method to use? And how can you practice fine-tuning starting today?

In today’s issue, we discuss:

  1. The Big Picture: What is Fine-Tuning?

  2. Fine-Tuning vs. Reinforcement Learning

  3. Fine-Tuning (Gradient-Based Methods) in Depth

  4. 🔒Step-by-Step: How to Teach an LLM to Do Request Classification Using SFT

  5. 🔒Step-by-Step: How to Make an LLM More Human-Like Using DPO

  6. 🔒How to Choose the Right Fine-Tuning Approach


Hey, it’s Paweł 🙂 Welcome to the Product Compass, the #1 actionable AI PM newsletter. Every week I share research and hands-on guides to help you:

  • Break into the top 10% of AI PMs

  • Learn by building, not just reading

  • Future-proof your career, whatever comes next

In case you missed it, here are some of the latest guides:

  • The Ultimate AI PM Learning Roadmap

  • The Ultimate Guide to AI Agents for PMs

  • A Guide to Context Engineering for PMs

  • The Ultimate Guide to AI Observability and Evaluation Platforms

  • How to Build, Deploy, And Scale Your AI Product Strategy

Consider upgrading your subscription so we can dive deeper together.


1. The Big Picture: What is Fine-Tuning?

At its core, fine-tuning is about teaching a model new habits so it becomes better at a specific task than a general model.

Why should you care as a Product Manager?

Fine-tuning is one of the key levers you have to make a general model fit your product.

Most of the time, prompting and RAG are enough. You just provide the model with the right context. That’s faster, often cheaper, and easier to maintain.

But there are many cases where RAG and prompting hit their limits:

  • You need a consistent style, brand voice, or complex structure.

  • You want to efficiently classify inputs (e.g., inquiry type) at scale.

  • You struggle to eliminate mistakes (failure modes) with prompting.

  • The prompt becomes too long or too complex.

Imagine you’re playing basketball for the first time. Reading a book about how to throw a ball wouldn’t be enough. You need to practice and build muscle memory.

What is Fine-Tuning

In fine-tuning, we train the model using examples. You directly show the model how it should respond and, depending on the specific FT method, what to avoid. This helps the model internalize the desired behavior.

As a result, a smaller, specialized model might outperform a larger and, theoretically, smarter off-the-shelf model.


2. Fine-Tuning vs. Reinforcement Learning

There are two primary ways to adjust the model to our needs: Reinforcement Learning (RL) and Fine-Tuning. We already discussed the latter.


In Reinforcement Learning (RL), rather than studying examples, the model explores different options, gets feedback, and improves. Well-known examples include training an AI to play chess or Go, where wins and losses are easy to score.

What is Reinforcement Learning (RL)

A special type of RL is Reinforcement Learning from Human Feedback (RLHF), where model responses are assessed by human labelers.

Reinforcement Learning is powerful but also heavyweight and costly. It’s commonly used by AI labs. Fine-tuning is much lighter and cheaper. It’s the primary lever for product teams.


3. Fine-Tuning (Gradient-Based Methods) in Depth


As discussed previously, fine-tuning is about training the model using examples. We typically use JSONL files that contain one training example in each line.

Let’s discuss the common fine-tuning methods together with example JSONL files.

Group 1: Supervised Fine-Tuning (SFT)

SFT is the most common approach to fine-tuning. The model learns how to imitate good examples.

It works best when you have one correct answer per input. One perfect use case is text classification.

An example JSONL file:

An example JSONL file for Supervised Fine-Tuning (SFT)

Note: I wrapped lines visually, but a real JSONL file would have two lines.
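To make this concrete, here's a minimal, made-up sketch of what such a file could look like, using a simple sentiment-classification task. The messages and labels are invented, and the exact schema (e.g., a messages array vs. prompt/completion fields) depends on the platform you fine-tune with:

```
{"messages": [{"role": "system", "content": "Classify the customer message as POSITIVE, NEUTRAL, or NEGATIVE."}, {"role": "user", "content": "The new dashboard is fantastic, great job!"}, {"role": "assistant", "content": "POSITIVE"}]}
{"messages": [{"role": "system", "content": "Classify the customer message as POSITIVE, NEUTRAL, or NEGATIVE."}, {"role": "user", "content": "I still can't log in after yesterday's update."}, {"role": "assistant", "content": "NEGATIVE"}]}
```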

SFT is also commonly used as a first step before applying other fine-tuning methods.

Group 2: Preference-based Fine-Tuning

Preference-based fine-tuning is a group of methods where, instead of just imitating good examples, the model learns what to prefer.

There are three major approaches:

Approach 1: Direct Preference Optimization (DPO)

DPO is the second most common fine-tuning method. It compares your fine-tuned model to a frozen copy of the original model (“reference model”) to prevent drift.

To perform DPO, we must create a dataset of examples and for each define:

  • prompt (input),

  • chosen (preferred) output,

  • rejected (non-preferred) output.

An example JSONL file in the Hugging Face format:

An example JSONL file for DPO (Direct Preference Optimization), Hugging Face format
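For illustration, a minimal made-up sketch with one example per line, using the prompt / chosen / rejected fields listed above:

```
{"prompt": "Explain what fine-tuning is in one sentence.", "chosen": "Fine-tuning trains an existing model on your own examples so it gets better at a specific task.", "rejected": "Fine-tuning is a process involving gradient-based optimization of pretrained transformer parameters."}
```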

OpenAI requires a slightly more complex structure (even GPT-5 makes mistakes here):

  • input → a dictionary with messages (system + user).

  • preferred_output → an array of assistant messages.

  • non_preferred_output → an array of assistant messages.

An example JSONL file for DPO (Direct Preference Optimization), OpenAI format
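A made-up sketch following the structure described above (double-check OpenAI's fine-tuning docs for the exact current schema, which may include additional fields):

```
{"input": {"messages": [{"role": "system", "content": "You are a friendly support assistant."}, {"role": "user", "content": "How do I reset my password?"}]}, "preferred_output": [{"role": "assistant", "content": "Sure! Click 'Forgot password' on the login screen and follow the emailed instructions."}], "non_preferred_output": [{"role": "assistant", "content": "Password resets are handled via the account management subsystem."}]}
```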

Approach 2: Odds Ratio Preference Optimization (ORPO)

ORPO is lighter and cheaper because no reference model is required. The trade-off is that it's less stable: the model might drift.

It uses the same format and structure as DPO:

  • prompt (input),

  • chosen (preferred) output,

  • rejected (non-preferred) output.

Just to visualize, an example JSONL file in the Hugging Face format:

An example JSONL file for ORPO (Odds Ratio Preference Optimization), Hugging Face format
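Again, a made-up sketch, identical in shape to the DPO Hugging Face example:

```
{"prompt": "Summarize our refund policy in plain language.", "chosen": "You can get a full refund within 30 days, no questions asked.", "rejected": "Refunds are governed by the applicable provisions of the Terms of Service."}
```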

Approach 3: Kahneman-Tversky Optimization (KTO)

KTO is a fine-tuning method inspired by behavioral economics (in particular, loss aversion). It punishes bad outputs more than it rewards good ones.

It’s useful when your feedback signals are noisy or ambiguous (for example, clicks, likes, or upvotes that don’t always reflect true quality).

But there’s a trade-off: models trained with KTO often converge toward “safe, average answers.” To get real value, you need a wide range of positive examples so the model doesn’t just play it safe.

An example JSONL file for KTO (Kahneman-Tversky Optimization), Hugging Face format

In the screenshot above:

  • label: 1 means the response is preferred/positive

  • label: 0 means not preferred/negative

That might resemble datasets for DPO/ORPO, but we could use other values, too:

  • If the model is configured for classification, it expects integers.

  • If it’s configured for regression (predicting continuous values), you can use floats (like 0.73 or 4.2).
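To make that concrete, a made-up sketch with one labeled completion per line (field names follow the Hugging Face-style example above; the prompts, completions, and labels are invented):

```
{"prompt": "Write a one-line release note for the new search filter.", "completion": "You can now filter search results by date, owner, and tag.", "label": 1}
{"prompt": "Write a one-line release note for the new search filter.", "completion": "Various improvements were made to search.", "label": 0}
```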

I haven’t tested KTO. It’s good to know where to start in case you need it, but the details are beyond the scope of this post.

Fine-Tuning Implementation Styles

Regardless of the method (SFT, DPO, ORPO, KTO), there are two main ways to implement fine-tuning:


Way 1: Full Fine-Tuning

Updates all original weights. Very flexible, but slow, expensive, and storage-heavy. Mostly used in research, not by product teams.

Way 2: Parameter-Efficient Fine-Tuning (PEFT)

Adds small adapters (LoRA/QLoRA) instead of changing everything. It’s much cheaper, faster, and lighter to store, though less expressive for edge cases.

In practice, PEFT is what most teams use today. It gives you 80-90% of the benefits of full fine-tuning at a fraction of the cost.


Next, for premium subscribers only, we discuss:

  1. 🔒Step-by-Step: How to Teach an LLM Request Classification Using SFT

  2. 🔒Step-by-Step: How to Make an LLM More Human-Like With DPO

  3. 🔒How to Choose the Right Fine-Tuning Approach


4. Step-by-Step: How to Teach an LLM to Do Request Classification Using SFT

Step 1: Prepare a dataset

Request classification seems like an easy task, but based on my research, most models get only 65-75% accuracy (80-85% even with a few few-shot examples) due to edge cases.

At the same time, according to my research, a fine-tuned model should reach 90%-95% accuracy pretty reliably.

Let’s test if that’s the case.
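To give a sense of what this step produces, a single training line for request classification might look like the following made-up sketch (the request and category labels are hypothetical, and the exact schema depends on the platform you fine-tune on):

```
{"messages": [{"role": "system", "content": "Classify the customer request as BILLING, TECHNICAL, ACCOUNT, or OTHER."}, {"role": "user", "content": "I was charged twice for my subscription this month."}, {"role": "assistant", "content": "BILLING"}]}
```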
