The Ultimate Guide to Fine-Tuning for PMs
Your playbook: when to fine-tune, when to stick with RAG, and how to practice the exact methods top AI teams use (SFT, DPO, ORPO). Step-by-step. No coding.
Fine-tuning is one of the most powerful techniques product teams can easily use to adapt a model (e.g., an LLM) to their needs.
But it’s easy to confuse it with RAG or context engineering.
What’s the difference? How do you decide which method to use? And how can you practice fine-tuning starting today?
In today’s issue, we discuss:
The Big Picture: What is Fine-Tuning?
Fine-Tuning vs. Reinforcement Learning
Fine-Tuning (Gradient-Based Methods) in Depth
🔒Step-by-Step: How to Teach an LLM to Do Request Classification Using SFT
🔒Step-by-Step: How to Make an LLM More Human-Like Using DPO
🔒How to Choose the Right Fine-Tuning Approach
Hey, it’s Paweł 🙂 Welcome to the Product Compass, the #1 actionable AI PM newsletter. Every week I share research and hands-on guides to help you:
Break into the top 10% of AI PMs
Learn by building, not just reading
Future-proof your career, whatever comes next
1. The Big Picture: What is Fine-Tuning?
At its core, fine-tuning is about teaching a model new habits so it becomes better at a specific task than a general model.
Why should you care as a Product Manager?
Fine-tuning is one of the key levers you have to make a general model fit your product.
Most of the time, prompting and RAG are enough. You just provide the model with the right context. That’s faster, often cheaper, and easier to maintain.
But there are many cases where RAG and prompting hit their limits:
You need a consistent style, brand voice, or complex structure.
You want to efficiently classify inputs (e.g., inquiry type) at scale.
You struggle to eliminate mistakes (failure modes) with prompting.
The prompt becomes too long or too complex.
Imagine you’re playing basketball for the first time. Reading a book about how to throw a ball wouldn’t be enough. You need to practice and build muscle memory.
In fine-tuning, we train the model using examples. You directly show the model how it should respond and, depending on the specific FT method, what to avoid. This helps the model internalize the desired behavior.
As a result, a smaller, specialized model might outperform a larger and, theoretically, smarter off-the-shelf model.
2. Fine-Tuning vs. Reinforcement Learning
There are two primary ways to adapt a model to your needs: Reinforcement Learning (RL) and fine-tuning. We already discussed the latter.
In Reinforcement Learning (RL), rather than studying examples, the model explores different options, gets feedback, and improves. Well-known examples include training an AI to play chess or Go, where wins and losses are easy to score.
A special type of RL is Reinforcement Learning from Human Feedback (RLHF), where model responses are assessed by human labelers.
Reinforcement Learning is powerful but also heavyweight and costly. It’s commonly used by AI labs. Fine-tuning is much lighter and cheaper. It’s the primary lever for product teams.
3. Fine-Tuning (Gradient-Based Methods) in Depth
As discussed previously, fine-tuning is about training the model using examples. We typically use JSONL files that contain one training example in each line.
Let’s discuss the common fine-tuning methods together with example JSONL files.
Group 1: Supervised Fine-Tuning (SFT)
SFT is the most common approach to fine-tuning. The model learns how to imitate good examples.
It works best when you have one correct answer per input. One perfect use case is text classification.
An example JSONL file:
Note: I wrapped lines visually, but a real JSONL file would have two lines.
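A minimal sketch of what such a file could look like for request classification, using the OpenAI chat fine-tuning format (the categories, system prompt, and messages are hypothetical placeholders):

```json
{"messages": [{"role": "system", "content": "Classify the support request as one of: billing, technical, account, other."}, {"role": "user", "content": "I was charged twice for my subscription this month."}, {"role": "assistant", "content": "billing"}]}
{"messages": [{"role": "system", "content": "Classify the support request as one of: billing, technical, account, other."}, {"role": "user", "content": "The app crashes every time I open the settings screen."}, {"role": "assistant", "content": "technical"}]}
```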
SFT is also commonly used as a first step before applying other fine-tuning methods.
Group 2: Preference-based Fine-Tuning
Preference-based fine-tuning is a group of methods where, instead of just imitating good examples, the model learns which outputs to prefer.
There are three major approaches:
Approach 1: Direct Preference Optimization (DPO)
DPO is the second most common fine-tuning method. The model learns to prefer the chosen response over the rejected one, while being compared against a frozen copy of the original model (the “reference model”) to prevent drift.
To perform DPO, we create a dataset of examples and define, for each one:
prompt (input),
chosen (preferred) output,
rejected (non-preferred) output.
An example JSONL file in the Hugging Face format:
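A minimal sketch, assuming the prompt/chosen/rejected columns that Hugging Face TRL’s DPO trainer expects (the content is hypothetical):

```json
{"prompt": "Write a one-sentence apology for a delayed order.", "chosen": "I'm really sorry your order is running late; we're on it and will send you an update within the hour.", "rejected": "The order is delayed. It is what it is."}
```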
OpenAI requires a slightly more complex structure (even GPT-5 makes mistakes here); see the sketch after this list:
input → a dictionary with messages (system + user).
preferred_output → an array of assistant messages.
non_preferred_output → an array of assistant messages.
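Putting those three fields together, a single-record sketch of the OpenAI preference format (the messages are hypothetical placeholders):

```json
{"input": {"messages": [{"role": "system", "content": "You are a friendly support assistant."}, {"role": "user", "content": "Where is my order?"}]}, "preferred_output": [{"role": "assistant", "content": "I checked and your order ships tomorrow. I'll share the tracking link as soon as it's available."}], "non_preferred_output": [{"role": "assistant", "content": "Order status unavailable."}]}
```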
Approach 2: Odds Ratio Preference Optimization (ORPO)
ORPO is lighter and cheaper because no reference model is required. The trade-off is that it’s less stable: without that anchor, the model might drift.
It uses the same format and structure as DPO:
prompt (input),
chosen (preferred) output,
rejected (non-preferred) output.
Just to visualize, an example JSONL file in the Hugging Face format:
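The record layout is the same as in the DPO sketch above; only the content differs (again, hypothetical):

```json
{"prompt": "Summarize our refund policy in one sentence.", "chosen": "You can get a full refund within 30 days of purchase, no questions asked.", "rejected": "Refunds may be available under certain conditions described in the policy document."}
```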
Approach 3: Kahneman-Tversky Optimization (KTO)
KTO is a fine-tuning method inspired by behavioral economics (prospect theory and, in particular, loss aversion). It punishes bad outputs more than it rewards good ones.
It’s useful when your feedback signals are noisy or ambiguous (for example, clicks, likes, or upvotes that don’t always reflect true quality).
But there’s a trade-off: models trained with KTO often converge toward “safe, average answers.” To get real value, you need a wide range of positive examples so the model doesn’t just play it safe.
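A minimal sketch of a KTO-style dataset with binary labels (the prompts, completions, and labels are hypothetical):

```json
{"prompt": "Explain our pricing in one sentence.", "completion": "The Pro plan costs $29 per user per month, billed annually.", "label": 1}
{"prompt": "Explain our pricing in one sentence.", "completion": "Pricing depends on many factors and cannot be summarized.", "label": 0}
```

Note that some libraries expect the label as a boolean (true/false) rather than 1/0, so check the schema your tooling requires.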
In the example above:
label: 1 means the response is preferred/positive
label: 0 means not preferred/negative
That might resemble datasets for DPO/ORPO, but we could use other values, too:
If the model is configured for classification, it expects integers.
If it’s configured for regression (predicting continuous values), you can use floats (like 0.73 or 4.2).
I haven’t tested KTO myself. It’s worth knowing about so you know where to start if you ever need it, but the details are beyond the scope of this post.
Fine-Tuning Implementation Styles
Regardless of the method (SFT, DPO, ORPO, KTO), there are two main ways to implement fine-tuning:
Way 1: Full Fine-Tuning
Updates all original weights. Very flexible, but slow, expensive, and storage-heavy. Mostly used by research labs, not product teams.
Way 2: Parameter-Efficient Fine-Tuning (PEFT)
Adds small adapters (LoRA/QLoRA) instead of changing everything. It’s much cheaper, faster, and lighter to store, though less expressive for edge cases.
In practice, PEFT is what most teams use today. It gives you 80-90% of the benefits of full fine-tuning at a fraction of the cost.
Next, only for the premium subscribers, we discuss:
🔒Step-by-Step: How to Teach an LLM Request Classification Using SFT
🔒Step-by-Step: How to Make an LLM More Human-Like With DPO
🔒How to Choose the Right Fine-Tuning Approach
4. Step-by-Step: How to Teach an LLM to Do Request Classification Using SFT
Step 1: Prepare a dataset
Request classification seems like an easy task, but based on my research, most models get only 65-75% accuracy out of the box, and 80-85% even with few-shot examples, because of edge cases.
At the same time, a fine-tuned model should reach 90-95% accuracy pretty reliably.
Let’s test if that’s the case.