2026 Is Here. Stop Watching AI Models. Start Designing AI Systems.
What actually changed in AI, and what PMs and builders should focus on next.
Happy New Year!
2026 is here, and the shifts are already visible. Autonomous coding hitting the mainstream. ARC-AGI-2 scores above the human average.
Separating signal from noise keeps getting harder.
So, today, we cover:
AI in 2025: Recap
What the AI Labs Are Working On
What to Focus On in 2026
This is highly curated, based on my observations, experiments, and the papers I've been reading, not on common sources.
Seatbelts on. Let’s go.
1. AI in 2025: Recap
1.1 The rise of agentic AI
Throughout 2025, the mainstream narrative about AI agents was a few months behind the actual capabilities.
In mid-2025, the term “agentic AI” took off. Early definitions framed it as systems that could plan, decide, and execute goals with minimal human supervision.
The problem?
Reality lagged that framing. Reliable autonomy wasn’t there yet, despite social media and consulting firms repeating “agentic AI” like a mantra.
I won't lie - I got frustrated watching people post about "agentic AI" without experimenting or talking to practitioners (e.g., the AI Evals Cohort community I’ve been consulting).
I called this out several times:

In June 2025, agents still couldn’t reliably handle more than 2-3 tool calls. This was visible in the Agent Leaderboard by Galileo:
Then, on August 7, 2025, GPT-5 radically reduced hallucinations and improved instruction following. I didn’t trust OpenAI, so I confirmed this by running experiments like this one:

In October 2025, I ran another experiment, in which a fully autonomous agent succeeded 185/185 times, despite having no step-by-step instructions in the prompt:
This helped me confirm that instead of complicating instructions, we should double down on explaining the why (strategic context, objectives, and how success will be measured) - we previously covered that in A Guide to Context Engineering for PMs.
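For illustration, here is a minimal sketch of what a "why-first" prompt can look like. This is my own example, not the prompt from the experiment; the company, agent role, and success metrics are hypothetical:

```python
# A hypothetical "why-first" prompt: strategic context, objectives, and
# success criteria instead of brittle step-by-step instructions.
WHY_FIRST_PROMPT = """
You are a support triage agent for Acme, a hypothetical B2B SaaS company.

Strategic context: churn is driven by slow handling of billing issues.
Objective: resolve or correctly escalate every billing ticket.
Success is measured by: correct routing, no invented account data,
and a clear next step for the customer in every reply.

Use the available tools in whatever order you judge best.
"""
```

The contrast with the brittle alternative is the point: enumerating exact steps ("first call X, then Y") breaks on edge cases, while the why gives the agent enough context to make reasonable decisions on its own.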
Later, I demonstrated how to build autonomous agents in this lightning lesson:
Surprisingly, in November–December 2025, the social media narrative shifted to “agents don’t work.” We also saw a rising backlash against AI:
While there are many reasons for that backlash, what actually happened with capabilities was the opposite: agents stopped being toys.
By late 2025, agents became reliable in scoped workflows, once hallucinations dropped and planning, instruction following, tool use, evals, guardrails, and recovery matured.
The progress didn’t stop. I kept running experiments like this one, where top LLMs played chess. Every model iteration turned out to be stronger than the previous one.
Just a few days ago, I noticed this graphic posted by Brij Kishore Pandey:

Someone finally made sense of “agentic AI.” The term was repurposed to match reality: agents + observability + governance + evals + agentic protocols.
I loved it.
1.2 The rise of vibe engineering
Throughout 2025, I kept repeating that progress in AI coding was crazy. In September 2025, I started talking about vibe engineering:
This term was my attempt to emphasize that you don’t have to code, but you can’t ignore engineering. I learned that by experimenting with projects like Accredia (currently a B2B2C video course platform supporting our community):
Here we are a few months later. What’s happened in the last 1–2 weeks has been wild:
Senior Google engineer: Claude did in 1 hour what took Google 1 year:
Karpathy (OpenAI co-founder) says AI could make him 10x more powerful:
Engineers admit they’ve stopped coding (reported by a former Google engineer and founder of the coding-interview prep platform NeetCode.io):
The evidence keeps spreading on Reddit:
January 2026:
You don’t have to code, or even carefully review everything an agent creates.
What you do need is to ask good questions, design tests, and interpret simple snippets in a chat window. Keep pushing back until you understand the architecture, risks, and tradeoffs.
Some helpful prompts:
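A few illustrative examples in that spirit (my own wording, not a definitive list):
“Explain the architecture of this change as if I’m reviewing it. What are the riskiest parts?”
“What tests would catch a regression here? Write them before the implementation.”
“What tradeoffs did you make, and what would break at 10x the load?”
“Show me the smallest snippet I need to read to verify this works.”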
1.3 ARC-AGI-2 crushed
On December 11, GPT-5.2 dropped. Many people focused on GDPval (knowledge-work tasks), but what caught my attention was something else: it scored 52.9% on ARC-AGI-2.
ARC-AGI-2 is not a random benchmark. It’s designed to be difficult for AI and to minimize simple pattern matching. The average human score is 60%.
Surprisingly, that wasn’t the end of the story. Just before the New Year, Poetiq hit 75% on ARC-AGI-2 through orchestration, not a better model:
I don’t believe ARC-AGI-2 represents AGI (however we define it). But this clearly demonstrates progress beyond pattern-matching capabilities.
2. What the AI Labs Are Working On
Virtually every day, new AI papers drop. It’s impossible to track everything.
For example, on December 31, 2025, DeepSeek released a fundamental improvement in transformer architecture: mHC: Manifold-Constrained Hyper-Connections.
Zooming out, all these algorithmic and architectural improvements eventually leave the labs, compound, and get absorbed into frontier models.
My 2026 prediction: We’ll see more jumps in capabilities, and once trust catches up, adoption will spike.
There are only two trends I want to highlight.
2.1 Continuous Learning
One key weakness of LLMs is their inability to learn at test time (despite papers like Agentic Context Engineering, which offer helpful techniques and workarounds rather than true solutions).
Google's Titans + MIRAS architecture, last refreshed on December 4, 2025, demonstrated a path toward long-term memory and models that can learn from experience during test time without catastrophic forgetting.
In early tests, this drastically improved accuracy for extremely long contexts, despite using far fewer parameters:

A Google engineer recently predicted:
Note for PMs and builders: These are AI-lab timelines, not when reliable capabilities actually reach production.
2.2 Harness
AI engineers often talk about the “harness.” You can think of it as a safety and control layer around a model.
As Poetiq’s 75% ARC-AGI-2 result showed, you don’t need a better model to get better outcomes.
A Stanford paper from December 5, 2025, argues that LLMs are just the substrate. The real leverage is the orchestration layer you ship: how you coordinate multi-agent systems so they stay grounded and correct.
My translation (not their wording):
RAG for grounding
Context engineering for guiding
Tools for action
Verification loop for correctness
Guardrails for safety
Memory, scratchpads, and task lists for continuity
Roles for separation of concerns
Once the context and orchestration give the model enough support, outputs can change dramatically. We’re already seeing this with Claude Code.
Takeaway for PMs and builders: Stop judging “the model” in isolation. Most gains come from orchestration: retrieval, tools, memory, verification, and guardrails. AI evals help you iterate, prove it works, and monitor drift.
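To make the layering concrete, here is a minimal harness sketch in Python. It’s illustrative only: `call_model`, `retrieve`, the guardrail, and the verifier are hypothetical stand-ins, not any specific framework’s API:

```python
# Minimal, hypothetical harness sketch. The model is the substrate;
# grounding, guardrails, verification, and memory live around it.

def call_model(prompt: str) -> str:
    """Stand-in for any LLM call; swap in a real provider here."""
    return f"Draft answer for: {prompt[:60]}..."

def retrieve(query: str) -> str:
    """RAG layer: fetch grounding context (placeholder)."""
    return f"Documents relevant to: {query}"

def violates_guardrails(text: str) -> bool:
    """Safety layer: block disallowed content (toy check)."""
    return "CONFIDENTIAL" in text

def verify(answer: str) -> bool:
    """Verification loop: e.g., run tests or an LLM judge (toy check)."""
    return len(answer.strip()) > 0

def run_harness(task: str, max_attempts: int = 3) -> str:
    memory: list[str] = []  # scratchpad for continuity across attempts
    for attempt in range(1, max_attempts + 1):
        context = retrieve(task)  # grounding
        prompt = f"Task: {task}\nContext: {context}\nNotes: {memory}"
        answer = call_model(prompt)
        if violates_guardrails(answer):  # safety comes before correctness
            memory.append(f"Attempt {attempt}: blocked by guardrails.")
            continue
        if verify(answer):  # only verified answers leave the harness
            return answer
        memory.append(f"Attempt {attempt}: failed verification.")
    raise RuntimeError("No verified answer within the attempt budget.")

print(run_harness("Summarize last week's incident reports"))
```

The point isn’t this exact loop. It’s that every layer (grounding, guardrails, verification, memory) is swappable and testable independently of the model, which is exactly where the Poetiq-style gains come from.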
2.3 Why waiting for the next model is already late
If you’re waiting for “the next model” to unlock value, you’re already late.
The biggest gains in 2026 won’t come from raw model upgrades. They’ll come from better orchestration, clearer intent, tighter evals, and teams who know how to design systems around AI, not just call APIs.
3. What to Focus On in 2026
If 2025 was about capability catching up with hype, 2026 is about leverage.
The gap is no longer what models can do, but who knows how to use them best under real constraints: messy data, partial autonomy, human oversight, and business risk.
Here are the five key areas I want to highlight.
3.1 Product Discovery (still wins)
You might expect me to say “go AI-first.” I won't.
Because product discovery still defines whether you’re solving a real problem or just automating noise. I also explained this in WTF is AI PM.
AI reduces build cost, not the cost of being wrong.
You can’t build random things and throw spaghetti at the wall. You still need to:
Explore the problem space,
Identify real opportunities,
Form clear hypotheses,
Test assumptions deliberately.
AI lets you prototype, test, and discard ideas at unprecedented speed. But discovery decides where that speed is applied.
More details, plus a link to a video course that’s free for premium members:
3.2 Developing AI intuition beats learning AI tools
AI tools change every few months. Mental models compound.
The best way to develop real AI intuition, understand what’s possible, and internalize the details that matter is to roll up your sleeves and build.
Our AI PM Learning Program is designed for that:
Part 2: C21 AI Agents Practitioner PM
As for tools, you need surprisingly few. You can start with:
Lovable / Google AI Studio / Dyad (build without friction)
Perplexity (research)
n8n (orchestration)
NotebookLM (research, structure)
Claude Code or Claude Desktop + Desktop Commander (Claude Code is amazing for developers; Desktop Commander enables file access and shell commands without a terminal and can support many common PM tasks)
3.3 Areas to double down on
Read fewer things. Two posts studied deeply beat thirty skimmed once.
Context engineering and AI evals cover some of the most critical insights I’ve seen recently:
The importance of expressing intent:
Most agent failures I see are not model failures, but intent failures.
Remember to start with the Why, not the What.
When given the strategic context, agents can usually make reasonable assumptions.
Defining what good behavior is:
Start with Evals, not instructions.
You’re unlikely to get instructions or examples right without data and iteration.
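To make that concrete, here is a tiny, hypothetical evals-first loop. It’s my sketch, not a specific tool; the dataset, routes, and the `agent_route` stand-in are invented for illustration:

```python
# Hypothetical evals-first loop: a small labeled dataset and a pass-rate
# metric exist before any prompt tuning begins.

EVAL_SET = [
    {"input": "Refund request for order #123", "expected": "billing"},
    {"input": "App crashes on login", "expected": "engineering"},
    {"input": "Invoice shows the wrong amount", "expected": "billing"},
]

def agent_route(ticket: str) -> str:
    """Stand-in for the agent under test; swap in the real system."""
    billing_terms = ("refund", "invoice", "charge")
    if any(term in ticket.lower() for term in billing_terms):
        return "billing"
    return "engineering"

def pass_rate() -> float:
    """Score the agent against the labeled set."""
    passed = sum(agent_route(c["input"]) == c["expected"] for c in EVAL_SET)
    return passed / len(EVAL_SET)

# Establish the baseline first; only then iterate on instructions/examples.
print(f"Pass rate: {pass_rate():.0%}")
```

Only once this baseline exists do you start tuning instructions and examples, because now every change has a score attached to it.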
Some of the most important pieces of the newsletter:
Error Analysis (not easy, but critical)
3.4 Product strategy, distribution, and pricing
When building becomes cheap, strategy becomes expensive.
This aligns with what I first noticed working at Ideals (my last and favorite company; I still wear their t-shirt): PMs were expected to explore the market, contribute to strategy, and identify expansion opportunities.
Useful reads:
3.5 Career shift and the future of PM
You don’t have to code. But AI roles increasingly require engineering skills (resources discussed here).
This is already visible:
Meta explicitly expects PMs to vibe-code
LinkedIn launched the Associate Product Builder Program
Your goal is to drive value, not stay in your comfort zone. Whatever the PM role becomes, think of yourself as someone who solves problems at the intersection of business, users, and technology.
There have never been more opportunities.
Even as traditional roles increasingly overlap.
—
If you want to transition into AI PM, don’t start with a resume.
Solve a real problem or identify an opportunity instead, like Olia Herbelin did by proposing an AI Agents Buildathon for PMs (not my idea!).
That was the first time I’d seen someone approach it this way.
This and other “unfair strategies” top AI PM candidates rely on are described here (a coincidence).
Thanks for Reading The Product Compass
2026 is here. The leverage shift has begun.
Start building. Stop theorizing. We’re just getting started.
For dozens of AI PM resources, see The Ultimate AI PM Learning Roadmap and our articles archive.
Have an incredible 2026,
Paweł