I Don't Review the Code. I Review the Artifacts.

You don't have to code. What you review instead, why it's the PM job now, and the prompt pack for agentic engineering.

May 31, 2026

∙ Paid

In the last two weeks I shipped three things to my GitHub: PM Brain, a second brain for product managers; Claude Usage for VS Code; and Grok Build, another VS Code extension. In the same two weeks I updated accredia.io (B2B2C), wrote articles, ran experiments on agents, and taught a cohort of students building with Claude.

Pawel Huryn GitHub: Grok Build, Claude Code Usage, PM Brain OS, Accredia.io

More than 800 tests and eval checks across the three repos, from unit tests to LLM judges, nearly all green. I’m a PM, not an engineer, and I barely read the code.

💡 The uncomfortable part for PMs is that this isn’t a future workflow. It’s the job now, once the agent builds faster than the team can align.

And this isn’t just me, or just indie builders shipping side projects. A Google PM described the shift from writing-first to building-first:

Madhu Guru@realmadhuguru

At @Google, we are moving from a writing‑first culture to a building‑first one. Writing was a proxy for clear thinking, optimized for scarce eng resources and long dev cycles - you had to get it right before you built. Now, when time to vibe-code prototype ≈ time to write PRD,

11:23 PM · Jul 29, 2025 · 642K Views

205 Replies · 419 Reposts · 4.82K Likes

It’s not only Google.

Meta PMs now vibe-code prototypes and demo them straight to Zuckerberg, and the company added a prototyping round to its PM interview; LinkedIn replaced its APM program with an “Associate Product Builder” role gated on a live AI build. The prototype is becoming the PM’s first deliverable inside large companies, not a side-project trick.

So what did I review? The artifacts:

the plan before it ran
the decisions it encoded
the strategy the agent reads every turn
the docs that say what the thing does and why

💡 A large part of the PM job is moving from writing spec to maintaining the context that steers agents and reviewing the artifacts that tell you what changed. The conversations with users, stakeholders, and teams stay; the build loop changes.

It’s the same shift whether your surface is a repo, Claude Code, Cowork, or a chat window. Experimenting got cheaper than building agreement, and that reorders the work.

None of them started with a spec. Each started with a conversation, and what I kept was the artifacts.

What this post covers:

What the artifact layer is, and why a PM reviews it instead of the code
Why building now comes before agreement, not after
Where your strategy has to live to actually steer the agent
The boundaries that let you hand an agent real autonomy and walk away
The AI Shipping Artifact Prompt Pack to download at the end

What I Actually Ship Now

The product is the obvious artifact: the extension on the Marketplace, the tool people install. The one that decides whether the next change goes well is less obvious. It’s the knowledge layer.

Not the README or the changelog, every repo has those. The docs that hold the decisions: why this architecture, who it’s for, what we ruled out:

In PM Brain that’s files like architecture.md and why-this-matters.md
For accredia.io it was more than twenty docs, from database.md and security.md to a fuckups.md
For a multi-agent app I built with my cohort, it was the agent contracts: what each agent takes in, what it returns, which tools it can touch, and how they hand off.

💡 Same layer whether I’m shipping an extension, a knowledge system, or a system of agents.

Keeping that layer true is a question of what to automate, and when:

In claude-usage, git tags are generated from the CHANGELOG automatically: the changelog is the source of truth, the tag is a deterministic projection.
In Grok I keep version bumps manual on purpose, because automating them there hasn’t earned its keep yet.

📌 For PMs: Automate the friction that’s real and recurring, not the friction you imagine you’ll have.

Seven things shipping this way taught me.

Lesson 1: Review Artifacts, Not Code

The question I get from PMs is “do I have to learn to code?” I don’t, and I don’t review the code, beyond the odd snippet.

💡 The code is the agent’s, the way code in any company is the engineers’, not the CEO’s. The CEO is still accountable for what ships, but doesn’t earn that accountability by sitting in the editor. What I review is the artifact layer above it.

That layer isn’t a side effect of the build. It’s a deliberate, human-and-agent-readable summary of the solution, so neither of us has to reverse-engineer the product from the source. It often duplicates what the code already says, and that’s the point: it’s far easier to consume.

It’s also what I check the code against. When I care whether the implementation matches the intent, whether the security model holds, whether performance is where it should be, I read the artifact, not the diff.

So when work comes back, I don’t read changes in the code. I read changes in the artifacts. I review across three modes, escalating only as far as the decision needs:

Review Artifacts, Not Code; Agentic Engineering

Cheapest mode first. A durable artifact is the receipt: it survives the context window, and it survives me. Only commit one when the alternative is losing the thread.

Here’s one from a build. I opened the session with a document, not a ticket: “See the research in plan-mode.md. Can we really implement that logic reliably on our side?”

The agent didn’t start writing code. It read the doc, ran a quick probe to test the doc’s core assumption against the real tool, and reported what it found. I signed off on the approach from that exchange, not from a single line of the implementation.

AI Coding Agents: Work from the artifact

📌 For PMs: the unit of review changed. You're not approving lines of code. You're approving a description of what changed, why, and whether it should ship. If your artifacts can't answer that, no amount of code-reading will.

And this matters more on a team, not less: your teammates don’t carry your context in their heads, so the artifact layer is how a decision or a hard-won finding survives past the person who made it.

Lesson 2: Build Before Building Agreement

The old motion was: write the PRD, get everyone to agree, then build. That order made sense when building was the expensive part. You spent weeks of engineering, so you spent days up front making sure it was the right thing.

💡 Building isn’t the bottleneck anymore. The first working version is often cheaper than the meeting about it, an afternoon instead of a sprint. So the order flips: you build the thing, then align around what you can see instead of arguing about what you imagine. Experimenting got cheaper than building agreement.

That’s why I shipped two major releases in a day, and why none of these projects started with a spec. Two-week iterations start to feel dead in this mode. Not because planning is useless, but because the unit of learning changed: when a working version costs an afternoon, a sprint is too slow to be the main loop. You ship, watch, revise, and ship again before the old process would have turned the debate into tickets.

💡 A document can’t compete with that, because it forces everyone in the room to build a different product in their head. One imagines the happy path, one the edge cases, one the UI, one the roadmap impact. Then the meeting pretends those are the same thing. They aren’t.

A prototype, even a rough one, collapses the ambiguity: the team sees what the customer clicks, where they hesitate, what they ignore, what they ask for next. Highly regulated industries aside, agreement around an untested doc is low-value at best: the idea changes the moment it touches reality. At worst it's harmful, because the team gets attached to a product that only existed in language.

📌 For PMs: stop spending your most expensive currency, alignment, on ideas nobody has seen. Build first, then bring to the meeting:

the screen recording
the failed path
the confusing label
what people actually did

Alignment gets easier when the room is looking at evidence or something you can inspect instead of trying to imagine the same future.

Lesson 3: Put Strategy Where the Agent Works

I still write the strategy down:

who it’s for
what changes for them
what I won’t build
the trade-offs I’ll accept
what would make the work a failure

What changed is where it lives and what it’s for. It doesn’t sit in a folder someone might open before a kickoff. It lives in the repo, a strategy.md the agent reads every time it acts.

💡 Leading with strategy matters because the agent isn’t a tool waiting for instructions. It’s a partner making hundreds of small product decisions while it builds.

If the strategy isn’t in its context, those decisions get made from whatever’s nearby: the last prompt, the surrounding code, a guessed convention, the model’s own confidence.

The repo becomes the product’s memory, too:

experiments
decisions
evals
screenshots
release notes
the constraints I hit

They accumulate there anyway. My job is to keep that context clear enough that the agent can use it. (The one thing I stopped writing is the spec, the user stories and acceptance criteria that told engineering exactly what to build. That part is cheap now; it falls out of the conversation.)

Your strategy is only doing work if it changes what the agent does next.

Lesson 4: Cross-Examine Confident Answers

I asked Claude to bump the version and publish. It was confident: “Just run the publish command. CLAUDE.md says the publisher’s already authenticated.” I almost did it. Then I asked: “Are you sure? I’ve been publishing by hand through the Marketplace website.”

The confidence collapsed. That claim was from a past session I’d never verified, and the setup it promised had never happened. The command would have failed at the first step.

That one was harmless. A command fails in front of you, you fix it in a minute. The claims that bite are the confident ones you act on instead of run, where the failure is silent and lands downstream. Pushback is how you catch those before they cost you. The moment you ask “are you sure?” carries more weight than the next instruction you give.

But you can’t interrogate everything, or you become the bottleneck you were trying to remove. So I triage. I cross-examine the confident answers that are expensive to be wrong about:

irreversible changes
anything touching auth, billing, security, or the data model
anything that becomes context future work depends on
any claim that sounds too clean

💡 The rest I let ride. Skepticism is a budget. Spend it where being wrong costs the most.

The strongest version of pushback isn’t you, though. Your attention doesn’t scale. A second model gives you more review surface without funneling every hard question back through your head.

When I had the extension audited for bugs, the agent built a confident, elaborate theory and walked straight past a one-line bug in the same function: a failure was being reported as a success. Trust that, and it keeps building on a false "it worked": wrong analysis, maybe a destructive next step. Codex caught the one-liner Claude had talked itself past.

A second copy of the same model shares its blind spots. A different model, trained differently, fails differently. Self-review feels like diligence. Cross-review is. (I wrote up how I wire Codex alongside Claude separately. You can watch one play out in public: the cross-check narrowed a contributor’s over-broad fix to a single bad value, and he confirmed it.)

📌 For PMs: the cheapest quality gain in agentic engineering is pushing back where it counts, then routing the answer past a second, different model.

Lesson 5: Make Failures Teach the System

The best artifacts in my projects usually began as failures I didn’t want to debug twice.

The loop is the same every time: a failure surfaces, I find the smallest thing that would have caught it, a test, an eval, or a line of policy, and I make it permanent so the system can’t forget it.

The most load-bearing line in my CLAUDE.md is four words: “Don’t introduce abstractions speculatively.” It earned its place after an earlier project drowned in structure it didn’t need. A line copied off social media is a guess about a problem you might have; a line written after the third time something went wrong is policy. accredia.io keeps a fuckups.md in its docs; that’s where its rules come from.

Tests are the same idea in a different file. Early on, the model offered a 27-test suite and I took it, but the next bug I hit wasn’t in any of the 27, it was a sequencing problem they tested around. So now, when a real bug shows up, I ask: “What’s the smallest test that would have caught this?” and add that one.

The claude-usage triage routine mechanizes it: on every fix it adds the test, then reverts the fix to confirm the test fails without it. A green suite proves the past; manual smoke probes the future.

Agentic Engineering: Make Failures Teach the System

Evals are the same instinct for AI that’s in the product: read the traces, name what’s failing, then turn each failure into a test, an LLM judge, or a new line in CLAUDE.md. PM Brain’s judges are each named after a real failure (decision_quality, evidence_hierarchy_respected), never from a spec up front. (I go deeper in evaluating AI products through error analysis.)

📌 For PMs: the AI writes the code, the tests, the eval harness. You decide which failure is worth learning from, and you make the lesson stick. That judgment is the part you can’t delegate.

Lesson 6: Point Every Agent at One Source of Truth

I had three coding agents open during this build: Claude Code in the terminal, a Grok tab in VS Code, and Codex through a separate skill. Same repo, same files, three independent conversations. Without a shared source of truth, they’d have proposed three different architectures, and I’d have spent the build arbitrating instead of building.

So I keep one master doc: CLAUDE.md, with the project, the conventions, and the decisions already made. The other agents don’t read it by default, so each gets a tiny AGENTS.md whose only job is to point them at it. It duplicates nothing; it just redirects.

In claude-usage I flipped the direction: there AGENTS.md holds the guidance and CLAUDE.md is one line, @AGENTS.md, that imports it. Either way works. One source, everything points at it.

Agentic Engineering: Point Every Agent at One Source of Truth

It isn’t only CLAUDE.md, either. The source of truth is the whole working-context layer, the docs, rules, skills, and hooks each agent can reach, and I keep that synced too. Each agent can have its own entrypoint; it shouldn’t have its own reality. (Full setup in Codex Setup for PMs.)

📌 For PMs: this is closer to running a small engineering team than “AI pair programming.” Each agent has a different bias, and you don’t manage that with three sets of instructions. You manage it with one source of truth and a pointer at it. The skill is integration, not prompting.

Lesson 7: Grant Autonomy, Enforce Boundaries

Early on, I babysat every run. I watched the model think, approved each step, sat at the screen the whole time. That doesn’t scale across the number of things I’m trying to ship.

The shift was setting a goal, giving the agent room to pursue it, and walking away. The clearest example runs every week without me. The goal: go through every open pull request and feature request on claude-usage, merge the clear bug-fixes, mark the duplicates, push out-of-scope features to a Discussion, and report back only what couldn’t be resolved. It runs headless, on a schedule. I read the summary later, often from my phone.

Autonomy that wide only works with two kinds of limit, and the difference is the whole game:

Steering prompts are soft: a one-line reminder, or a repeatable procedure like the gate rules in my content system. Either way, under pressure the model can talk itself out of one.

Hard boundaries are enforced where the model gets no vote. In the triage routine:

nothing closes unless the second model writes a sign-off file with an exact approval phrase
it won’t auto-merge anything touching auth, secrets, or more than 200 lines
it never pushes to main

Those aren’t requests. They’re mechanical gates.

The strongest version of a hard boundary is the tool surface itself. In the Grok extension, plan mode isn’t a polite “please don’t edit files yet.” The extension gates the actual file-write and terminal calls, so while a plan is pending the model physically cannot touch the disk.

Same idea as a Claude Code hook or a tool allowlist: if a limit has to hold, it lives in code, not in a sentence the model can reinterpret. PM Brain enforces a different limit the same way: a hook fires after every file write and blocks the agent from saving a brain file with an untagged claim. No clean save, no write.

The same logic runs at the shipping boundary. On claude-usage, no pull request merges to main on a red test suite. I didn’t know how to wire that up, so I asked Claude to set the branch protection for me. The agent can be confident and a contributor can be in a hurry; the merge button stays disabled until the tests pass.

Agentic Engineering: Grant Autonomy, Enforce Boundaries

PM Brain’s /review edits knowledge files on its own, but it’s hard-stopped from changing a decision’s status; that judgment is mine. Wide lane, real walls.

📌 For PMs: this is the steering-versus-guardrails split from intent engineering, made concrete. Sort your limits into preferences and non-negotiables. Preferences go in the prompt. Non-negotiables go in hooks, tool permissions, and mechanical gates. Then go do something else.

What Still Needs Human Judgment

Here’s what the system still can’t decide for me.

Models over-rate their own findings. In that audit, the agent tagged three findings “high severity.” Checked against the actual code, all three were medium or low. The model is good at producing candidates. Deciding what’s actually serious is still mine.
A second model helps; it isn’t magic. Two models from the same family share blind spots. Claude and GPT diverge enough to be worth pairing, but neither makes the load-bearing check stop being yours.
Taste regressions still need eyes. No agent caught the copy button overlapping the header buttons. I did, on a manual pass. Design judgment isn’t in the test suite.
The workaround becomes part of what you ship. Grok’s CLI returned unreliable tool results, so every session in the extension opens with a hidden primer: “Do not trust the tool result.” That scar tissue is product judgment too.

So What Do PMs Ship Now?

Across all of it, the extension, the knowledge system, the premium section behind this newsletter, the pattern held.

None of it shipped because I wrote a better spec, and none of it shipped because I read the code. It shipped because the conversation was good, and because I kept the few artifacts that turned out to be load-bearing: the strategy, the decisions, the constraints, the CLAUDE.md the agent actually reads.

PM Brain made it obvious. The brain it scaffolds is 44 markdown files around a single CLAUDE.md, and that structure emerged from use, not from a document I wrote first.

📌 For PMs: If you’re about to ship something with AI, stop asking whether you need to read the code. Ask instead: which artifacts would let you, and the agent, understand what changed, why, and whether it should ship? Build those. Keep them true. That’s the job now, and that’s the artifact you actually shipped.

The AI Shipping Artifact Checklist

If the artifact layer is what you actually ship, here’s the starting set. Organized by purpose, not by stack, because the project changes but the knowledge-layer needs don’t. Steal the list; name the files whatever fits your repo.

None of these is a spec. Each is a place the decisions live, so you and the agents can read them instead of reverse-engineering the code. You don't have to hand-write them, either.

The AI Shipping Artifact Prompt Pack

One download, two kinds of thing, both built around the artifact layer above:

The artifact prompts: seventeen of them, one per kind of file and risk, from strategy to workflow maps to cross-model review. Point an agent at your repo, run a prompt, review the doc it writes. They generate the layer; they aren’t the layer.
The audit commands: a static security audit and a performance twin, for code you didn’t write. The security one, /security-audit-static, traces untrusted input to the places it could do real damage, then refutes its own findings before reporting. I pointed it at Langfuse, a leading LLM engineering platform, and it surfaced a real weakness in how provider connections were fetched. The hardening fix merged, 964 lines added, 334 removed.

Some of the 20+ issues I’ve since responsibly disclosed to public-repo security teams (Langfuse at the top):

/security-audit-static, responsible disclosure, agentic engineering

Treat the pack as a menu. Reach for the prompt that matches the risk in front of you. It complements engineering review; it doesn't replace security, infra, or code-owner sign-off where those are required.

Get the pack below: 👇

Keep reading with a 7-day free trial

Subscribe to The Product Compass to keep reading this post and get 7 days of free access to the full post archives.