How Product Managers Should Learn AI: Operators First, ML Later

A practical 30-day guide for PMs learning AI: operate on real work first, brief models well, add context, test quality, and ship safe AI features.

Most PMs do not need to begin AI by studying model architecture.

They need to begin by operating with AI on real product work: using it, briefing it, grounding it, testing it, and learning where the output becomes trustworthy enough to matter.

That sounds less glamorous than agents, benchmarks, or neural network diagrams. It is also where the job actually starts.

The market is already past the “should PMs use AI?” phase. Microsoft is selling an AI Product Manager certificate. DeepLearning.AI has introductory courses for non-technical leaders. OpenAI, Anthropic, Google PAIR, Shopify, and Morgan Stanley have all published practical lessons about prompting, evals, human-centered AI design, agents, and production reliability.

That creates a trap: a PM can spend months collecting AI resources and still not build the judgment needed to ship a useful AI product.

The right order is simpler:

  1. Operate with AI on work you already understand.
  2. Learn to brief models like you brief teams.
  3. Give the model real context.
  4. Evaluate outputs before you trust them.
  5. Design the safest useful product version.
  6. Go deeper into ML only when the product problem demands it.

This is not an anti-technical argument. PMs working on AI should eventually understand tokens, context windows, embeddings, retrieval, evals, latency, cost, privacy, and model tradeoffs. But starting with the technical layer is often the wrong first move.

PMs do not learn AI best by memorizing model internals. PMs learn AI best by building taste: what is useful, what is risky, what is missing, and what should ship.

The PM judgment loop

A PM learning AI should be able to answer six questions:

  1. What real work could AI help with?
  2. What does a good answer look like?
  3. What context does the AI need?
  4. How will we catch wrong, shallow, or unsafe output?
  5. What is the safest useful first version?
  6. When should a human stay in the loop?

That loop is the article in miniature: use it, brief it, ground it, test it, ship it safely.

If you can answer those six questions clearly, you are already ahead of many teams chasing impressive demos.

1. Start with boring PM work

Do not start with the flashiest AI use case. Start with work where you already know what “good” looks like.

This matters because AI can sound confident when it is wrong, generic, or incomplete. If you use it on work you understand, you can spot the misses. If you use it on work you barely understand, fluency can masquerade as expertise.

For the next two weeks, use AI on three recurring PM tasks. Pick from a list like this:

  • summarize customer interviews
  • extract themes from support tickets
  • turn meeting notes into a decision memo
  • draft a PRD outline
  • rewrite release notes
  • compare competitor positioning
  • brainstorm onboarding experiments
  • identify risks in a product plan
  • turn messy research into a one-page brief

Choose tasks with fast feedback. If you cannot tell whether the output was good, the task is too far from your current judgment.

Keep a small learning log.

For example:

| Task | What I gave it | What worked | What failed | What I changed next time |
| --- | --- | --- | --- | --- |
| Interview summary | Transcript + research goal | Found repeated pain points | Missed one important quote | Added “include direct quotes” |
| PRD outline | Problem + constraints | Created useful structure | Too generic | Added target user and non-goals |
| Competitor review | Three landing pages | Found positioning themes | Overstated differences | Asked for evidence from the pages only |

Every Friday, put each use case into one of three buckets:

| Bucket | Meaning | PM action |
| --- | --- | --- |
| Keep | Saves time without lowering quality | Turn it into a reusable prompt or workflow |
| Revise | Useful but unreliable | Add context, constraints, examples, or review steps |
| Discard | Creates more review work than value | Drop it for now and try a lower-risk task |
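
If you prefer to keep the log as data rather than a document, a minimal sketch in Python might look like this. The bucket names mirror the table above; the entries and the friday_review helper are illustrative, not a prescribed tool.

```python
# A tiny learning log as data. Each entry records one AI use case from the
# week; "bucket" holds the Friday triage decision: keep, revise, or discard.
from collections import defaultdict

log = [
    {"task": "Interview summary", "gave": "transcript + research goal",
     "failed": "missed one important quote", "bucket": "revise"},
    {"task": "PRD outline", "gave": "problem + constraints",
     "failed": "too generic at first", "bucket": "keep"},
]

def friday_review(entries):
    """Group the week's use cases by triage bucket."""
    buckets = defaultdict(list)
    for entry in entries:
        buckets[entry["bucket"]].append(entry["task"])
    return dict(buckets)

print(friday_review(log))  # {'revise': ['Interview summary'], 'keep': ['PRD outline']}
```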

This log is more useful than another thread about “10 AI tools for PMs.”

You will see the pattern quickly. AI gets much better when the task is clear, the context is real, and the review step is explicit. It gets worse when you ask vague questions and accept the first answer.

2. Treat prompting as brief-writing

Prompting is not magic wording. It is delegation.

A weak prompt fails for the same reason a weak product brief fails: the goal is vague, the audience is missing, the constraints are unclear, and no one defined what a good output should include.

A strong prompt usually tells the model six things:

  • Role: who should it act like?
  • Goal: what job should it do?
  • Context: what information should it use?
  • Constraints: what must it respect or avoid?
  • Output: what format do you want?
  • Check: how should it critique the answer?

Weak prompt:

Suggest features for onboarding.

Better prompt:

Act as a senior PM for a B2B SaaS product. We want to improve activation for trial users who understand the value but drop before setup is complete. Suggest five improvements we could ship in one quarter. For each, include the user problem, likely impact, effort, biggest risk, and how we would test it. Rank the ideas from fastest learning to highest long-term value.

The second prompt is not better because it has clever phrasing. It is better because it contains product thinking.

Before you prompt, write the quality bar in one sentence:

A good answer must help me decide [decision], using [evidence], while respecting [constraint].

That sentence prevents a lot of vague AI work.

A useful short template:

Help me with [task]. Use [context]. Optimize for [goal]. Avoid [constraints]. Return [format]. Before finalizing, tell me what may be wrong, missing, or uncertain.

The last sentence matters. PMs should not only ask AI to produce; they should also ask it to help inspect the output.
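
If you reuse the same brief often, it can help to capture it as a small function. This is a sketch built on the six-field structure above, not any library's API; the example values are invented.

```python
# Build a prompt from the six brief fields: role, goal, context, constraints,
# output format, and a final self-check. Purely illustrative.
def build_brief(role: str, goal: str, context: str,
                constraints: str, output: str) -> str:
    return "\n".join([
        f"Act as {role}.",
        f"Your job: {goal}.",
        f"Use this context:\n{context}",
        f"Constraints: {constraints}",
        f"Return: {output}",
        "Before finalizing, tell me what may be wrong, missing, or uncertain.",
    ])

prompt = build_brief(
    role="a senior PM for a B2B SaaS product",
    goal="suggest five activation improvements we could ship in one quarter",
    context="trial users understand the value but drop before setup is complete",
    constraints="rank from fastest learning to highest long-term value",
    output="for each idea: user problem, impact, effort, biggest risk, test plan",
)
```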

This is consistent with how the major model labs now talk about prompting. Anthropic’s prompting guidance starts by asking teams to define success criteria and empirical tests before tuning prompts. OpenAI’s current prompt guidance is also less about secret phrases and more about defining outcomes, constraints, evidence, and final-answer expectations.

The PM translation: if you cannot define good work, the model cannot reliably deliver it.

3. Context is where most quality comes from

Generic input creates generic output.

If you ask AI to improve onboarding with no details, you will get average advice. If you give it activation data, interview notes, screen descriptions, constraints, and past experiments, the answer becomes much closer to real PM work.

Before asking AI for important work, build a small context pack:

  • target user segment
  • product goal
  • current problem
  • relevant data or examples
  • current flow or screenshots, if useful
  • constraints such as timeline, engineering capacity, policy, compliance, or brand voice
  • one example of a good answer, if you have one

Then be strict:

Use only the context below. If the context is not enough, say what is missing instead of guessing.
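
As a sketch, a context pack and its grounded prompt can be this simple. The pack contents are invented; the refusal instruction is the strict rule above.

```python
# Assemble a context pack into a grounded prompt. The final instruction tells
# the model to flag gaps instead of guessing.
CONTEXT_PACK = {
    "target user": "trial admins at 50-200 person companies",
    "product goal": "raise week-1 activation",
    "current problem": "users stall between signup and first project",
    "data": "activation funnel numbers plus three interview notes",
    "constraints": "two engineers, one quarter, no pricing changes",
}

def grounded_prompt(task: str, pack: dict) -> str:
    context = "\n".join(f"- {key}: {value}" for key, value in pack.items())
    return (
        f"{task}\n\n"
        "Use only the context below. If the context is not enough, "
        "say what is missing instead of guessing.\n\n"
        f"Context:\n{context}"
    )

print(grounded_prompt("Propose onboarding improvements.", CONTEXT_PACK))
```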

This is the plain-English version of a lot of AI system design. You may hear teams talk about retrieval, grounding, RAG, tool use, or memory. The PM translation is simpler: the model needs the right source material at the right moment.

You do not need to implement the whole system yourself. But you do need to define what the AI should know, what it should ignore, and what quality bar the answer must meet.

That is product work.

4. A demo is not a product

One good answer does not mean the feature is ready.

AI demos are easy to overtrust because the happy path looks smooth. The real question is what happens on the messy path: missing context, edge cases, stale policies, unclear user intent, sensitive data, and ambiguous requests.

Before launch, define what “good enough” means with a simple test sheet built from realistic examples:

| Example | Good answer must include | Bad answer would do this |
| --- | --- | --- |
| Cancel a subscription request | Policy, account status, and next step | Promise a refund without checking rules |
| Summarize an interview | Main pain, direct quote, and confidence level | Invent a theme the user did not mention |
| Draft release notes | User-facing change, benefit, and limitation | Overpromise what shipped |
| Compare competitors | Evidence from sources | Make claims without examples |

Start with 20 to 50 realistic examples. You do not need a perfect testing system on day one. You need enough examples to see whether the AI repeatedly helps or repeatedly creates review work.
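
A first pass at that test sheet can be plain substring checks over realistic cases, as in the sketch below. Here ask_model is a hypothetical wrapper around whichever model API your team uses; the rules are illustrative, and real rubrics get richer over time.

```python
# A minimal eval harness: each case lists phrases a good answer must include
# and phrases a bad answer would contain. Substring checks are crude but catch
# repeated failures early.
TEST_SHEET = [
    {"input": "I want to cancel my subscription.",
     "must_include": ["policy", "next step"],
     "must_not_include": ["refund is guaranteed"]},
    {"input": "Summarize this interview: <transcript>",
     "must_include": ["pain", "\""],  # a named pain point and a direct quote
     "must_not_include": []},
]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wrap your model API here")  # hypothetical

def run_evals(sheet):
    failures = []
    for case in sheet:
        answer = ask_model(case["input"]).lower()
        missing = [s for s in case["must_include"] if s not in answer]
        forbidden = [s for s in case["must_not_include"] if s in answer]
        if missing or forbidden:
            failures.append({"input": case["input"],
                             "missing": missing, "forbidden": forbidden})
    return failures  # an empty list means every case passed
```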

Ask these questions:

  • Did it answer the actual user question?
  • Did it use the right context?
  • Did it invent facts?
  • Did it miss a key constraint?
  • Is the answer specific enough to act on?
  • Does it show sources when trust matters?
  • Does it ask for help when uncertain?
  • What is the damage if it is wrong?

The last question is the most important one.

A brainstorming assistant can be imperfect and still useful. A support assistant that changes account status needs a much higher bar. Healthcare, finance, legal, security, and compliance workflows usually need sources, review, audit logs, and tight limits.

OpenAI’s evals guidance says evaluations test model outputs against style and content criteria, and its best-practices guide emphasizes task-specific evals, logging, human calibration, and continuous evaluation. Anthropic’s eval guidance makes the same point in different words: success criteria should be specific, measurable, relevant, and tested against realistic cases.

For PMs, this is the core shift: quality is no longer only a design review or QA step. Quality becomes a product artifact.

5. Start AI features with the safest useful version

When someone says, “Let’s build an AI agent,” slow the room down.

Ask five questions first. Strong answers should sound concrete:

| Question | Strong answer |
| --- | --- |
| What painful job are we solving? | Users repeat this task often, and it wastes time or causes mistakes. |
| Why does AI help? | It can summarize, compare, draft, classify, search, or prepare faster with the right context. |
| What context does it need? | Specific documents, rules, user records, examples, or product states. |
| How will we catch mistakes? | Tests, sources, review steps, fallbacks, monitoring, and limits. |
| What is the safe first version? | Draft, recommend, or prepare for approval before acting. |

Most teams should begin with help, not autonomy.

Use this trust ladder:

  1. Draft: AI creates a first pass. A person edits.
  2. Recommend: AI suggests a next step. A person decides.
  3. Prepare: AI prepares an action. A person approves.
  4. Act within limits: AI takes low-risk actions with monitoring and rollback.

Pick the lowest rung that still helps the user.

A support reply draft is safer than an AI issuing refunds. A risk flag is safer than an AI blocking users. A prepared CRM update is safer than an automatic account change.
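
To make the ladder concrete, here is a sketch of it as configuration. The four levels mirror the list above; the gate function is illustrative, not a real framework.

```python
# The trust ladder as configuration. The gate refuses anything above the
# feature's rung and requires a human decision for everything except
# "act within limits".
from enum import IntEnum

class Autonomy(IntEnum):
    DRAFT = 1      # AI creates a first pass; a person edits
    RECOMMEND = 2  # AI suggests a next step; a person decides
    PREPARE = 3    # AI prepares an action; a person approves
    ACT = 4        # AI takes low-risk actions with monitoring and rollback

FEATURE_LEVEL = Autonomy.PREPARE  # the lowest rung that still helps the user

def may_take_effect(requested: Autonomy, human_approved: bool) -> bool:
    if requested > FEATURE_LEVEL:
        return False          # above this feature's rung: never
    if requested == Autonomy.ACT:
        return True           # low-risk action within monitored limits
    return human_approved     # draft, recommend, prepare: a person signs off
```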

Anthropic’s agent guidance is useful here because it draws a sharp distinction between workflows and agents. Workflows follow predefined paths. Agents dynamically decide what to do and which tools to use. Anthropic’s advice is blunt: find the simplest solution possible and increase complexity only when needed.

That advice is product strategy, not just engineering advice.

Autonomy is not the goal. Trusted usefulness is the goal.

The one-page AI feature spec

When you are ready to propose a first AI feature, do not start with a model choice. Start with the operating surface.

Fill in these fields:

| Field | Fill this in |
| --- | --- |
| User job | What repeated task becomes easier? |
| First safe version | Draft, recommend, prepare, or limited action? |
| Required context | What must the AI know to be useful? |
| Forbidden behavior | What must it never do? |
| Review step | Who checks the output before it matters? |
| Test set | What 20–50 examples prove it works repeatedly? |
| Failure response | What happens when confidence is low or context is missing? |
| Success metric | What repeated user behavior proves value? |
| Cost and latency limit | What delay or model cost would make the feature feel worse than the old workflow? |
| Data boundary | What information can the system access, retain, or expose? |
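
If you want the spec to live next to the code, a structured version might look like this sketch. The field names mirror the table above; the values are an invented example.

```python
# The one-page AI feature spec as a structured artifact.
from dataclasses import dataclass

@dataclass
class AIFeatureSpec:
    user_job: str
    first_safe_version: str        # draft, recommend, prepare, or limited action
    required_context: list[str]
    forbidden_behavior: list[str]
    review_step: str
    test_set_size: int             # aim for 20-50 realistic examples
    failure_response: str
    success_metric: str
    max_latency_seconds: float     # slower than this feels worse than the old workflow
    data_boundary: str

spec = AIFeatureSpec(
    user_job="turn support tickets into a weekly theme summary",
    first_safe_version="draft",
    required_context=["last 7 days of tickets", "product area taxonomy"],
    forbidden_behavior=["invent ticket quotes", "name individual customers"],
    review_step="support lead edits before the summary is shared",
    test_set_size=30,
    failure_response="say what context is missing instead of guessing",
    success_metric="summary shared without major edits, week over week",
    max_latency_seconds=30.0,
    data_boundary="ticket text only; no account or billing data",
)
```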

This is the bridge between “I know how to use ChatGPT” and “I can help ship AI responsibly.”

Three examples worth remembering

The best AI product lessons are not abstract. They show up in how real teams decide where AI belongs.

Morgan Stanley: trust comes from testing, not vibes

Morgan Stanley’s advisor assistant is useful because the work is high trust. Advisors need reliable access to internal knowledge, and bad output can damage client trust.

The important lesson is not “finance uses AI.” It is the control system around the product. OpenAI has described how Morgan Stanley used evaluations, retrieval improvements, expert feedback, and regression testing as part of the path to production. OpenAI also reported that more than 98% of Morgan Stanley advisor teams actively use the AI @ Morgan Stanley Assistant.

PM lesson: if trust is central to the job, testing is not a final polish step. It is part of the product.

Shopify Sidekick: useful AI needs product context

Shopify Sidekick is a good example because merchant questions are not generic. “Which of my customers are from Toronto?” or “Help me write SEO descriptions” depends on store data, product context, admin tools, and safe actions.

Shopify Engineering has written about building production-ready agentic systems, including architecture, LLM-based evaluation, tool complexity, and just-in-time instructions. The plain PM lesson is simple: a useful AI assistant is not just a chat box. It needs the right information, the right tools, and clear limits.

PM lesson: before asking “can the model do this?”, ask “what product context and controls would make this safe and useful?”

Humane AI Pin: ambition does not replace reliability

The Humane AI Pin is a helpful cautionary example. The vision was bold: a new AI-first device for everyday use. But for a user, the promise only matters if the product reliably helps in real situations.

That is the trap PMs should avoid. A broad AI vision can sound more impressive than a narrow AI feature. But users do not reward ambition when the basic job is unreliable.

PM lesson: start with one trusted job before expanding the promise.

A 30-day plan

Do this instead of trying to learn everything at once.

Week 1: Operate daily

Pick three recurring PM tasks. Use AI on each. Log what helped, what failed, and what you changed.

By the end of the week, save three prompts that genuinely saved time.

Week 2: Improve your briefs

Rewrite weak prompts with clearer goals, context, constraints, output format, and review instructions.

Practice turning messy asks into clear briefs.

By the end of the week, have one reusable prompt for research synthesis, one for decision memos, and one for product critique.

Week 3: Practice context

Give the AI real source material: interview notes, PRDs, support tickets, analytics summaries, release notes, or competitor pages.

Compare answers with and without context. Write down which context changed the answer most.

By the end of the week, have a repeatable context-pack checklist for your product area.

Week 4: Spec one small AI feature

Pick one real user problem. Write:

  • the user problem
  • the safe first version
  • the context the AI needs
  • 20 test examples
  • five ways the output could go wrong
  • the human review step
  • the metric that would prove repeated usefulness
  • the data boundary
  • the cost or latency limit

That artifact is more valuable than a certificate. It proves you can think about AI as a product system.

What to avoid

Avoid these beginner traps:

  • starting with model benchmarks before product use cases
  • treating prompting as secret words instead of clear briefing
  • building agents when a draft, recommendation, or workflow would solve the problem
  • launching from a polished demo without failure testing
  • ignoring stale context or data permissions
  • assuming users want AI to act when they only want help deciding
  • measuring “wow” instead of repeated usefulness
  • outsourcing judgment to the model because the answer sounds polished

If the feature does not help with a real task, it is not ready.

Minimal resource stack

Do not binge resources before using AI. Use this stack only when it answers a question that comes up while you practice.

| If you need to learn... | Start with... | What to take from it |
| --- | --- | --- |
| How LLMs work | Andrej Karpathy, Intro to Large Language Models | A plain mental model for tokens, training, and why outputs can be fluent but wrong. |
| How generative AI fits business work | DeepLearning.AI, Generative AI for Everyone | A non-technical overview of what generative AI can and cannot do, with work examples. |
| How to brief models | OpenAI Prompt Guidance and Anthropic Prompting Best Practices | Prompting as clear delegation: outcome, context, constraints, examples, and checks. |
| How to build reusable AI workflows | Anthropic Courses and Claude Courses | Structured practice for working with Claude, including courses on AI collaboration, Claude development, MCP, and reusable Claude Code Skills. |
| How to design AI UX | Google People + AI Guidebook | Trust, feedback, mental models, and user control. |
| How to design agents | Anthropic, Building Effective Agents | The difference between workflows and agents, and why simple patterns often win. |
| How to evaluate outputs | OpenAI Evaluation Best Practices, OpenAI Evals, and Anthropic Evaluation Guidance | How to turn “seems good” into repeatable tests and rubrics. |
| How production products behave | Shopify Engineering, Building production-ready agentic systems | Context, tools, evals, tool complexity, and guardrails in a real product system. |
| How high-trust teams deploy AI | OpenAI, Morgan Stanley uses AI evals | Why reliability work is product work, not only ML work. |
| How to formalize an AI PM track | Microsoft AI Product Manager Professional Certificate | A structured curriculum, useful after you have started operating with AI on real work. |

The sequence matters more than the stack. Use AI on real work first. Then learn the technical vocabulary that explains what you observed.

Operators first. ML later.

That is how PMs should start learning AI.