Frontier Models Set the Ceiling. Local Models Set the Economics.

The next AI strategy question is not just which model is best. It is which model is economically right for this workflow.

Aviral Verma, Riya Katiyar

May 28, 2026

An AI support copilot does not become expensive in the demo.

It becomes expensive after people start depending on it.

The first version answers a few test tickets. Then support agents start using it every day. Then the workflow expands: classify the ticket, summarize the customer history, retrieve relevant docs, draft the response, check tone, detect uncertainty, retry if the answer is weak, escalate if the customer is angry, and log the resolution back into the system.

What looked like one AI feature has become a chain of model calls, retries, retrieval steps, checks, and escalations.

That is where the product question changes.

In the first phase of AI adoption, teams asked: Which model is smartest?

In the next phase, they will ask: Which model is economically right for this workflow?

Frontier models will keep defining the upper edge of AI capability. By frontier models, we mean the most capable closed or hosted models at the edge of benchmark performance, reasoning, coding, tool use, and long-context work — not merely any large model API.

But as AI moves from occasional prompting into repeated workflow execution, capability is not the only constraint. Unit economics becomes product strategy.

A product team that sends every classification, rewrite, summary, routing decision, extraction task, and agent step to the strongest frontier model may ship faster at first. But if usage grows, it inherits a high variable cost base. The companies that do not learn to route work to cheaper models will struggle to scale AI workflows profitably.

That is the real strategic shift.

The market is already hinting at this. Cloud providers now sell routing, batching, provisioned throughput, and cost controls around model usage. Model platforms expose token and trace observability. Device makers are pushing NPUs and on-device inference. Open-weight model ecosystems are becoming easier to deploy. These are not separate trends. They are all signs that AI is becoming infrastructure — and infrastructure eventually gets optimized for cost, reliability, latency, and control.

The future is not frontier models versus local models. It is more likely to be a hybrid architecture: frontier models for ambiguity, hard reasoning, planning, and orchestration; cheaper and more controllable models for repeated execution.

“Local models” in the title is shorthand for a broader set of deployment options: smaller hosted models, open-weight models, self-hosted inference, and on-device inference. None are automatically cheaper. They win only when quality, utilization, latency, and operating cost all hold up for the workload in question.

Still, the direction is clear:

Frontier models set the ceiling. Local models set the economics.

The best model is not always the right business decision

A lot of AI product decisions still start with the wrong default:

Use the best available model everywhere.

That default is understandable. Frontier APIs are easy. They create strong demos. They reduce engineering friction. They are often the right choice for prototypes, low-volume/high-value tasks, hard reasoning, and workflows where a quality failure is much more expensive than inference.

But ease can become an architectural trap.

The moment an AI feature becomes a core workflow, model choice becomes part of the product’s cost structure. A support assistant, sales assistant, coding agent, analytics copilot, PM research assistant, or internal knowledge tool does not just consume intelligence once. It consumes intelligence repeatedly.

Anthropic’s guidance on building effective agents is useful here because it makes the tradeoff explicit: agentic systems can perform better, but they often trade additional latency and cost for that performance. LangSmith’s cost tracking documentation says the same thing operationally: agents at scale introduce non-trivial usage-based costs that can be difficult to track.

The reason is simple. Agentic workflows rarely involve just one generation. They involve planning, tool calls, retries, handoffs, guardrails, context accumulation, evaluation loops, and sometimes recovery from failed tool use.

A single model call is easy to price.

A workflow is harder.

An agentic workflow is harder still.

PMs should therefore stop asking only, “Can the model do the task?”

They should also ask:

How many model calls happen per user action?
How many retries happen before success?
Which steps actually require frontier intelligence?
Which steps are repetitive enough for a smaller model?
What is the cost per completed workflow?
More importantly, what is the cost per successful workflow?

That last metric matters. A cheaper model is only cheaper if it preserves task success. If it increases retries, escalations, hallucinations, or user abandonment, the apparent token savings are fake.
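To make the difference between the two metrics concrete, here is a minimal sketch in Python. The function name, the dollar figures, and the success rates are illustrative assumptions, not measurements from any real system; total_cost is assumed to already include retries.

```python
def workflow_costs(total_cost: float, runs: int, success_rate: float) -> tuple[float, float]:
    """Return (cost per completed run, cost per successful run)."""
    if runs <= 0 or not 0 < success_rate <= 1:
        raise ValueError("runs must be positive and success_rate must be in (0, 1]")
    per_completed = total_cost / runs
    per_successful = total_cost / (runs * success_rate)
    return per_completed, per_successful

# Hypothetical numbers for 1,000 runs of the same workflow on two models.
frontier = workflow_costs(total_cost=40.0, runs=1000, success_rate=0.95)
small = workflow_costs(total_cost=30.0, runs=1000, success_rate=0.60)
```

With these made-up inputs the small model looks cheaper per completed run ($0.03 versus $0.04) but costs more per successful run ($0.05 versus about $0.042). That is the pattern to watch for, and the calculation still leaves out human cleanup and user abandonment.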

The pricing debate is more complicated than “AI will get cheaper”

The strongest counterargument is obvious: model costs are falling.

That is true.

The Stanford 2025 AI Index reported a dramatic decline in inference costs for systems reaching roughly GPT-3.5-level performance, with costs falling more than 280x between late 2022 and late 2024. Epoch AI has also tracked rapid declines in LLM inference prices.

So yes, intelligence is getting cheaper.

But this does not mean cost stops mattering.

First, price declines are uneven. The cost of matching the performance of older or smaller models can fall quickly while the most capable reasoning, coding, long-context, and multimodal frontier models remain premium.

Second, usage expands when capability becomes useful. If AI moves from a side feature to a workflow layer, the number of calls can grow faster than unit prices fall. Agents intensify this because one user request can trigger many model calls.
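The arithmetic is simple enough to sketch. The function and every number below are hypothetical, chosen only to show how call growth can outrun a price decline.

```python
def spend_multiplier(price_decline: float, call_growth: float) -> float:
    """Relative change in total spend when the unit price falls by
    `price_decline`x and the number of model calls grows by `call_growth`x."""
    if price_decline <= 0:
        raise ValueError("price_decline must be positive")
    return call_growth / price_decline

# Hypothetical: unit price falls 4x, usage grows 3x, and each request
# now triggers 5 model calls instead of 1.
multiplier = spend_multiplier(price_decline=4.0, call_growth=3 * 5)
print(f"Total spend changes by {multiplier:.2f}x")  # prints "Total spend changes by 3.75x"
```

In this made-up scenario the bill nearly quadruples even though the unit price fell by 75 percent.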

Third, frontier AI providers still need sustainable economics. Training, inference, research talent, and data center capacity are expensive. Many AI products have spent the last few years competing aggressively and subsidizing adoption. It is risky for enterprises to build workflows that only make sense if premium intelligence stays cheap forever.

This does not mean frontier APIs become unaffordable.

It means frontier-only architecture becomes harder to justify as the default.

Product leaders need a model cost strategy the same way cloud teams needed a cloud cost strategy. Not because cloud became useless, but because it became ubiquitous.

Open, local, self-hosted, and small are not the same thing

The terminology matters.

People often use “open-source model,” “open model,” “open-weight model,” “local model,” “self-hosted model,” and “small language model” as if they are interchangeable. They are not.

An open-source AI model, under the Open Source Initiative’s definition, should give users the freedom to use, study, modify, and share the system, including sufficient information about data, code, and parameters.

An open-weight model makes the trained parameters available. Many popular “open” models are more precisely open-weight: downloadable and useful for fine-tuning or deployment, but not necessarily fully open source under the OSI definition. OSI makes this distinction in its note on open weights.

A local model describes where inference runs: on a laptop, workstation, phone, edge device, or local server. Local does not automatically mean open. Apple Intelligence is a good example: it uses on-device models, but those models are not open-weight.

A self-hosted model is operated by the organization itself, often in its own cloud, VPC, Kubernetes environment, or data center. It may be remote from the user but controlled by the company.

A small language model describes size and deployment profile, not license or hosting. Small models can be open or closed, local or hosted, general-purpose or domain-specific.

This distinction matters because the argument is not “open source is cheaper.”

The argument is that AI teams will need a portfolio of model options: frontier APIs, smaller hosted models, open-weight models, self-hosted models, on-device models, and deterministic software. Each has a different cost, quality, privacy, latency, and operating profile.

The winning question is not, “What is the best model?”

It is, “What is the right model for this step in the workflow?”

The workflow layer will not be frontier-only

Frontier models are still the right choice for many tasks.

Use them when the work is ambiguous, high-value, high-risk, difficult to evaluate, or genuinely reasoning-heavy: strategic synthesis, complex planning, difficult coding, multi-agent orchestration, executive analysis, and tasks where a bad answer is much more expensive than a high inference bill.

But most enterprise workflows are not made only of those tasks.

A support workflow, for example, can be decomposed:

classify the ticket — smaller/local model or deterministic rules
retrieve relevant docs — search, vector retrieval, reranking
summarize account context — smaller or mid-tier model
draft the response — strong hosted or open-weight model
detect uncertainty — classifier/evaluator
escalate complex cases — frontier model or human
log the resolution — deterministic workflow automation

That is a different architecture from “send the whole thing to the best model.”

It is also a better PM conversation.

The point is not to downgrade quality. The point is to allocate intelligence where it creates leverage.
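One way to make that allocation explicit is to write the decomposition down as data. The sketch below mirrors the support workflow above. The step names and tier labels are illustrative assumptions, not a standard taxonomy.

```python
# (step, handler tier) pairs mirroring the support workflow described above.
# Step names and tier labels are hypothetical.
SUPPORT_WORKFLOW = [
    ("classify_ticket", "small_model_or_rules"),
    ("retrieve_docs", "search_and_rerank"),
    ("summarize_account", "small_or_mid_model"),
    ("draft_response", "strong_hosted_or_open_weight"),
    ("detect_uncertainty", "classifier"),
    ("escalate_complex", "frontier_or_human"),
    ("log_resolution", "deterministic"),
]

def steps_using(workflow: list[tuple[str, str]], tier_prefix: str) -> list[str]:
    """List the steps whose tier label starts with the given prefix."""
    return [step for step, tier in workflow if tier.startswith(tier_prefix)]

frontier_steps = steps_using(SUPPORT_WORKFLOW, "frontier")
```

Here only one of the seven steps is assigned to a frontier model. A table like this also gives the team a single place to review when prices or model quality change.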

This is becoming more practical because the ecosystem is maturing quickly. Meta’s Llama family helped normalize open-weight deployment at scale. The Llama 4 release pushed open-weight models further into multimodal and mixture-of-experts territory. Tools like llama.cpp and Ollama make local experimentation easier. Hugging Face Text Generation Inference, vLLM, and NVIDIA NIM make self-hosted inference more production-friendly. Apple’s on-device and server foundation model work shows local inference as a platform strategy. Microsoft’s Copilot+ PC/NPU guidance points toward more AI work happening on-device.

None of this means every company should run its own model infrastructure.

It means the model architecture space is no longer one-dimensional.

A practical model-tiering framework

A useful starting framework for PMs:

Tier 1: Frontier models for high-stakes reasoning
Use frontier models for ambiguous, high-value, or high-risk tasks: planning, synthesis, complex coding, difficult analysis, and orchestration.

Tier 2: Strong hosted or open-weight models for common workflows
Use these for repeatable but still language-heavy tasks: support drafts, internal knowledge assistance, structured research, summarization, extraction, and domain-specific workflow steps.

Tier 3: Small/local models for frequent execution
Use smaller or local models for high-volume tasks: classification, routing, templated rewriting, lightweight summarization, privacy-sensitive personal workflows, and simple policy checks.

Tier 4: Deterministic software where AI is unnecessary
Use rules, search, SQL, forms, scripts, product logic, and workflow automation when they are cheaper and more reliable than a model.

The operating principle:

Use frontier models for ambiguity. Use cheaper models for repetition. Use software when intelligence is unnecessary.

A simple routing rule follows:

Start with the cheapest deterministic path.
If language understanding is needed, try the smallest model that passes the eval.
Escalate to a stronger hosted or open-weight model for ambiguous cases.
Reserve frontier reasoning for high-uncertainty, high-value, or high-risk steps.
Continuously measure quality, latency, escalation rate, and cost per successful workflow.
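The escalation logic in these steps can be sketched as a small cascade. This is a minimal illustration, not a production router: the handlers are hypothetical stand-ins, and returning None stands in for "did not pass the eval or confidence check".

```python
from typing import Callable, Optional

# A handler returns an answer, or None if it cannot handle the task confidently.
Handler = Callable[[str], Optional[str]]

def route(task: str, tiers: list[tuple[str, Handler]]) -> tuple[str, str]:
    """Try tiers from cheapest to most expensive; return (tier_name, answer)."""
    for name, handler in tiers:
        answer = handler(task)
        if answer is not None:
            return name, answer
    raise RuntimeError("no tier could handle the task; escalate to a human")

# Hypothetical stand-ins for real handlers.
def rules(task: str) -> Optional[str]:
    return "refund_policy" if "refund" in task.lower() else None

def small_model(task: str) -> Optional[str]:
    # Pretend the small model is only confident on short inputs.
    return "general_question" if len(task) < 80 else None

def frontier_model(task: str) -> Optional[str]:
    return "needs_detailed_analysis"

TIERS = [("rules", rules), ("small_model", small_model), ("frontier", frontier_model)]
```

Calling route("Where is my refund?", TIERS) stops at the rules tier, and only tasks that every cheaper tier declines reach the frontier handler. In a real system the confidence check would come from an eval or a calibrated classifier rather than input length.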

This is why model routing becomes a product capability.

Amazon Bedrock’s Intelligent Prompt Routing is already an example of the pattern: route prompts across supported model options to balance response quality and cost, with AWS claiming cost reductions of up to 30% for supported routing use cases. The exact economics will vary, but the product architecture signal is clear.

The future AI stack is not one model.

It is a router.

The product strategy question is workflow ownership

The buyer for this discipline is not only the infrastructure team. It is the PM, founder, AI product lead, CTO, CIO, and product ops leader who owns whether an AI workflow can scale without destroying margin.

The wedge usually starts small: support triage, sales notes, internal knowledge search, coding assistance, PM research, document extraction, or analytics help. But once a workflow becomes trusted, it expands. More users invoke it, more steps get automated, and more teams build dependencies around it.

That is why model economics can become a moat or a liability. A competitor that routes work intelligently can offer richer AI workflows at a lower marginal cost. A team that sends everything to a premium model may look better in the demo but worse at scale.

The defensibility is not in claiming to use the smartest model. Everyone can buy API access. The defensibility is in knowing which parts of the workflow need expensive intelligence, which parts need cheap execution, which parts need deterministic software, and how the whole system is evaluated.

What PMs should instrument

If AI is becoming part of the workflow layer, PMs need to instrument AI like a product system, not a magic text box.

For every important AI workflow, track:

Question	Metric
What workflow is being measured?	Workflow name and user action
How much AI is involved?	Model calls per workflow, tool calls, retries
How much context is consumed?	Input tokens, output tokens, context length
Did it work?	Completion rate, success rate, eval score
Where did it fail?	Human escalation rate, retry rate, abandonment rate
How fast was it?	p50/p95 latency
What did it cost?	Cost per completed workflow, cost per successful workflow
What is the fallback?	Backup model, fallback path, human escalation owner
Who owns quality?	Evaluation owner and review cadence

This table is the difference between “we added AI” and “we understand the economics, quality, and ownership of this AI workflow.”

The former is a feature claim.

The latter is an operating system.
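As a rough sketch of what that instrumentation could look like, the record below captures one workflow run and rolls a batch of runs up into a few of the metrics from the table. The field names are assumptions for illustration and are not tied to any particular tracing tool. Latency percentiles and eval scores are left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class WorkflowRun:
    workflow: str
    model_calls: int
    retries: int
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    succeeded: bool
    escalated_to_human: bool

def summarize(runs: list[WorkflowRun]) -> dict[str, float]:
    """Aggregate per-run records into workflow-level metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    total_cost = sum(r.cost_usd for r in runs)
    successes = sum(r.succeeded for r in runs)
    return {
        "success_rate": successes / n,
        "escalation_rate": sum(r.escalated_to_human for r in runs) / n,
        "avg_model_calls": sum(r.model_calls for r in runs) / n,
        "cost_per_completed": total_cost / n,
        # Infinite when nothing succeeded: all spend, no successful outcome.
        "cost_per_successful": total_cost / successes if successes else float("inf"),
    }
```

The useful property is that cost and success live in the same record, so cost per successful workflow can be computed directly instead of being stitched together from separate billing and quality dashboards.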

The honest case against open and local models

There is a serious counterargument.

Open-weight, self-hosted, and local models are not free. They can introduce infrastructure, evaluation, security, maintenance, observability, GPU capacity planning, model update, and talent costs. A self-hosted model shifts spend from API tokens to infrastructure and operations. Dedicated GPU endpoints can become expensive if utilization is low. Smaller models can be more expensive in practice if they fail often and require retries or human cleanup.

This is why the argument should not become ideological.

Self-hosting is worth considering when volume is high and predictable, the task pattern is stable, the eval set is strong, privacy or compliance matters, the team has ML infrastructure capacity, or provider-risk sensitivity is high.

It is less attractive when volume is low, the product behavior is changing quickly, evals are weak, the team lacks serving expertise, or latency and availability requirements are high and the team lacks the operational maturity to meet them.
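The volume condition can be roughed out with a break-even calculation. This is a deliberately simplified sketch: the function and prices are hypothetical, the fixed cost is assumed to bundle GPUs and operations, and it assumes both options deliver the same task success rate, which is often not true.

```python
def breakeven_mtok_per_month(
    fixed_monthly_cost: float,
    api_price_per_mtok: float,
    selfhost_marginal_per_mtok: float = 0.0,
) -> float:
    """Monthly volume, in millions of tokens, at which self-hosting matches API spend."""
    saving_per_mtok = api_price_per_mtok - selfhost_marginal_per_mtok
    if saving_per_mtok <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_cost / saving_per_mtok

# Hypothetical: $6,000 per month for GPUs and operations,
# versus an API priced at $2 per million tokens.
volume = breakeven_mtok_per_month(fixed_monthly_cost=6000.0, api_price_per_mtok=2.0)
```

With these made-up figures, self-hosting only pays off above 3,000 million tokens per month. Below that volume, the dedicated capacity sits underutilized and the API is cheaper.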

Frontier APIs are still the right default for many teams and many stages: early prototypes, low-volume workflows, premium reasoning tasks, and organizations that need capability before infrastructure control.

The point is not to replace frontier models everywhere.

The point is to stop using them everywhere by default.

The individual user angle matters too

This shift will not only happen inside enterprises.

Individuals will feel it as AI becomes part of daily work.

Some workflows are personal and private: notes, files, drafts, local search, document processing, experimentation, lightweight coding assistance. A Mac Studio, GPU workstation, or AI PC will not replace frontier intelligence for hard reasoning, complex coding, or high-quality synthesis. But it may be enough for many frequent personal tasks where privacy, speed, and zero marginal API cost matter.

That should matter to product builders.

Consumer AI products may also become hybrid: cloud models for hard tasks, on-device models for frequent private tasks, and deterministic software for everything that should not be modeled at all.

If intelligence remains priced only like a premium cloud service, adoption becomes uneven. Large companies and wealthy users experiment freely. Others ration usage.

Local and open-weight models do not solve every inequality in AI access.

But they can make intelligence more usable, more private, and more frequently available.

The durable signals are stronger than this week’s headlines

The strongest evidence for this shift is not any single announcement. It is the convergence of infrastructure moves.

Agent frameworks and tracing tools increasingly expose the hidden structure of AI workflows: generations, tool calls, retries, handoffs, and evaluations. That makes cost visible at the workflow level, not just the model-call level.

Cloud platforms are adding routing and batching because model choice is becoming an optimization problem. Device platforms are adding NPUs because some inference will move closer to the user. Open-weight model tooling is improving because teams want more control over deployment, privacy, and marginal cost.

These signals all point in the same direction: AI is moving from isolated prompts into operating workflows. Once that happens, model economics becomes impossible to ignore.

What PMs should do now

PMs do not need to become model infrastructure experts.

But they do need to stop treating model choice as someone else’s backend decision.

For every serious AI feature, ask five questions:

1. What level of intelligence does each step actually require?
Do not send repetitive work to the most expensive model just because it is easy.

2. What is the cost per successful workflow?
Measure the whole workflow, including retries, tool calls, escalations, and failures.

3. Where can we route simple cases to cheaper models?
Preserve frontier intelligence for the cases where it changes the outcome.

4. What happens when usage grows 10x?
If the feature only works economically at low usage, it is not ready to become core infrastructure.

5. What is our fallback model strategy?
Know what happens if pricing changes, latency spikes, quality drops, provider terms shift, or a workflow needs to run in a more private environment.
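A fallback strategy can start as something very small. The sketch below tries providers in order and returns the first successful response. The provider functions are hypothetical placeholders rather than real SDK calls, and a production version would catch provider-specific errors instead of a bare Exception.

```python
from typing import Callable

Provider = Callable[[str], str]

def call_with_fallback(prompt: str, providers: list[tuple[str, Provider]]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: any exception means "try the next provider"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Hypothetical providers: the primary times out, the backup answers.
def primary(prompt: str) -> str:
    raise TimeoutError("primary provider timed out")

def backup(prompt: str) -> str:
    return f"[backup] {prompt}"

result = call_with_fallback("Summarize ticket 123", [("primary", primary), ("backup", backup)])
```

The order of the list is where the strategy lives: it can encode cost, privacy requirements, or provider risk, and it should be exercised against the same evals as the primary path.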

The companies that learn this discipline early will be able to scale AI workflows profitably.

The companies that ignore it may discover that their AI roadmap works beautifully in demos and badly in margins.

That is the point of the whole piece. The next AI strategy question is not just:

Which model is best?

It is:

Which model is economically right for this workflow?

That is the question product leaders should start asking now.

This week’s AI news confirms the shift

The strongest current signals are not only about bigger models. They are about AI moving into repeatable work systems where cost, routing, review, and improvement loops matter.

Cisco’s Codex rollout shows AI coding agents becoming enterprise infrastructure — Cisco is using Codex for AI-native development, AI Defense work, and defect remediation. The PM takeaway: coding agents are becoming workflow infrastructure, not just developer helpers.
OpenAI’s tax-agent case study shows the next agent moat is the learning loop — OpenAI, Thrive, and Crete’s self-improving tax-agent example points to the next competition in vertical AI: not just who has the best model, but who owns the review and feedback loop.
Anthropic’s Glasswing update shows AI security has a new bottleneck: patching — Anthropic says Claude Mythos Preview and partners found more than ten thousand high- or critical-severity vulnerabilities. The strategic point is that AI can change the constraint from finding issues to verifying, disclosing, and fixing them.

Together, these three stories reinforce the same lesson: once AI becomes operational, the product advantage comes from the system around the model.

Sources and further reading

Anthropic, “Building effective agents” — https://www.anthropic.com/engineering/building-effective-agents
Anthropic Claude pricing — https://docs.anthropic.com/en/docs/about-claude/pricing
Google Gemini API pricing — https://ai.google.dev/gemini-api/docs/pricing
AWS Bedrock pricing and Intelligent Prompt Routing — https://aws.amazon.com/bedrock/pricing/
Stanford HAI, AI Index Report 2025 — https://hai.stanford.edu/ai-index/2025-ai-index-report
Epoch AI, LLM inference price trends — https://epoch.ai/data-insights/llm-inference-price-trends
LangSmith cost tracking — https://docs.langchain.com/langsmith/cost-tracking
Open Source Initiative, Open Source AI Definition — https://opensource.org/ai/open-source-ai-definition
OSI, Open Weights — https://opensource.org/ai/open-weights
Meta Llama 3.1 — https://ai.meta.com/blog/meta-llama-3-1/
Meta Llama 4 — https://ai.meta.com/blog/llama-4-multimodal-intelligence/
llama.cpp — https://github.com/ggml-org/llama.cpp
Ollama — https://ollama.com/
Hugging Face Text Generation Inference — https://huggingface.co/docs/text-generation-inference/index
Apple, Introducing Apple’s On-Device and Server Foundation Models — https://machinelearning.apple.com/research/introducing-apple-foundation-models
Apple Intelligence — https://www.apple.com/apple-intelligence/
Microsoft Learn, NPU devices and Copilot+ PCs — https://learn.microsoft.com/en-us/windows/ai/npu-devices/
NVIDIA NIM — https://developer.nvidia.com/nim
