In June 2026, frontier LLM output prices span 643×—this guide maps cost, config, performance, and audience so your bill and context window stay under control.
2026 LLM pricing at a glance
1.1 Flagship tier: peak capability, peak price
| Model | Vendor | Input | Cached input | Output | Context |
|---|---|---|---|---|---|
| GPT-5.5 Pro | OpenAI | $30 | — | $180 | ~1M (effective ~258K) |
| Claude Fable 5 | Anthropic | $10 | $1 | $50 | 1M |
| GPT-5.5 | OpenAI | $5 | $0.50 | $30 | ~1M (effective ~258K) |
| Claude Opus 4.8 | Anthropic | $5 | $0.50 | $25 | 1M |
| Claude Sonnet 4.6 | Anthropic | $3 | $0.30 | $15 | 1M (flat rate) |
| Gemini 3.1 Pro ≤200K | $2 | $0.20 | $12 | 2M | |
| Gemini 3.1 Pro >200K | $4 | $0.40 | $18 | 2M | |
| DeepSeek V4 Pro | DeepSeek | $0.435 | $0.0036 | $0.87 | 128K–1M |
Three counterintuitive facts:
- Gemini 3.1 Pro is the cheapest flagship. Per million tokens, input runs 60% below GPT-5.5 and output 60% below—advantage grows in long-context workloads.
- Claude Opus 4.8 and GPT-5.5 share the same input price ($5), but Claude output is 17% cheaper. Generate 1M output tokens and Opus saves you $5.
- DeepSeek V4 Pro output costs less than Gemini's cheapest Flash-Lite. This isn't "open-source good enough"—it's official commercial API pricing.
1.2 Mid tier: the daily production sweet spot
| Model | Input | Output | Context | Best for |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15 | 1M | Balanced pick inside the OpenAI stack |
| GPT-5.3 Codex | $1.75 | $14 | 128K | Code completion, IDE integration |
| Gemini 3.5 Flash | $1.50 | $9 | 1M | Multimodal + faster inference |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | Low latency, high concurrency |
| Kimi K2.6 | $0.60 | $2.50 | 262K | Long-form CJK understanding |
| Qwen3.5-Plus | $0.40 | $2.40 | 1M | Alibaba Cloud ecosystem, CJK workloads |
1.3 Economy tier: the moat for high-volume calls
| Model | Input | Output | Notes |
|---|---|---|---|
| GPT-5.4-nano | $0.20 | $1.25 | Lowest US closed-source tier |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | Native multimodal |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | Rock-bottom pricing |
| DeepSeek V4 Flash | $0.14 | $0.28 | Cache-hit input $0.0028 |
| 小米 MiMo-V2.5-Flash | $0.10 | $0.30 | Ultra-low China-built tier |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M context + live search |
How wide is the spread? On output tokens, with DeepSeek V4 Flash as 1× baseline: GPT-5.5 is 107×, GPT-5.5 Pro is 643×, Claude Fable 5 is 179×.
Configuration: what actually drives your bill
2.1 Context windows: advertised ≠ usable
The context vendors advertise and the context you can rely on in production are often two different numbers.
| Model | Advertised context | Practical ceiling | Gotcha |
|---|---|---|---|
| GPT-5.5 | 1M | Lossy compression after ~258K | Long agent jobs "forget" mid-run |
| Claude Sonnet 4.6 | 1M | 1M flat rate, no tier jump | Best long-context value |
| Gemini 3.1 Pro | 2M | Input price doubles past 200K | Price RAG full-doc dumps before you ship |
| DeepSeek V4 Pro | 128K–1M | Depends on deployment tier | Extra compliance review outside China |
| Kimi K2.6 | 262K | 262K | Strong on long CJK documents |
Selection tip: If your RAG pipeline routinely exceeds 200K tokens, pick Claude Sonnet 4.6 (1M flat rate) or keep Gemini 3.1 Pro requests under 200K. Otherwise bill and latency both spiral.
2.2 Prompt Caching: up to 90% off, three different playbooks
In 2026, running production without caching means paying full freight for system prompts and doc libraries on every request.
| Vendor | Cache discount | Mechanism | Caveat |
|---|---|---|---|
| Anthropic | Up to 90% | Manual cache_control breakpoints |
5-minute / 1-hour write pricing tiers |
| OpenAI | 50% | Automatic, no config | 1024+ tokens, identical prefix → hit |
| Up to 90% | Implicit + explicit | Hourly storage fee on top—low hit rate can cost more | |
| DeepSeek | Up to 99% | Automatic | V4 Flash cache-hit input only $0.0028/M |
Typical savings: Assume 1M input tokens/day, 60% repeated system prompt + RAG context:
- Claude Opus 4.8: $5 → ~$2.3/day (54% saved)
- GPT-5.5: $5 → ~$3.2/day (36% saved)
- Gemini 3.1 Pro: $2 → ~$1.1/day (45% saved)
- DeepSeek V4 Pro: $0.435 → ~$0.05/day (89% saved)
2.3 Batch API & reasoning tiers
- Batch API (OpenAI / Anthropic / Google): Another 50% off non-real-time work—offline data processing, bulk translation, benchmark sweeps.
- Reasoning effort tiers: GPT-5.5
xhigh, Claudeextended thinkingburn hidden reasoning tokens billed as output. A reply that "looks like 500 tokens" can consume 5,000+ reasoning tokens. - Priority queue (OpenAI): 2.5× surcharge for lower latency. Rarely worth it except SLA-sensitive online services.
2.4 Tokenizer traps: same CJK text, 35% more tokens
Anthropic changed tokenizers starting with Opus 4.7—identical text can use up to 35% more tokens. List prices didn't move; your bill did. For CJK workloads, DeepSeek and Qwen tokenizers usually beat GPT-family counts—that's not rounding error; it's a 10–20% cost gap.
Performance: what benchmarks tell us
3.1 Coding: SWE-bench Verified (June 2026)
SWE-bench Verified tests whether a model can fix real GitHub issues—500 human-verified tasks, far more meaningful than "write Hello World."
| Rank | Model | SWE-bench Verified | Output price ($/M) |
|---|---|---|---|
| 1 | Claude Fable 5 | 95.0% | $50 |
| 2 | Claude Opus 4.8 | 88.6% | $25 |
| 3 | GPT-5.5 | 82.6% | $30 |
| 4 | Claude Opus 4.7 | 82.0% | $25 |
| 5 | Gemini 3.5 Flash | 79.8% | $9 |
| 6 | Gemini 3.1 Pro | 80.6% | $12 |
| 7 | DeepSeek V4 | ~81% | $0.87 |
How to read the board:
- Claude still leads on code. Fable 5 and Opus 4.8 sit a tier above the rest. If you live in Cursor, Claude Code, or Devin-class tools, the gap shows up as "fixed on the first try vs not."
- GPT-5.5 is strong overall, not #1 on code. Tool calling, multimodal, and ecosystem integration are its home turf.
- DeepSeek V4 at 81% on $0.87/M output is wild value. For indie Vibe Coding, it's the lowest-cost "good enough" tier.
Heads-up: SWE-bench scores depend heavily on agent scaffolding. Vendor self-reported numbers often run 15–30 points above standardized public evals. Compare like-for-like scaffolding, not absolute scores.
3.2 Reasoning & knowledge: MMLU-Pro, GPQA, long context
| Capability | Leaders | Notes |
|---|---|---|
| Complex multi-step reasoning | Claude Fable 5, GPT-5.5 Pro | Math proofs, legal analysis, research assist |
| Long-document understanding | Gemini 3.1 Pro (2M), Claude Sonnet 4.6 (1M flat) | Drop a whole PDF in and Q&A |
| Multimodal (image/audio/video) | Gemini 3 family, GPT-5.5 | Native vision + audio understanding |
| Live search | Grok 4.x | News and sentiment needing fresh data |
| CJK comprehension & generation | DeepSeek V4, Qwen3.5, Kimi K2.6 | Higher CJK token efficiency |
3.3 Latency & throughput
| Model | Time to first token | Throughput | Best for |
|---|---|---|---|
| Claude Haiku 4.5 | Very low | High | Live support, real-time classification |
| Gemini 2.5 Flash-Lite | Very low | Very high | Millions of calls per day |
| GPT-5.4-nano | Low | High | Light tasks inside OpenAI stack |
| Claude Opus 4.8 | Medium | Medium | Complex single-turn reasoning |
| Claude Fable 5 | High | Low | Long-running agents—seconds don't matter |
Who should use what
4.1 Indie developers / Vibe Coding
Recommended stack:
- Daily coding: Claude Opus 4.8 (API) or Claude Code Max $100/mo subscription
- Budget fallback: DeepSeek V4 Pro
- Ultra-light: Gemini 2.5 Flash-Lite
Do the math: Claude Code Max at $100/mo ≈ 50 heavy Opus sessions. If you code 2+ hours daily, subscription beats pay-per-token. Below that, DeepSeek V4 Pro API is cheaper.
Critical: Set a spending hard cap in Cursor and similar tools. Community reports show MAX mode burning $11,922 in four weeks.
4.2 Startup teams / small SaaS
Recommended stack:
- Core reasoning: Gemini 3.1 Pro (flagship value)
- Code agents: Claude Sonnet 4.6 (1M flat context pricing)
- High-volume backend: DeepSeek V4 Flash + Batch
- Model routing: Sonnet for hard tasks, Flash-Lite for simple classification
Monthly estimate (mid-size SaaS, 5M tokens/day):
| Approach | Monthly cost (no cache) | Monthly cost (40% cache) |
|---|---|---|
| All Claude Sonnet 4.6 | ~$3,900 | ~$2,574 |
| All Gemini 3.1 Pro | ~$2,640 | ~$1,743 |
| All DeepSeek V4 Pro | ~$438 | ~$289 |
| Routed (20% Sonnet + 80% Flash) | ~$1,200 | ~$750 |
4.3 Enterprise / compliance-sensitive teams
Recommended stack:
- International entities: AWS Bedrock (Claude) or Vertex AI (Gemini)
- Code security review: Claude Opus 4.8 + private Git integration
- Not recommended: third-party OpenAI proxies (cross-border data risk > savings)
Must-do checklist:
- Per-API-key budget caps and alerts
- Enable Prompt Caching (30–50% off in production)
- Model routing policy—don't send every request to Opus
- Run code agents in isolated environments (Cloud Mac / containers)—never bare-metal execution
4.4 AI developers / agent framework builders
Recommended stack:
- Long-horizon autonomous agents: Claude Fable 5
- Tool-call orchestration: GPT-5.5
- Local dev & test: Apple Silicon Mac + quantized Qwen/DeepSeek
- Production fallback: Gemini 3.1 Pro (long context + low price)
Why Apple Silicon? In 2026, agent development bottlenecks aren't only model APIs—they're the execution environment. Claude Code needs to run Xcode tests on macOS, verify iOS builds on real devices, and stay alive in tmux overnight. The smartest model is wasted if SSH drops mid-run and you've already burned dollars in tokens. See The Model Arms Race Is Over—Why Mac Compute Nodes Are Suddenly Hard to Get.
4.5 Global SaaS / multilingual support
Recommended stack:
- Primary: DeepSeek V4 Pro (translation, summarization, support)
- US/EU-facing: Gemini 3.1 Flash-Lite or GPT-5.4-nano
- Quality polish: Claude Haiku 4.5
4.6 Students / researchers
Recommended stack:
- Gemini 3 Flash Preview (free tier available)
- DeepSeek V4 Flash (cheap experiments)
- Local: Mac Mini M4 running 7B–32B quantized models for prototyping
Real cost math: three scenarios
Scenario A: AI support bot (100K turns/day)
Assume per turn: 2K input + 500 output, 80% repeated system prompt (cache hits).
| Model | Daily cost | Monthly cost |
|---|---|---|
| GPT-5.4-nano | ~$5.5 | ~$165 |
| Gemini 2.5 Flash-Lite | ~$3.2 | ~$96 |
| DeepSeek V4 Flash | ~$1.8 | ~$54 |
| Claude Haiku 4.5 | ~$12 | ~$360 |
Verdict: Support doesn't need flagships. DeepSeek V4 Flash or Gemini Flash-Lite is enough—keep monthly spend under $100.
Scenario B: Code agent (50 repo-scale jobs/day)
Assume per job: 50K input + 20K output, 10 tool-call rounds.
| Model | Daily cost | Monthly cost |
|---|---|---|
| Claude Opus 4.8 | ~$50 | ~$1,500 |
| GPT-5.5 | ~$58 | ~$1,740 |
| DeepSeek V4 Pro | ~$2.5 | ~$75 |
| Claude Fable 5 | ~$100 | ~$3,000 |
Verdict: Opus 4.8 for quality, DeepSeek V4 Pro to save money (accept some success-rate drop), Fable 5 for long autonomous runs.
Scenario C: Long-document RAG Q&A (1,000 queries/day, 150K input each)
| Model | Daily cost | Monthly cost |
|---|---|---|
| Gemini 3.1 Pro (≤200K) | ~$360 | ~$10,800 |
| Claude Sonnet 4.6 (1M flat) | ~$495 | ~$14,850 |
| Gemini 3.1 Pro (>200K tier) | ~$540 | ~$16,200 |
Verdict: For long-doc RAG, prefer Gemini 3.1 Pro under 200K or Claude Sonnet 4.6's 1M flat rate. Optimize chunking before launch—don't pipe the whole book on every query.
Five rules for 2026 model selection
- Profile request shape first, then pick a model. High output ratio → flagship; high repeated input → cache-friendly; long context → flat-rate tier.
- Route, don't monolith. The cheapest 2026 stack isn't one model—it's 80% Flash traffic, 20% flagship.
- Caching is mandatory, not optional. Production without Prompt Caching is a voluntary 30–50% tax.
- Total cost beats sticker price. DeepSeek is cheapest on paper; outside China, add compliance audits, account stability, and cross-border data risk.
- Models are the brain; execution is the body. In the agent era, API spend is half the story—the other half is whether the machine running the agent stays up 24/7.
Apple Silicon: local compute + cloud API hybrid
The pragmatic 2026 AI stack isn't all-API or all-local—it's layered:
| Layer | Workload | Hardware / service |
|---|---|---|
| Local (Apple Silicon) | Code completion, small-model inference, data prep | Mac Mini M4 / M4 Pro, 7B–32B quantized |
| Cloud API (per token) | Complex reasoning, long context, multimodal | Claude / Gemini / DeepSeek |
| Cloud execution node (per hour) | Agents running Xcode, CI builds, long jobs | Cloud Mac (Vuncloud) |
Apple Silicon unified memory gives M4 series a natural edge running 14B–32B quantized models—low power, quiet, no NVIDIA required. What local can't do: Claude Code compiling iOS projects, Xcode UI tests on macOS, a weekend migration job in tmux. In those cases, execution-node stability matters more than model choice.
FAQ
What's the cheapest production-ready model in 2026?
DeepSeek V4 Flash ($0.14/$0.28) and Gemini 2.5 Flash-Lite ($0.10/$0.40) tie for the floor. For CJK text, DeepSeek's tokenizer is more efficient—you may pay less in practice.
Is GPT-5.5 still worth it after the price hike?
If you're deep in OpenAI (Assistants API, Realtime voice, DALL·E / Sora, Azure OpenAI), GPT-5.5 stays mandatory. Pure text/code workloads: Gemini 3.1 Pro and Claude Opus 4.8 deliver better value.
Claude Opus 4.8 or GPT-5.5?
Code agents → Opus 4.8 (6 points higher on SWE-bench, 17% cheaper output). Tool-heavy, multimodal, voice → GPT-5.5. Both input at $5/M.
How do I handle Gemini 3.1 Pro's 200K pricing tier?
Chunk your RAG pipeline to keep single requests under 200K, or use Gemini Context Caching for repeated documents. Past 200K, input doubles from $2 to $4.
Is DeepSeek V4 production-ready?
Strong default for teams in China and Chinese-language products going global. US and EU enterprises must weigh cross-border data rules, China's PIPL, and US federal restrictions on DeepSeek for sensitive workloads. Technically and on price it holds up—compliance is the variable.
How should an indie dev split a $50/mo budget?
DeepSeek V4 Pro as primary ($30), Gemini 2.5 Flash-Lite as backup ($10), reserve $10 for one Claude Sonnet call on the hard problems.
ChatGPT Plus / Claude Pro subscription vs API?
Under ~2 hours/day, subscriptions win for solo devs. Over ~4 hours/day or when embedding in your product, API is more flexible. Claude Code Max $100/mo ≈ 50 heavy Opus sessions.
Closing
Picking a model is step one. In 2026, the real gap is who can finish the agent job in a stable execution environment—build green, tests pass, PR merges.
Models are the brain; execution is the body. API spend is half the story—the other half is whether the machine running the agent stays up 24/7.
If you're on Claude Code for iOS / macOS, or need an agent node that doesn't drop overnight, lock in a Cloud Mac that can run till morning—then debate Fable vs Opus.
Agent dev: pick the model, secure the node
Vuncloud dedicated Mac mini M4 Cloud Mac: Claude Code marathons, Xcode build verification, overnight tmux jobs, US East/West/APAC nodes—give your agent a macOS body that stays online.
Related reading
- The Model Arms Race Is Over—Why Mac Compute Nodes Are Suddenly Hard to Get
- From Opus 4.8 to Fable 5: What Did Anthropic Actually Change?
- Is Mac Mini M4 Good for AI Development on a Cloud Mac? (2026)
Last updated: June 17, 2026. Pricing and benchmark data from vendor public rate cards and SWE-bench Verified leaderboard (June 2026).