Part of series: Inference Optimization Timeline (15 of 23)

1. LLM Inference Fundamentals: Prefill, Decode, and the Memory-Compute Divide
2. KV Cache: The Hidden Memory Giant in LLM Serving
3. Quantization for LLM Inference: From FP16 to INT4 — A Deep Dive into Precision, Performance, and Production Deployment
4. FlashAttention: Why Tiling Attention Through the Memory Hierarchy Changes Everything
5. PagedAttention: How vLLM Borrowed OS Virtual Memory to Fix LLM Serving
6. Continuous Batching: The Complete Guide to LLM Inference Scheduling
7. Speculative Decoding: Why Autoregressive LLMs Leave 99% of Your GPU Idle and How to Fix It
8. Prefix Caching: RadixAttention, Cache Hierarchies, and Reusing Computation Across Requests
9. LoRA and QLoRA for Serving: Multi-Adapter Inference, S-LoRA, and When to Merge
10. Disaggregated Prefill-Decode: Why Splitting LLM Inference Changes Everything
11. Constrained Generation: FSM-Based Decoding, Outlines, and Grammar-Guided LLM Output
12. Mamba and State Space Models: The O(n) Alternative to Attention
13. Inference-Time Compute Scaling: When More Thinking Helps (o1, DeepSeek-R1, and the Reasoning Frontier)
14. CPU and Edge Inference: llama.cpp Internals, GGUF Format, and When CPU Actually Wins
15. Inference Cost Economics: Tokens per Dollar, GPU-Hours, and the Real Math of LLM Serving
16. Batched GEMM: Why Matrix Multiply Throughput Determines Everything in LLM Inference
17. Token Generation Pipeline: Logit Processing, Sampling Strategies, and Stop Criteria
18. Memory Pool Management: Slab Allocators for GPU Inference
19. Vision-Language Model Serving: ViT Encoding, Cross-Attention, and KV Cache Paging for Multimodal
20. Long-Context Serving: Ring Attention, KV Offloading, and Chunked Processing in Production
21. Inference Profiling: Nsight Systems, torch.profiler, and Finding Where Time Actually Goes
22. FP8 Inference: E4M3 Format, Per-Tensor Scaling, and the Hardware Support Matrix
23. Speculative Decoding v2: Medusa, EAGLE, Lookahead, and Token Tree Verification

Everyone talks about LLM inference performance in tokens per second. Nobody talks about the number that actually determines whether your product is viable: tokens per dollar. A system that generates 1,000 tokens per second at USD 10/hour costs USD 2.78 per million tokens. A system that generates 200 tokens per second at USD 1/hour costs USD 1.39 per million tokens. The slower system is cheaper by half.

This disconnect between speed and cost is pervasive in the LLM serving world. Engineers optimize for throughput benchmarks while finance teams stare at cloud bills. The gap between “fast” and “cheap” grows wider as you add batching, quantization, and hardware choices to the equation. Understanding the real math of inference economics is essential for anyone deploying LLMs at scale.

This post builds the complete cost model from first principles: the fundamental equation linking hardware cost to token cost, GPU pricing across clouds and purchase models, how each major optimization technique affects unit economics, the asymmetry between input and output token costs, the throughput-latency tradeoff frontier, when to self-host vs. use APIs, distillation as a cost lever, and where costs are heading with next-generation hardware.


1. The Fundamental Equation

Cost Per Token

Every inference cost ultimately reduces to one equation:

\text{Cost per token} = \frac{\text{Cost per GPU-hour}}{\text{Tokens per GPU-hour}}

This is a fraction with two levers: reduce the numerator (cheaper hardware) or increase the denominator (more tokens per hour from the same hardware). Every optimization technique in the LLM inference stack attacks one or both of these.

Expanding the denominator:

\text{Tokens per GPU-hour} = \text{Tokens/sec/GPU} \times 3600

And tokens per second depends on the model, quantization, batch size, hardware, and sequence length:

\text{Tokens/sec} = f(\text{model\_size}, \text{quant}, \text{batch\_size}, \text{hardware}, \text{seq\_len})

There is no simple closed-form expression for this function — it depends on whether you are in the compute-bound or memory-bandwidth-bound regime, the efficiency of the serving software, and the workload characteristics. But we can derive useful approximations.
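Before deriving those approximations, the top-level equation is worth encoding directly. A minimal sketch, using the two hypothetical systems from the introduction as the example:

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_sec: float) -> float:
    """Cost per million tokens = cost per GPU-hour / tokens per GPU-hour, x 1e6."""
    tokens_per_gpu_hour = tokens_per_sec * 3600
    return gpu_cost_per_hour / tokens_per_gpu_hour * 1e6

# The two systems from the introduction:
print(cost_per_million_tokens(10.00, 1000))  # fast system: ~2.78 $/M tokens
print(cost_per_million_tokens(1.00, 200))    # slow system: ~1.39 $/M tokens
```

Note that the faster system loses: speed only matters in this equation through the denominator, and the numerator can swamp it.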

The Bandwidth-Bound Approximation

For decode (single-token generation), the throughput is approximately:

\text{Decode tok/s} \approx \frac{\text{Memory BW (GB/s)}}{\text{Model size (GB)} / \text{batch\_size}}

Wait, that is not quite right. More precisely, for batch size B:

\text{Decode tok/s} \approx \frac{B \times \text{Memory BW}}{\text{Model size} + B \times \text{KV cache per token} \times \text{seq\_len}}

For small batch sizes where the model weight read dominates, this simplifies to:

\text{Decode tok/s} \approx B \times \frac{\text{Memory BW}}{\text{Model size}}

The throughput scales linearly with batch size until you hit the compute bound (tensor core saturation) or memory capacity limit (KV cache fills up VRAM).
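The bandwidth-bound model above is easy to sketch as a function. The example figures are illustrative assumptions: Llama 70B AWQ at ~35 GB, H100 HBM bandwidth of 3,350 GB/s, and ~320 KB of FP16 KV cache per token with grouped-query attention. These are theoretical ceilings; real serving stacks typically reach 40-60% of them.

```python
def decode_tokens_per_sec(mem_bw_gbs: float, model_gb: float,
                          kv_gb_per_token: float, batch: int, seq_len: int) -> float:
    """Bandwidth-bound decode: every step re-reads the weights once plus
    each request's KV cache, and produces `batch` tokens."""
    gb_read_per_step = model_gb + batch * kv_gb_per_token * seq_len
    return batch * mem_bw_gbs / gb_read_per_step

# Illustrative: Llama 70B AWQ (~35 GB) on an H100 (3,350 GB/s),
# ~320 KB FP16 KV cache per token with grouped-query attention.
KV = 320e3 / 1e9  # GB per token
print(decode_tokens_per_sec(3350, 35, KV, batch=1, seq_len=2048))   # ~94 tok/s ceiling
print(decode_tokens_per_sec(3350, 35, KV, batch=32, seq_len=2048))  # ~1,900 tok/s ceiling
```

The batch-32 ceiling is roughly 20x the batch-1 ceiling rather than 32x, because the KV cache reads start to rival the weight read.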

The Key Insight

Decode throughput scales linearly with batch size in the memory-bandwidth-bound regime. Doubling the batch size doubles the throughput (tokens per second for the whole batch) without changing per-request latency, until you saturate compute or run out of VRAM. This is why batching is the single most important cost optimization.

The Compute-Bound Approximation

For prefill (processing input tokens), at large batch sizes the throughput is approximately:

\text{Prefill tok/s} \approx \frac{\text{FLOPS}}{2 \times N_{\text{params}}}

where N_params is the model parameter count and the factor of 2 accounts for the multiply-add per parameter. This is independent of batch size once the GEMMs are large enough to saturate the tensor cores (typically at M ≥ 128 for the batch dimension).

For Llama 70B on an H100 (989 TFLOPS FP16):

\text{Prefill tok/s} \approx \frac{989 \times 10^{12}}{2 \times 70 \times 10^9} \approx 7{,}064 \text{ tok/s}

This is the theoretical peak. In practice, FlashAttention overhead, memory access patterns, and kernel launch latency reduce this to ~4,000-5,000 tok/s.
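The compute-bound estimate, with an efficiency factor to absorb the practical losses just mentioned (the 0.6 factor is an illustrative assumption, not a measured constant):

```python
def prefill_tokens_per_sec(peak_flops: float, n_params: float,
                           efficiency: float = 1.0) -> float:
    """Compute-bound prefill: ~2 FLOPs (one multiply-add) per parameter per token."""
    return efficiency * peak_flops / (2 * n_params)

peak = prefill_tokens_per_sec(989e12, 70e9)        # ~7,064 tok/s theoretical
real = prefill_tokens_per_sec(989e12, 70e9, 0.6)   # ~4,240 tok/s at 60% efficiency
print(f"{peak:,.0f} theoretical vs {real:,.0f} realistic tok/s")
```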


2. GPU Economics: The Hardware Cost

Cloud GPU Pricing

GPU pricing varies dramatically across providers, instance types, and commitment levels. Here is the landscape as of early 2025:

Cloud GPU Pricing Comparison (Early 2025)

| GPU | Provider | Instance | On-Demand ($/hr) | Spot ($/hr) | 1-yr Reserved ($/hr) |
|---|---|---|---|---|---|
| A100 80GB | AWS | p4d.24xlarge (8 GPU) | 32.77 (4.10/GPU) | 12-18 | 22.40 (2.80/GPU) |
| A100 80GB | GCP | a2-highgpu-1g (1 GPU) | 3.67 | 1.10-1.80 | 2.56 |
| A100 80GB | Lambda | 1x A100 | 1.10 | N/A | 0.85 |
| H100 80GB | AWS | p5.48xlarge (8 GPU) | 98.32 (12.29/GPU) | 35-50 | 67.50 (8.44/GPU) |
| H100 80GB | GCP | a3-highgpu-1g (1 GPU) | 11.54 | 3.50-5.00 | 8.08 |
| H100 80GB | Lambda | 1x H100 | 2.49 | N/A | 1.89 |
| H100 80GB | CoreWeave | 1x H100 SXM | 2.23 | N/A | 1.79 |
| RTX 4090 | RunPod | 1x 4090 | 0.44 | 0.31 | N/A |

Note: Prices change frequently. AWS multi-GPU instances include CPU, RAM, and networking; the per-GPU price is approximate. Lambda/CoreWeave/RunPod are GPU-focused providers with lower overhead.

The pricing range for the same GPU (H100) spans from USD 2.23/hr (CoreWeave) to USD 12.29/hr (AWS on-demand). That is a 5.5x difference for identical hardware. The cloud provider choice alone can cut costs by 80%.

⚠️ Beware the Multi-GPU Tax

AWS and GCP often bundle GPUs into large instances (8 GPUs per instance). If you only need 1-2 GPUs, you pay for all 8. A p5.48xlarge at USD 98.32/hr costs USD 12.29/GPU, but if you only utilize 2 GPUs, your effective cost is USD 49.16/GPU. Smaller providers like Lambda and CoreWeave offer single-GPU instances, avoiding this waste.

On-Premises GPU Economics

For sustained workloads, purchasing GPUs can be dramatically cheaper than cloud rental:

\text{Effective hourly cost} = \frac{\text{Purchase price} + \text{3-year opex}}{\text{Hours in 3 years} \times \text{utilization}}

For an H100:

  • Purchase price: ~USD 30,000
  • 3-year electricity (700W at USD 0.10/kWh over ~26,280 hours): ~USD 1,840
  • 3-year hosting/cooling: ~USD 5,000
  • Total 3-year cost: ~USD 36,840
  • Hours in 3 years at 90% utilization: 23,652

\text{Effective cost} = \frac{36{,}840}{23{,}652} \approx \$1.56/\text{hr}

This undercuts even the cheapest cloud reserved pricing, and you own the hardware. But on-premises requires upfront capital, operations expertise, and bears the risk of hardware depreciation (the H100 may be worth much less in 3 years when Blackwell-successor chips are available).

Effective $/GPU-Hour by Deployment Model (H100)

| Deployment model | $/GPU-hour | Notes |
|---|---|---|
| AWS on-demand | 12.29 | Most expensive |
| AWS reserved (1yr) | 8.44 | |
| GCP spot | 4.25 | Variable availability |
| Lambda on-demand | 2.49 | |
| CoreWeave reserved | 1.79 | |
| On-premises (3yr) | 1.56 | Requires upfront capital |
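The effective-cost formula as a small calculator. Input figures are the rough H100 estimates used above; the function just carries out the arithmetic:

```python
def onprem_hourly_cost(purchase_usd: float, watts: float, usd_per_kwh: float,
                       hosting_usd: float, years: int = 3,
                       utilization: float = 0.9) -> float:
    """Effective $/GPU-hour = (purchase + opex) / (powered-on hours x utilization)."""
    hours = years * 365 * 24                          # ~26,280 hours over 3 years
    electricity = watts / 1000 * hours * usd_per_kwh  # kW x hours x $/kWh
    total = purchase_usd + electricity + hosting_usd
    return total / (hours * utilization)

# H100: ~$30K purchase, 700 W at $0.10/kWh, ~$5K hosting/cooling over 3 years
print(f"${onprem_hourly_cost(30_000, 700, 0.10, 5_000):.2f}/GPU-hour")
```

Varying the utilization parameter is instructive: at 50% utilization the effective cost nearly doubles, which is why on-premises only makes sense for sustained workloads.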

3. Tokens Per Second Per Dollar: The Optimization Stack

How Each Optimization Affects Unit Economics

The cost per token depends on how many tokens per second you extract from each dollar of GPU cost. Here is how each major optimization technique contributes:

Quantization (2-4x cost reduction). Reducing model precision from FP16 to INT4 cuts the model size by roughly 4x, so each token requires a quarter of the weight traffic — in the bandwidth-bound regime, up to 4x the throughput. On hardware with INT4/INT8 tensor cores (H100, Ada Lovelace), quantization also increases compute throughput.

\text{Speedup from quantization} \approx \frac{\text{Bits}_{\text{original}}}{\text{Bits}_{\text{quantized}}} = \frac{16}{4} = 4\text{x (theoretical maximum)}

In practice, quantization overhead (dequantization, scale multiplication) and mixed-precision attention reduce this to 2-3x.

Batching (5-10x cost reduction). As discussed in Section 1, decode throughput scales linearly with batch size until compute saturation. Going from batch size 1 to batch size 32 can increase throughput by 20-30x while increasing per-request latency only modestly (KV cache memory pressure and scheduling overhead add some per-token time).

FlashAttention (1.3-2x cost reduction). FlashAttention reduces attention memory and compute overhead, enabling larger batch sizes (more KV cache fits in VRAM) and faster prefill. The direct speedup is modest (1.3-1.5x), but the indirect speedup from enabling larger batches can be 2x+.

Speculative decoding (2-3x cost reduction for latency). Speculative decoding reduces wall-clock time per token by verifying multiple draft tokens at once. However, it does not improve throughput (total tokens/sec across all requests) — it improves latency (time per token for individual requests). The cost benefit comes from enabling lower per-request latency at the same throughput, allowing you to meet stricter SLOs without overprovisioning.

Cumulative Cost Impact of Optimization Stack (Llama 70B, H100)

| Configuration | Throughput (tok/s) | Latency (ms/tok) | Cost ($/M tok) | Cost Reduction |
|---|---|---|---|---|
| FP16, batch=1, naive attention | 34 | 29.4 | USD 20.12 | Baseline |
| + INT4 quantization (AWQ) | 78 | 12.8 | USD 8.77 | 2.3x |
| + FlashAttention v2 | 95 | 10.5 | USD 7.20 | 2.8x |
| + Continuous batching (bs=16) | 680 | 23.5 | USD 1.01 | 20x |
| + Continuous batching (bs=64) | 1,850 | 34.6 | USD 0.37 | 54x |
| + PagedAttention + optimized scheduling | 2,400 | 26.7 | USD 0.28 | 72x |

Note: H100 at USD 2.49/hr (Lambda pricing). Latency is average time per output token. Throughput is total tokens/sec across all concurrent requests.

The fully optimized stack is 72x cheaper per token than the naive baseline. The single biggest lever is batching: going from batch size 1 to batch size 64 accounts for roughly a 20x improvement on its own. Quantization provides another 2.3x. FlashAttention and PagedAttention contribute the remaining efficiency gains.

Batching Dominates Everything

If you take away one insight from this post, let it be this: batching is the most important cost optimization in LLM serving. It is more impactful than quantization, more impactful than hardware choice, and more impactful than every attention optimization combined. A well-batched FP16 system at batch size 64 is cheaper per token than a poorly-batched INT4 system at batch size 4.


4. Prefill vs. Decode Cost Asymmetry

Why Input Tokens Are Cheaper

If you have used a commercial LLM API, you have noticed that input tokens are priced lower than output tokens — typically 3-10x lower:

API Pricing: Input vs. Output Tokens (Early 2025)

| Provider/Model | Input ($/M tok) | Output ($/M tok) | Output/Input Ratio |
|---|---|---|---|
| OpenAI GPT-4o | USD 2.50 | USD 10.00 | 4.0x |
| OpenAI GPT-4o-mini | USD 0.15 | USD 0.60 | 4.0x |
| Anthropic Claude 3.5 Sonnet | USD 3.00 | USD 15.00 | 5.0x |
| Google Gemini 1.5 Pro | USD 1.25 | USD 5.00 | 4.0x |
| DeepSeek V3 | USD 0.07 | USD 0.28 | 4.0x |
| Groq (Llama 70B) | USD 0.59 | USD 0.79 | 1.3x |

Note: Prices as of early 2025. DeepSeek V3 pricing reflects their MoE architecture efficiency.

The pricing difference reflects the fundamental cost asymmetry between prefill and decode:

Prefill (processing input tokens) is compute-efficient. All input tokens are processed in parallel via large matrix multiplications. On an H100, Llama 70B prefill achieves ~4,000 tokens/sec — the tensor cores are well-utilized. The cost per input token is:

\text{Prefill cost/token} = \frac{\$2.49/\text{hr}}{4{,}000 \text{ tok/s} \times 3{,}600 \text{ s/hr}} = \$0.000000173 \approx \$0.17/\text{M tokens}

Decode (generating output tokens) is bandwidth-inefficient. Each output token requires reading the entire model from memory — the same weight traffic as processing hundreds of input tokens, but producing only one token. At batch size 32, Llama 70B decode achieves ~1,000 tokens/sec:

\text{Decode cost/token} = \frac{\$2.49/\text{hr}}{1{,}000 \text{ tok/s} \times 3{,}600 \text{ s/hr}} = \$0.000000692 \approx \$0.69/\text{M tokens}

The decode cost is 4x higher than prefill, explaining the typical API pricing ratio.

Implications for Prompt Design

This cost asymmetry has practical implications:

  1. Long prompts are cheap per token. A 10,000-token system prompt costs roughly USD 0.0017 per request (at USD 0.17/M tokens for prefill), the same cost as only ~2,500 output tokens (at USD 0.69/M tokens). Prompt length is rarely the dominant cost unless outputs are very short.

  2. Few-shot examples in prompts are cost-effective. Adding 10 examples at 200 tokens each (2,000 extra input tokens) costs ~USD 0.00034 but can significantly reduce the output length needed (the model gets the format right on the first try instead of generating verbose explanations).

  3. Reasoning models invert the ratio. A reasoning model generating 20,000 thinking tokens per query makes output tokens the dominant cost by 100x+. The input cost becomes completely negligible.
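These splits follow mechanically from the per-token prices. A sketch using the self-hosted figures derived above (USD 0.17/M prefill, USD 0.69/M decode) as defaults:

```python
def cost_split(tokens_in: int, tokens_out: int,
               in_price: float = 0.17, out_price: float = 0.69) -> tuple[float, float]:
    """Fraction of a request's cost spent on input vs. output tokens.
    Prices are in $/M tokens; defaults are the self-hosted H100 figures above."""
    cin = tokens_in * in_price
    cout = tokens_out * out_price
    return cin / (cin + cout), cout / (cin + cout)

for name, t_in, t_out in [("short Q&A", 100, 50), ("summarization", 5_000, 500),
                          ("code gen", 2_000, 2_000), ("reasoning", 1_000, 20_000)]:
    i, o = cost_split(t_in, t_out)
    print(f"{name}: input {i:.0%} / output {o:.0%}")
```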

Cost Breakdown: Input vs. Output (Typical API Request)

| Workload | Input share | Output share |
|---|---|---|
| Short Q&A (100 in, 50 out) | 33% | 67% |
| Summarization (5K in, 500 out) | 71% | 29% |
| Code gen (2K in, 2K out) | 20% | 80% |
| Reasoning (1K in, 20K out) | 1% | 99% |

(Shares computed with the typical 4x output/input price ratio.)

For summarization tasks (long input, short output), input tokens dominate the cost. For code generation and reasoning (moderate input, long output), output tokens dominate. Optimizing the expensive phase is what matters — there is no point in optimizing prefill for a reasoning workload.


5. The Batch Size Lever: Throughput vs. Latency

The Pareto Frontier

Batch size creates a fundamental tradeoff between throughput (tokens/sec for the whole system) and latency (time per token for each individual request). Larger batches improve throughput but increase latency because:

  1. Memory contention. More concurrent requests mean more KV cache in VRAM, reducing the memory available for other operations.
  2. Compute sharing. At very large batch sizes, the decode phase transitions from memory-bandwidth-bound to compute-bound, and adding more requests no longer increases throughput.
  3. Scheduling overhead. More requests mean more scheduling decisions, more preemption events, and more KV cache management overhead.
Batch Size vs. Throughput and Latency (Llama 70B AWQ, H100)

| Batch Size | Throughput (tok/s) | Avg Latency (ms/tok) | P99 Latency (ms/tok) | Cost ($/M tok) |
|---|---|---|---|---|
| 1 | 34 | 29 | 31 | USD 20.12 |
| 4 | 128 | 31 | 35 | USD 5.35 |
| 16 | 480 | 33 | 42 | USD 1.43 |
| 32 | 880 | 36 | 52 | USD 0.78 |
| 64 | 1,520 | 42 | 68 | USD 0.45 |
| 128 | 2,100 | 61 | 95 | USD 0.33 |
| 256 | 2,450 | 104 | 180 | USD 0.28 |

Note: H100 at USD 2.49/hr. Throughput is total system tokens/sec. Latency is per-request decode time per token. At batch 256, we approach compute saturation.

The Pareto frontier is clearly visible:

  • Batch 1-16: Throughput scales nearly linearly. Latency barely increases. This is the “free lunch” regime — you are simply utilizing idle hardware capacity.
  • Batch 16-64: Throughput still scales well but latency starts climbing. The GPU is becoming well-utilized.
  • Batch 64-128: Throughput gains slow. Latency increases substantially. KV cache pressure forces some requests to queue.
  • Batch 128-256: Throughput plateaus. Latency degrades significantly. Diminishing returns.
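Picking the operating point can be mechanized once you have a measured (batch size, throughput, latency, cost) table. The data below copies the benchmark table above; the selection rule is a sketch:

```python
# (batch, tok/s, avg ms/tok, $/M tok) from the benchmark table above
OPERATING_POINTS = [
    (1, 34, 29, 20.12), (4, 128, 31, 5.35), (16, 480, 33, 1.43),
    (32, 880, 36, 0.78), (64, 1_520, 42, 0.45),
    (128, 2_100, 61, 0.33), (256, 2_450, 104, 0.28),
]

def cheapest_under_slo(slo_ms_per_tok: float):
    """Cheapest measured operating point whose average latency meets the SLO."""
    feasible = [p for p in OPERATING_POINTS if p[2] <= slo_ms_per_tok]
    return min(feasible, key=lambda p: p[3]) if feasible else None

print(cheapest_under_slo(50))   # batch 64 for a real-time chat SLO
print(cheapest_under_slo(100))  # batch 128 for a near-real-time SLO
```

A production scheduler would use P99 latency rather than the average; swapping `p[2]` for the P99 column gives a more conservative choice.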

[Chart: Throughput-Latency Pareto Frontier (Llama 70B AWQ, H100) — throughput vs. batch size for the table above, annotated with BS=1 as low utilization, BS=64 as the sweet spot, and BS=256 as diminishing returns.]

Choosing the Operating Point

The optimal batch size depends on your SLO:

  • Real-time chat (latency target: under 50ms/tok): Batch size 32-64. Cost ~USD 0.45-0.78/M tokens.
  • Near-real-time (latency target: under 100ms/tok): Batch size 128. Cost ~USD 0.33/M tokens.
  • Batch processing (no latency constraint): Maximum batch size. Cost ~USD 0.28/M tokens.

The cost difference between real-time and batch processing is “only” 1.6-2.8x. Many operators choose the real-time operating point even for batch workloads because the throughput is still high enough and the system can also serve interactive requests.

💡 The Practical Sweet Spot

For most production deployments, batch size 32-64 with continuous batching offers the best balance: 80-90% of maximum throughput at 40-50% of maximum latency. This is where most production vLLM/SGLang deployments operate.


6. Build vs. Buy: Self-Hosting vs. API

The Break-Even Analysis

The most common economic question in LLM deployment is: should we self-host (run our own GPU infrastructure) or use an API (pay per token)?

The answer depends on volume. APIs have zero fixed cost but high marginal cost per token. Self-hosting has high fixed cost (GPU rental) but low marginal cost per token.

\text{API cost} = V \times P_{\text{api}}

\text{Self-host cost} = C_{\text{gpu}} \times H + C_{\text{ops}}

where V is token volume, P_api is the API price per token, C_gpu is the GPU hourly cost, H is hours of GPU usage, and C_ops is operations overhead (engineering time, monitoring, etc.).

The break-even point is where these are equal:

V_{\text{break-even}} = \frac{C_{\text{gpu}} \times H + C_{\text{ops}}}{P_{\text{api}} - P_{\text{self-host}}}

Let us compute this for a concrete scenario: Llama 70B class model, comparing GPT-4o API vs. self-hosted on an H100.

API cost (GPT-4o): USD 10.00/M output tokens, USD 2.50/M input tokens. Assuming 50/50 input/output: USD 6.25/M tokens average.

Self-host cost (H100 on Lambda): USD 2.49/hr, achieving 1,500 tok/s at batch 64. Cost per million tokens: USD 2.49 / (1,500 tok/s × 3,600 s/hr / 10^6) ≈ USD 0.46/M tokens. Plus engineering overhead: ~USD 5,000/month for a part-time ML engineer.

Build vs. Buy Break-Even Analysis

| Daily Volume | API Cost ($/month) | Self-Host ($/month) | Winner | Monthly Savings |
|---|---|---|---|---|
| 100K tok/day | ~USD 19 | USD 1,800 + USD 5,000 | API | USD 6,781 |
| 1M tok/day | ~USD 188 | USD 1,800 + USD 5,000 | API | USD 6,612 |
| 10M tok/day | ~USD 1,875 | USD 1,800 + USD 5,000 | API | USD 4,925 |
| 50M tok/day | ~USD 9,375 | USD 1,800 + USD 5,000 | Self-host | USD 2,575 |
| 200M tok/day | ~USD 37,500 | USD 3,600 + USD 5,000 | Self-host | USD 28,900 |

Note: API: GPT-4o blended pricing (USD 6.25/M). Self-host: H100 on Lambda; one GPU at 1,500 tok/s sustains up to ~130M tok/day, so a second GPU is added at 200M tok/day. Ops cost is a conservative estimate.

With these assumptions the break-even sits at roughly 35-40 million tokens per day, and the fixed cost is dominated by the USD 5,000/month ops overhead rather than the GPU itself. Below that volume, the API wins because the GPU sits mostly idle and the engineering overhead exceeds the API bill. Above it, self-hosting wins decisively: at 200M tokens/day, self-hosting saves roughly USD 29,000/month. If ops overhead is near zero (an existing infrastructure team), the break-even drops to roughly 10M tokens/day.
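The break-even formula from earlier, with the GPT-4o vs. self-hosted H100 numbers plugged in. A sketch; it treats monthly fixed cost (GPU rental plus ops) against the per-token price gap:

```python
def breakeven_mtok_per_day(fixed_monthly_usd: float, api_price: float,
                           selfhost_price: float) -> float:
    """Daily volume (millions of tokens) where API and self-host costs cross.
    Prices in $/M tokens; fixed cost covers GPU rental plus ops, per month."""
    return fixed_monthly_usd / (30 * (api_price - selfhost_price))

print(breakeven_mtok_per_day(1_800 + 5_000, 6.25, 0.46))  # ~39M tok/day
print(breakeven_mtok_per_day(1_800, 6.25, 0.46))          # ~10M tok/day with no ops overhead
```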

ℹ️ The Engineering Overhead Factor

The break-even calculation is extremely sensitive to the operations cost estimate. A well-run team might manage GPU infrastructure for USD 5K/month in engineering time. A less experienced team might spend USD 20K/month debugging CUDA errors, managing deployments, and handling incidents. If your operations cost is high, the break-even point shifts dramatically rightward: at USD 20K/month of ops overhead, you need well over 100M tokens/day to justify self-hosting.

Hidden Costs of Self-Hosting

The GPU hourly rate is only part of the self-hosting cost. Hidden costs include:

  1. Model optimization engineering. Getting from naive inference to optimized (quantized, batched, tuned) inference requires significant ML engineering effort. Budget 2-4 engineer-weeks for initial setup.

  2. Monitoring and alerting. GPU utilization, latency percentiles, error rates, KV cache pressure, token throughput — all need dashboards and alerts. This is ongoing operational work.

  3. Scaling and load balancing. As traffic grows, you need auto-scaling, request routing, and capacity planning. Cloud APIs handle this transparently.

  4. Model updates. When a new model version is released, self-hosting requires re-downloading, re-quantizing, re-benchmarking, and re-deploying. APIs update automatically.

  5. GPU availability risk. Cloud GPU capacity is not guaranteed. Spot instances can be preempted. On-demand instances may be unavailable during high-demand periods. APIs abstract away this risk.

When APIs Win Even at High Volume

There are scenarios where APIs are worth the premium even at volumes above the break-even:

  • Multi-model deployments. If you use 5 different models for different tasks, self-hosting requires 5 separate GPU deployments. APIs let you switch between models per request.
  • Rapid iteration. During product development, you may switch models frequently. Self-hosting each candidate model is expensive; API calls let you A/B test instantly.
  • Compliance requirements. Some API providers offer SOC 2, HIPAA, and PCI-DSS compliance out of the box. Achieving the same compliance for self-hosted infrastructure is expensive.
  • Burst capacity. If your workload has sharp peaks (10x normal traffic for 2 hours per day), APIs absorb the burst without provisioning idle GPUs for the remaining 22 hours. Self-hosting for peak capacity means paying for GPUs that sit idle 90% of the time.

The Hybrid Approach

The most sophisticated operators use a hybrid strategy: self-host a baseline capacity that handles steady-state traffic, and overflow to APIs during peaks. This captures the cost benefit of self-hosting for predictable volume while avoiding overprovisioning for spikes.

For example, if steady-state is 100M tokens/day and brief peaks reach 400M tokens/day (averaging ~10M tokens/day of overflow):

  • Self-host 1 H100 (~130M tok/day ceiling at batch 64): USD 1,800/month
  • Overflow ~10M tok/day average to API: ~USD 1,875/month (at USD 6.25/M blended)
  • Total: ~USD 3,675/month

Compared to:

  • Pure self-host provisioned for peak (4 H100s): USD 7,200/month (3 GPUs idle most of the time)
  • Pure API (~110M tok/day): ~USD 20,600/month

The hybrid approach saves ~49% vs. pure self-host and ~82% vs. pure API for this workload pattern.

💡 The 80/20 Rule for Hybrid Serving

Self-host enough capacity to handle 80% of your traffic (the predictable baseline). Route the remaining 20% (burst traffic, long-tail models, experimental features) to APIs. This typically achieves 70-80% of the cost savings of full self-hosting with much less operational complexity.


7. The Distillation Option

Smaller Models at a Fraction of the Cost

Model distillation — training a smaller model to mimic a larger one — is the most underappreciated cost optimization in LLM serving. A distilled 7B model that achieves 85% of a 70B model’s quality runs at 10x the throughput on the same hardware, reducing cost per token by 10x.

The economics are compelling:

📊

Distillation Cost Impact (H100, Batch Size 64)

ModelSizeThroughput (tok/s)Cost ($/M tok)Quality (vs. 70B)
Llama 70B FP16 140 GB 1,520 USD 0.46 100% (baseline)
Llama 70B AWQ-4bit 35 GB 2,800 USD 0.25 97%
Distilled 13B FP16 26 GB 5,200 USD 0.13 88%
Distilled 13B AWQ-4bit 7 GB 9,800 USD 0.07 85%
Distilled 7B AWQ-4bit 3.5 GB 14,500 USD 0.048 78%
Distilled 1.5B AWQ-4bit 0.8 GB 32,000 USD 0.022 60%
Note: Quality percentages are approximate and task-dependent. Distilled models trained on teacher outputs from the 70B model.

The distilled 13B model at 4-bit quantization achieves 85% of the 70B model’s quality at USD 0.07/M tokens — 6.6x cheaper than the 70B model and comparable quality for many tasks. For applications where “good enough” quality suffices (first-draft generation, simple Q&A, classification, extraction), this is a massive cost saving.

When Distillation Works

Distillation works best when:

  1. The task is well-defined. Distilling for “general chat” preserves less quality than distilling for “SQL generation” or “medical triage classification.” Specialized distillation retains 90-95% of teacher quality.

  2. The output format is constrained. JSON extraction, classification, structured summarization — these tasks have well-defined output spaces where smaller models can match larger ones closely.

  3. The input distribution is known. Distillation on data similar to production inputs works much better than distillation on generic data.

💡 The Distillation Recipe

The highest-ROI cost optimization for most production LLM deployments: First, collect 10K-50K production inputs. Second, generate outputs with the best available model (teacher). Third, fine-tune a 7B-13B model on the teacher outputs. Fourth, quantize to INT4. Fifth, deploy. Expected cost reduction: 5-15x with 80-90% quality retention for the specific task.

The Cascade Pattern

A sophisticated deployment uses multiple model tiers:

  1. Tier 1 (cheap, fast): Distilled 1.5B model handles 60% of requests — simple queries, classification, extraction. Cost: USD 0.02/M tokens.
  2. Tier 2 (moderate): Distilled 7B model handles 30% of requests — moderate complexity, basic reasoning. Cost: USD 0.05/M tokens.
  3. Tier 3 (expensive, powerful): Full 70B model handles 10% of requests — complex reasoning, creative tasks. Cost: USD 0.46/M tokens.

A confidence-based router sends each request to the cheapest tier that can handle it. If the tier 1 model’s confidence is below a threshold, the request escalates to tier 2, and so on.

The blended cost: 0.02 \times 0.6 + 0.05 \times 0.3 + 0.46 \times 0.1 = 0.012 + 0.015 + 0.046 = 0.073 USD per million tokens.

This is 6.3x cheaper than using the 70B model for everything, with minimal quality degradation (the hard queries still go to the best model).
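The blended-cost arithmetic as a one-liner over the three-tier split above:

```python
def blended_cost(tiers: list[tuple[float, float]]) -> float:
    """Weighted cost of a cascade: tiers are (traffic fraction, $/M tokens)."""
    assert abs(sum(f for f, _ in tiers) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(f * c for f, c in tiers)

cascade = blended_cost([(0.6, 0.02), (0.3, 0.05), (0.1, 0.46)])
print(f"${cascade:.3f}/M tokens; {0.46 / cascade:.1f}x cheaper than 70B-only")
```

The same function answers sensitivity questions, e.g. how much the blended cost rises if the router escalates 20% of traffic to the top tier instead of 10%.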

Cascade Cost vs. Single-Model Deployment

| Deployment | Cost ($/M tok) | Quality |
|---|---|---|
| 70B for everything | 0.46 | 100% |
| 7B for everything | 0.05 | 78% |
| Cascade (3 tiers) | 0.073 | ~95% |
| API (GPT-4o) | 6.25 | 100%, at ~86x the cascade cost |

8. Cost Projections: Where Prices Are Heading

Hardware Trajectory

The cost of inference is driven by hardware economics, and the trajectory is strongly downward:

NVIDIA Blackwell (B100/B200, 2024-2025). Blackwell doubles HBM bandwidth over Hopper (8 TB/s vs. 3.35 TB/s for B200 vs. H100) and adds FP4 tensor cores. The combination yields ~3-4x better inference throughput per GPU for decode-bound workloads. At similar pricing to H100 (USD 2-3/hr), this translates to 3-4x lower cost per token.

FP4 quantization. Blackwell’s native FP4 support enables 4-bit inference without the quality loss of integer quantization. Early benchmarks show FP4 achieving quality within 1% of FP8, at half the memory and double the compute throughput. This is a 2x improvement over the INT4/FP8 that H100 supports.

AMD MI300X. AMD’s MI300X offers 192 GB HBM3 (2.4x the H100’s 80 GB) at competitive pricing (USD 1.50-2.50/hr). The larger memory enables bigger batch sizes without KV cache pressure, improving throughput by 30-50% for memory-constrained workloads. AMD’s software ecosystem (ROCm) is maturing but still lags CUDA.

Projected Cost Per Million Tokens (Llama 70B Class, Optimized Serving)

| Year | Primary Hardware | Key Improvement | Est. Cost ($/M tok) | vs. 2023 |
|---|---|---|---|---|
| 2023 | A100 80GB | Baseline | USD 1.20 | 1.0x |
| 2024 | H100 80GB | 2x BW, FP8 | USD 0.45 | 2.7x cheaper |
| 2025 | B200 / MI300X | 4x BW, FP4 | USD 0.15 | 8x cheaper |
| 2026 (est.) | B300 / MI400 | 8x BW, FP4+ | USD 0.06 | 20x cheaper |
| 2027 (est.) | Next-gen | Algorithmic + HW | USD 0.02 | 60x cheaper |

Note: Estimates assume similar $/GPU-hour pricing, with cost reductions coming from throughput improvements. Algorithmic improvements (MoE, distillation, better attention) compound with hardware gains.

Algorithmic Trajectory

Hardware improvements compound with algorithmic advances:

Mixture of Experts (MoE). Models like Mixtral 8x7B and DeepSeek V3 (671B total, ~37B active) achieve quality comparable to dense models of similar “active” size but with much faster inference (only the active parameters are read from memory). MoE reduces the effective model size by 3-8x for inference purposes.

Better attention mechanisms. Multi-Query Attention, Grouped-Query Attention, and sliding window attention all reduce KV cache size, enabling larger batch sizes. Future architectures may reduce attention cost further.

Structured pruning. Removing entire attention heads or FFN dimensions that contribute minimally to output quality. This reduces model size without the precision loss of quantization.

Speculative decoding improvements. As draft model quality improves and self-speculative architectures (Medusa, EAGLE) mature, speculative decoding will reduce per-token latency by 2-4x with minimal throughput overhead. This does not directly reduce cost per token in throughput-optimized deployments, but it allows meeting stricter latency SLOs without overprovisioning — an indirect cost saving.

KV cache compression. Techniques like quantized KV cache (FP8 or INT4 for cached keys and values), attention sink pruning (H2O), and sliding window eviction can reduce KV cache memory by 2-4x. This enables larger batch sizes on the same hardware, directly improving throughput and reducing cost per token.

The combined effect of hardware + algorithmic improvements suggests that the cost of a Llama-70B-quality inference will drop by roughly 10x every 2 years. By 2027, the cost per million tokens for a frontier-quality model may be under USD 0.02 — approaching the cost of a web search.

The Price Elasticity of LLM Usage

Understanding cost trajectories matters because LLM usage is highly price-elastic. Empirical data from API providers suggests that a 50% price reduction leads to a 3-4x increase in token volume. This is because cheaper tokens unlock use cases that were previously uneconomical:

  • At USD 10/M tokens: only high-value, human-in-the-loop tasks justify LLM usage.
  • At USD 1/M tokens: automated pipelines (RAG, extraction, classification) become viable.
  • At USD 0.10/M tokens: agent loops (10-50 LLM calls per task) become affordable.
  • At USD 0.01/M tokens: continuous LLM processing (every email, every document, every message) becomes practical.

Each 10x cost reduction opens an entirely new tier of applications, each consuming 10-100x more tokens than the previous tier. This is why total inference spending is growing even as unit costs fall — the demand curve is steep.

Token Volume Growth vs. Price Reduction (Industry Aggregate)

| Year  | Price (USD/M tok) | Price vs. 2023 | Relative Volume (2023 = 1x) |
|-------|-------------------|----------------|------------------------------|
| 2023  | 1.20              | Baseline       | 1x                           |
| 2024  | 0.45              | 2.7x cheaper   | 5.2x                         |
| 2025  | 0.15              | 8x cheaper     | 28x                          |
| 2026E | 0.06              | 20x cheaper    | 120x                         |
| 2027E | 0.02              | 60x cheaper    | 500x                         |
ℹ️ The Jevons Paradox

As inference gets cheaper, usage increases faster than cost decreases. Cheaper tokens enable new use cases (reasoning models, agent loops, RAG with massive context, continuous summarization) that consume 10-100x more tokens per task. Total spending on inference is likely to increase even as the unit cost plummets. The inference cost problem does not go away — it shapeshifts.


9. Putting It All Together: A Decision Framework

Step 1: Estimate Your Volume

  • Under 100K tokens/day: Use APIs. Do not think about infrastructure.
  • 100K-5M tokens/day: Use APIs but optimize prompts. Consider distillation for specific high-volume tasks.
  • 5M-50M tokens/day: Self-hosting likely wins. Start with managed GPU providers (Lambda, CoreWeave, RunPod).
  • Over 50M tokens/day: Definitely self-host. Consider on-premises for sustained workloads.
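The volume thresholds above can be captured as a simple routing helper. These cutoffs are the article's rules of thumb, not hard limits:

```python
# Sketch of the Step 1 volume thresholds as a decision helper.
# Cutoffs are rules of thumb from the text, not hard limits.

def deployment_strategy(tokens_per_day: int) -> str:
    if tokens_per_day < 100_000:
        return "API: do not think about infrastructure"
    if tokens_per_day < 5_000_000:
        return "API + prompt optimization; consider distillation for hot paths"
    if tokens_per_day < 50_000_000:
        return "Self-host on managed GPUs (e.g. Lambda, CoreWeave, RunPod)"
    return "Self-host; consider on-prem for sustained workloads"

print(deployment_strategy(2_000_000))
```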

Step 2: Choose Your Quality-Cost Point

  • Maximum quality needed: Use the largest model available. Accept the cost.
  • “Good enough” quality for a defined task: Distill a 7B-13B model. 5-15x cheaper.
  • Cascade: Route by difficulty. Best quality-cost tradeoff for mixed workloads.

Step 3: Optimize the Serving Stack

For self-hosting, the priority order of optimizations:

  1. Continuous batching (5-10x improvement). Use vLLM, SGLang, or TensorRT-LLM — not naive batch inference.
  2. Quantization (2-3x improvement). AWQ or GPTQ for GPU, GGUF for CPU. Start with INT4.
  3. Model selection (2-10x improvement). Use the smallest model that meets your quality bar. A distilled 13B is better economics than an undistilled 70B for most tasks.
  4. Hardware selection (2-5x improvement). Choose the right GPU for your workload. H100 for throughput, A100 for cost-efficiency, Apple Silicon for edge.
  5. Attention optimization (1.3-2x improvement). FlashAttention, PagedAttention, and KV cache compression.
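To first order these gains compound multiplicatively, though in practice they overlap (quantization shrinks the same memory traffic that batching amortizes), so treat the product as an upper bound. Even the conservative end of each range compounds dramatically:

```python
# Multiplying the low end of each optimization's stated range.
# Compounding is an idealization: real gains overlap, so treat
# the product as an optimistic upper bound.
from functools import reduce

low_end = {
    "continuous batching":    5.0,
    "quantization":           2.0,
    "model selection":        2.0,
    "hardware selection":     2.0,
    "attention optimization": 1.3,
}

total = reduce(lambda a, b: a * b, low_end.values())
print(f"Conservative compound improvement: {total:.0f}x")  # prints "...52x"
```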

Step 4: Monitor and Iterate

Track these metrics continuously:

  • Cost per million tokens (the bottom line)
  • GPU utilization (aim for over 70% during serving hours)
  • Latency percentiles (P50, P95, P99 — ensure SLOs are met)
  • Token volume (watch for growth that might cross a break-even threshold)
  • Quality metrics (accuracy on evaluation set, user satisfaction — ensure optimizations have not degraded quality)
📊 Summary: Cost Per Million Tokens Across Deployment Strategies

| Strategy                   | Setup Effort     | Ongoing Effort | Cost Range (USD/M tok) | Best For                          |
|----------------------------|------------------|----------------|------------------------|-----------------------------------|
| API (frontier model)       | Minutes          | None           | 2.00-15.00             | Prototyping, low volume           |
| API (budget model)         | Minutes          | None           | 0.15-0.60              | Medium volume, diverse tasks      |
| Self-host (70B, optimized) | Weeks            | Moderate       | 0.25-0.50              | High volume, quality-critical     |
| Self-host (distilled 7B)   | Weeks + training | Moderate       | 0.04-0.08              | High volume, defined task         |
| Cascade (multi-tier)       | Months           | Significant    | 0.05-0.10              | Very high volume, mixed tasks     |
| On-prem (owned HW)         | Months           | Significant    | 0.15-0.30              | Sustained ultra-high volume       |

Note: Cost ranges assume reasonable optimization effort. Actual costs vary significantly with workload characteristics, GPU pricing, and engineering efficiency.

The gap between the most expensive strategy (USD 15/M tokens for frontier API) and the cheapest (USD 0.04/M tokens for self-hosted distilled model) is 375x. This is not a rounding error — it is the difference between a viable product and an unaffordable one. Understanding inference economics is not optional for anyone building on LLMs.

Step 5: Plan for Cost Evolution

The cost landscape changes rapidly. Decisions that are optimal today may be suboptimal in 6 months. Build your architecture for flexibility:

  • Abstract the model layer. Use an internal API that can route to self-hosted or external models. When prices change or new models launch, you can switch without modifying application code.
  • Re-evaluate quarterly. GPU pricing, model quality, and API pricing all shift frequently. A quarterly review of your cost model ensures you capture savings.
  • Avoid long-term GPU commitments unless volume is stable. One-year reserved instances save 30-40% but lock you in. If a new GPU generation launches mid-commitment, you are stuck on outdated hardware. Prefer 3-month commitments or spot instances for variable workloads.
  • Track cost per useful token. Not all generated tokens are useful. Reasoning traces, rejected speculative tokens, and retried generations all cost money but produce no user-facing value. The metric that matters is cost per useful output token delivered to the user.
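A minimal sketch of the "abstract the model layer" advice: application code calls one `complete()` entry point, and routing lives in a single registry. All class and backend names here are hypothetical, not a real library's API:

```python
# Hypothetical model-layer abstraction: application code depends only on
# ModelRouter.complete(); swapping providers is a registration change.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Backend:
    name: str
    cost_per_m_tokens: float        # USD, blended input/output (illustrative)
    generate: Callable[[str], str]  # provider-specific call goes here

class ModelRouter:
    def __init__(self):
        self.backends: dict = {}
        self.default: Optional[str] = None

    def register(self, backend: Backend, default: bool = False):
        self.backends[backend.name] = backend
        if default or self.default is None:
            self.default = backend.name

    def complete(self, prompt: str, backend: Optional[str] = None) -> str:
        return self.backends[backend or self.default].generate(prompt)

# Application code never names a provider directly:
router = ModelRouter()
router.register(Backend("self-hosted-13b", 0.05, lambda p: f"[13b] {p}"))
router.register(Backend("frontier-api", 10.0, lambda p: f"[api] {p}"))
print(router.complete("hello"))                          # default backend
print(router.complete("hard task", backend="frontier-api"))
```

When a new model or price point appears, only the registrations change; every call site keeps working.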
⚠️ The Hidden Cost of Prompt Bloat

As applications evolve, system prompts tend to grow — more instructions, more examples, more context. A system prompt that grew from 500 tokens to 5,000 tokens over 6 months adds USD 0.00075 per request in prefill cost. At 1 million requests per day, that is USD 22,500 per month in hidden cost from prompt creep alone. Audit your prompts regularly.
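The arithmetic above can be checked directly. The input-token price used here (~USD 0.167/M) is an assumption implied by the article's USD 0.00075 per request for 4,500 extra prefill tokens:

```python
# Checking the prompt-creep arithmetic. The ~USD 0.167/M input price is
# an assumption implied by the article's $0.00075-per-request figure.

def prompt_creep_cost(extra_tokens, price_per_m_usd, requests_per_day, days=30):
    per_request = extra_tokens * price_per_m_usd / 1e6
    return per_request, per_request * requests_per_day * days

per_req, monthly = prompt_creep_cost(4_500, 1 / 6, 1_000_000)
print(f"${per_req:.5f} per request, ${monthly:,.0f} per month")
```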


10. Conclusion

The economics of LLM inference are dominated by a few key relationships:

  1. Batching is king. The single most impactful optimization, providing 5-30x cost reduction by amortizing the memory bandwidth cost of weight loading across multiple requests.

  2. Output tokens cost 3-5x more than input tokens. Optimizing output length (through better prompting, constrained decoding, or distillation) has outsized cost impact.

  3. The build-vs-buy crossover is around 3-5M tokens per day. Below this, APIs win on total cost of ownership. Above this, self-hosting wins — and the margin widens rapidly with volume.

  4. Distillation is the most underappreciated lever. A task-specific 7B model at USD 0.05/M tokens can replace a general 70B model at USD 0.46/M tokens for the vast majority of production workloads.

  5. Costs are falling 10x every 2 years. Hardware improvements (HBM bandwidth, FP4 tensor cores) compound with algorithmic improvements (MoE, better attention, KV cache compression) to drive relentless cost reduction.

The cost of LLM inference is dropping rapidly, but the demands on inference are growing even faster. Reasoning models that generate 50,000 tokens per query, agent loops that make dozens of LLM calls per task, and always-on processing pipelines that consume tokens continuously — all of these push total spending upward even as unit costs plummet. The winners will be those who understand the math well enough to operate on the efficient frontier — squeezing maximum quality from every dollar of compute.