Everyone talks about LLM inference performance in tokens per second. Nobody talks about the number that actually determines whether your product is viable: tokens per dollar. A system that generates 1,000 tokens per second at USD 10/hour costs USD 2.78 per million tokens. A system that generates 200 tokens per second at USD 1/hour costs USD 1.39 per million tokens. The slower system is cheaper by half.
This disconnect between speed and cost is pervasive in the LLM serving world. Engineers optimize for throughput benchmarks while finance teams stare at cloud bills. The gap between “fast” and “cheap” grows wider as you add batching, quantization, and hardware choices to the equation. Understanding the real math of inference economics is essential for anyone deploying LLMs at scale.
This post builds the complete cost model from first principles: the fundamental equation linking hardware cost to token cost, GPU pricing across clouds and purchase models, how each major optimization technique affects unit economics, the asymmetry between input and output token costs, the throughput-latency tradeoff frontier, when to self-host vs. use APIs, distillation as a cost lever, and where costs are heading with next-generation hardware.
1. The Fundamental Equation
Cost Per Token
Every inference cost ultimately reduces to one equation:

$$\text{cost per token} = \frac{\text{hardware cost per hour}}{\text{tokens generated per hour}}$$
This is a fraction with two levers: reduce the numerator (cheaper hardware) or increase the denominator (more tokens per hour from the same hardware). Every optimization technique in the LLM inference stack attacks one or both of these.
Expanding the denominator:

$$\text{tokens per hour} = \text{tokens per second} \times 3{,}600$$

And tokens per second depends on the model, quantization, batch size, hardware, and sequence length:

$$\text{tokens per second} = f(\text{model}, \text{quantization}, \text{batch size}, \text{hardware}, \text{sequence length})$$
There is no simple closed-form expression for this function — it depends on whether you are in the compute-bound or memory-bandwidth-bound regime, the efficiency of the serving software, and the workload characteristics. But we can derive useful approximations.
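Before deriving those approximations, the fundamental equation itself is worth making concrete. A minimal sketch, using the two hypothetical systems from the introduction:

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_sec: float) -> float:
    """Cost per million tokens = hourly hardware cost / tokens produced per hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1e6

# The two systems from the introduction:
fast = cost_per_million_tokens(10.00, 1000)  # ~USD 2.78/M tokens
slow = cost_per_million_tokens(1.00, 200)    # ~USD 1.39/M tokens
print(f"fast: ${fast:.2f}/M  slow: ${slow:.2f}/M")
```

Everything that follows in this post is, one way or another, an argument about which inputs to this function you can move and by how much.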
The Bandwidth-Bound Approximation
For decode (single-token generation), the throughput at batch size 1 is approximately:

$$\text{tokens/sec} \approx \frac{\text{memory bandwidth (bytes/s)}}{\text{model size (bytes)}}$$

More precisely, for batch size $B$, each decode step reads the full weights once plus the KV cache of every request in the batch:

$$\text{tokens/sec} \approx \frac{B \times \text{memory bandwidth}}{\text{model bytes} + B \times \text{KV cache bytes per request}}$$

For small batch sizes where the model weight read dominates, this simplifies to:

$$\text{tokens/sec} \approx \frac{B \times \text{memory bandwidth}}{\text{model bytes}}$$
The throughput scales linearly with batch size until you hit the compute bound (tensor core saturation) or memory capacity limit (KV cache fills up VRAM).
Decode throughput scales linearly with batch size in the memory-bandwidth-bound regime. Doubling the batch size doubles the throughput (tokens per second for the whole batch) without changing per-request latency, until you saturate compute or run out of VRAM. This is why batching is the single most important cost optimization.
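This scaling is easy to sketch numerically. The figures below are illustrative assumptions, not measurements: 3.35 TB/s of H100 HBM bandwidth, 140 GB of FP16 Llama 70B weights, and a nominal 0.5 GB KV cache read per request per step (roughly a 1,500-token context):

```python
def decode_tok_per_sec(batch: int,
                       bw_bytes_per_sec: float = 3.35e12,  # H100 HBM3 bandwidth
                       weight_bytes: float = 140e9,        # Llama 70B in FP16
                       kv_bytes_per_req: float = 0.5e9) -> float:
    """Bandwidth-bound decode: one weight read per step, plus KV cache per request."""
    step_time = (weight_bytes + batch * kv_bytes_per_req) / bw_bytes_per_sec
    return batch / step_time

for b in (1, 8, 64):
    print(f"batch {b:>3}: {decode_tok_per_sec(b):,.0f} tok/s")
```

Throughput grows nearly linearly while the weight read dominates, then bends over as KV cache traffic catches up — the toy model reproduces the qualitative shape, not calibrated numbers.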
The Compute-Bound Approximation
For prefill (processing input tokens), at large batch sizes the throughput is approximately:

$$\text{tokens/sec} \approx \frac{\text{peak FLOPS}}{2N}$$

where $N$ is the model parameter count and the factor of 2 accounts for the multiply-add per parameter. This is independent of batch size once the GEMMs are large enough to saturate the tensor cores (typically once the batch dimension reaches a few hundred).
For Llama 70B on an H100 (989 TFLOPS FP16):

$$\text{tokens/sec} \approx \frac{989 \times 10^{12}}{2 \times 70 \times 10^{9}} \approx 7{,}100$$
This is the theoretical peak. In practice, FlashAttention overhead, memory access patterns, and kernel launch latency reduce this to ~4,000-5,000 tok/s.
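The peak-throughput arithmetic above, as a two-line sketch:

```python
PEAK_FLOPS_FP16 = 989e12  # H100 dense FP16 tensor-core peak
PARAMS = 70e9             # Llama 70B parameter count

# 2 FLOPs (one multiply + one add) per parameter per token
prefill_peak = PEAK_FLOPS_FP16 / (2 * PARAMS)
print(f"theoretical prefill peak: {prefill_peak:,.0f} tok/s")
```

The ~7,000 tok/s result is a ceiling; the 20-30% haircut from attention overhead and kernel launches is what separates the theoretical number from the observed 4,000-5,000 tok/s.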
2. GPU Economics: The Hardware Cost
Cloud GPU Pricing
GPU pricing varies dramatically across providers, instance types, and commitment levels. Here is the landscape as of early 2025:
Cloud GPU Pricing Comparison (Early 2025)
| GPU | Provider | Instance | On-Demand ($/hr) | Spot ($/hr) | 1yr Reserved ($/hr) |
|---|---|---|---|---|---|
| A100 80GB | AWS | p4d.24xlarge (8 GPU) | USD 32.77 (USD 4.10/GPU) | USD 12-18 | USD 22.40 (USD 2.80/GPU) |
| A100 80GB | GCP | a2-highgpu-1g (1 GPU) | USD 3.67 | USD 1.10-1.80 | USD 2.56 |
| A100 80GB | Lambda | 1x A100 | USD 1.10 | N/A | USD 0.85 |
| H100 80GB | AWS | p5.48xlarge (8 GPU) | USD 98.32 (USD 12.29/GPU) | USD 35-50 | USD 67.50 (USD 8.44/GPU) |
| H100 80GB | GCP | a3-highgpu-1g (1 GPU) | USD 11.54 | USD 3.50-5.00 | USD 8.08 |
| H100 80GB | Lambda | 1x H100 | USD 2.49 | N/A | USD 1.89 |
| H100 80GB | CoreWeave | 1x H100 SXM | USD 2.23 | N/A | USD 1.79 |
| RTX 4090 | RunPod | 1x 4090 | USD 0.44 | USD 0.31 | N/A |
The pricing range for the same GPU (H100) spans from USD 2.23/hr (CoreWeave) to USD 12.29/hr (AWS on-demand). That is a 5.5x difference for identical hardware. The cloud provider choice alone can cut costs by 80%.
AWS and GCP often bundle GPUs into large instances (8 GPUs per instance). If you only need 1-2 GPUs, you pay for all 8. A p5.48xlarge at USD 98.32/hr costs USD 12.29/GPU, but if you only utilize 2 GPUs, your effective cost is USD 49.16/GPU. Smaller providers like Lambda and CoreWeave offer single-GPU instances, avoiding this waste.
On-Premises GPU Economics
For sustained workloads, purchasing GPUs can be dramatically cheaper than cloud rental:
For an H100:
- Purchase price: ~USD 30,000
- 3-year electricity (700W at 90% utilization, USD 0.10/kWh): ~USD 1,700
- 3-year hosting/cooling: ~USD 5,000
- Total 3-year cost: ~USD 36,700
- Hours in 3 years at 90% utilization: 23,652
- Effective cost: ~USD 1.55/GPU-hour
This is comparable to the cheapest cloud providers — and you own the hardware. But on-premises requires upfront capital, operations expertise, and bears the risk of hardware depreciation (the H100 may be worth much less in 3 years when Blackwell-successor chips are available).
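The TCO arithmetic, as a sketch using the assumptions above (all figures are estimates, not quotes):

```python
purchase = 30_000.0            # H100 purchase price (USD, estimate)
power_kw = 0.7                 # board power draw
kwh_price = 0.10               # USD per kWh (estimate)
hosting = 5_000.0              # 3-year hosting/cooling estimate
hours = 3 * 365 * 24 * 0.90    # 3 years at 90% utilization

electricity = power_kw * hours * kwh_price
total = purchase + electricity + hosting
print(f"hours={hours:,.0f}  electricity=${electricity:,.0f}  "
      f"total=${total:,.0f}  effective=${total / hours:.2f}/GPU-hour")
```

Note how little electricity matters at this power draw and price: the purchase price dominates the 3-year total, which is why depreciation risk is the real on-prem concern.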
Effective $/GPU-Hour by Deployment Model (H100)
3. Tokens Per Second Per Dollar: The Optimization Stack
How Each Optimization Affects Unit Economics
The cost per token depends on how many tokens per second you extract from each dollar of GPU cost. Here is how each major optimization technique contributes:
Quantization (2-4x cost reduction). Reducing model precision from FP16 to INT4 roughly halves the model size, doubling the effective memory bandwidth and (in the bandwidth-bound regime) doubling throughput. On hardware with INT4/INT8 tensor cores (H100, Ada Lovelace), quantization also increases compute throughput.
In practice, quantization overhead (dequantization, scale multiplication) and mixed-precision attention reduce this to 2-3x.
Batching (5-10x cost reduction). As discussed in Section 1, decode throughput scales linearly with batch size until compute saturation. Going from batch size 1 to batch size 32 can increase throughput by 20-30x while only increasing per-request latency by 2-3x (from KV cache memory pressure and scheduling overhead).
FlashAttention (1.3-2x cost reduction). FlashAttention reduces attention memory and compute overhead, enabling larger batch sizes (more KV cache fits in VRAM) and faster prefill. The direct speedup is modest (1.3-1.5x), but the indirect speedup from enabling larger batches can be 2x+.
Speculative decoding (2-3x cost reduction for latency). Speculative decoding reduces wall-clock time per token by verifying multiple draft tokens at once. However, it does not improve throughput (total tokens/sec across all requests) — it improves latency (time per token for individual requests). The cost benefit comes from enabling lower per-request latency at the same throughput, allowing you to meet stricter SLOs without overprovisioning.
Cumulative Cost Impact of Optimization Stack (Llama 70B, H100)
| Configuration | Throughput (tok/s) | Latency (ms/tok) | Cost ($/M tok) | Cost Reduction |
|---|---|---|---|---|
| FP16, batch=1, naive attention | 34 | 29.4 | USD 20.12 | Baseline |
| + INT4 quantization (AWQ) | 78 | 12.8 | USD 8.77 | 2.3x |
| + FlashAttention v2 | 95 | 10.5 | USD 7.20 | 2.8x |
| + Continuous batching (bs=16) | 680 | 23.5 | USD 1.01 | 20x |
| + Continuous batching (bs=64) | 1,850 | 34.6 | USD 0.37 | 54x |
| + PagedAttention + optimized scheduling | 2,400 | 26.7 | USD 0.28 | 72x |
The fully optimized stack is 72x cheaper per token than the naive baseline. The single biggest lever is batching: going from batch size 1 to batch size 64 accounts for a 23x improvement. Quantization provides another 2.3x. FlashAttention and PagedAttention contribute the remaining efficiency gains.
If you take away one insight from this post, let it be this: batching is the most important cost optimization in LLM serving. It is more impactful than quantization, more impactful than hardware choice, and more impactful than every attention optimization combined. A well-batched FP16 system at batch size 64 is cheaper per token than a poorly-batched INT4 system at batch size 4.
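The closing claim is checkable with the bandwidth-bound model from Section 1. A toy comparison (ignoring KV cache traffic for simplicity; weight sizes assumed at 140 GB for FP16 and 35 GB for INT4):

```python
BW = 3.35e12  # H100 memory bandwidth, bytes/sec

def decode_tps(batch: int, weight_bytes: float) -> float:
    """Weight-read-dominated decode throughput (KV cache traffic ignored)."""
    return batch * BW / weight_bytes

fp16_bs64 = decode_tps(64, 140e9)  # well-batched FP16 70B
int4_bs4 = decode_tps(4, 35e9)     # poorly-batched INT4 70B
print(f"FP16 bs=64: {fp16_bs64:,.0f} tok/s  vs  INT4 bs=4: {int4_bs4:,.0f} tok/s")
```

Quantization buys a 4x smaller weight read; batching multiplies the tokens produced per weight read by 64. On the same hardware dollar, the well-batched FP16 system wins by roughly 4x in this toy model.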
4. Prefill vs. Decode Cost Asymmetry
Why Input Tokens Are Cheaper
If you have used a commercial LLM API, you have noticed that input tokens are priced lower than output tokens — typically 3-5x lower:
API Pricing: Input vs. Output Tokens (Early 2025)
| Provider/Model | Input ($/M tok) | Output ($/M tok) | Output/Input Ratio |
|---|---|---|---|
| OpenAI GPT-4o | USD 2.50 | USD 10.00 | 4.0x |
| OpenAI GPT-4o-mini | USD 0.15 | USD 0.60 | 4.0x |
| Anthropic Claude 3.5 Sonnet | USD 3.00 | USD 15.00 | 5.0x |
| Google Gemini 1.5 Pro | USD 1.25 | USD 5.00 | 4.0x |
| DeepSeek V3 | USD 0.07 | USD 0.28 | 4.0x |
| Groq (Llama 70B) | USD 0.59 | USD 0.79 | 1.3x |
The pricing difference reflects the fundamental cost asymmetry between prefill and decode:
Prefill (processing input tokens) is compute-efficient. All input tokens are processed in parallel via large matrix multiplications. On an H100, Llama 70B prefill achieves ~4,000 tokens/sec — the tensor cores are well-utilized. At USD 2.49/hr, the cost per input token is:

$$\frac{\text{USD } 2.49/\text{hr}}{4{,}000 \text{ tok/s} \times 3{,}600 \text{ s/hr}} \approx \text{USD } 0.17 \text{ per million tokens}$$
Decode (generating output tokens) is bandwidth-inefficient. Each output token requires reading the entire model from memory — the same weight traffic as processing hundreds of input tokens, but producing only one token. At batch size 32, Llama 70B decode achieves ~1,000 tokens/sec:

$$\frac{\text{USD } 2.49/\text{hr}}{1{,}000 \text{ tok/s} \times 3{,}600 \text{ s/hr}} \approx \text{USD } 0.69 \text{ per million tokens}$$
The decode cost is 4x higher than prefill, explaining the typical API pricing ratio.
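The asymmetry in code, reusing the same hourly rate and throughputs:

```python
GPU_USD_PER_HOUR = 2.49  # Lambda H100 on-demand rate

def usd_per_million(tok_per_sec: float) -> float:
    """Convert a sustained throughput into a cost per million tokens."""
    return GPU_USD_PER_HOUR / (tok_per_sec * 3600) * 1e6

prefill = usd_per_million(4000)  # input tokens, compute-bound
decode = usd_per_million(1000)   # output tokens, bandwidth-bound
print(f"prefill ${prefill:.2f}/M, decode ${decode:.2f}/M, "
      f"ratio {decode / prefill:.1f}x")
```

The 4x ratio falls straight out of the throughput ratio — pricing pages are, in effect, publishing their prefill-to-decode throughput gap.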
Implications for Prompt Design
This cost asymmetry has practical implications:
- Long prompts are relatively cheap. A 10,000-token system prompt costs roughly USD 0.0017 per request (at USD 0.17/M tokens for prefill), about the same as generating just 2,500 output tokens (at USD 0.69/M tokens).
- Few-shot examples in prompts are cost-effective. Adding 10 examples at 200 tokens each (2,000 extra input tokens) costs ~USD 0.00034 but can significantly reduce the output length needed (the model gets the format right on the first try instead of generating verbose explanations).
- Reasoning models invert the ratio. A reasoning model generating 20,000 thinking tokens per query makes output tokens the dominant cost by 100x+. The input cost becomes completely negligible.
Cost Breakdown: Input vs. Output (Typical API Request)
For summarization tasks (long input, short output), input tokens dominate the cost. For code generation and reasoning (moderate input, long output), output tokens dominate. Optimizing the expensive phase is what matters — there is no point in optimizing prefill for a reasoning workload.
5. The Batch Size Lever: Throughput vs. Latency
The Pareto Frontier
Batch size creates a fundamental tradeoff between throughput (tokens/sec for the whole system) and latency (time per token for each individual request). Larger batches improve throughput but increase latency because:
- Memory contention. More concurrent requests mean more KV cache in VRAM, reducing the memory available for other operations.
- Compute sharing. At very large batch sizes, the decode phase transitions from memory-bandwidth-bound to compute-bound, and adding more requests no longer increases throughput.
- Scheduling overhead. More requests mean more scheduling decisions, more preemption events, and more KV cache management overhead.
Batch Size vs. Throughput and Latency (Llama 70B AWQ, H100)
| Batch Size | Throughput (tok/s) | Avg Latency (ms/tok) | P99 Latency (ms/tok) | Cost ($/M tok) |
|---|---|---|---|---|
| 1 | 34 | 29 | 31 | USD 20.12 |
| 4 | 128 | 31 | 35 | USD 5.35 |
| 16 | 480 | 33 | 42 | USD 1.43 |
| 32 | 880 | 36 | 52 | USD 0.78 |
| 64 | 1,520 | 42 | 68 | USD 0.45 |
| 128 | 2,100 | 61 | 95 | USD 0.33 |
| 256 | 2,450 | 104 | 180 | USD 0.28 |
The Pareto frontier is clearly visible:
- Batch 1-16: Throughput scales nearly linearly. Latency barely increases. This is the “free lunch” regime — you are simply utilizing idle hardware capacity.
- Batch 16-64: Throughput still scales well but latency starts climbing. The GPU is becoming well-utilized.
- Batch 64-128: Throughput gains slow. Latency increases substantially. KV cache pressure forces some requests to queue.
- Batch 128-256: Throughput plateaus. Latency degrades significantly. Diminishing returns.
Throughput-Latency Pareto Frontier (Llama 70B AWQ, H100)
Choosing the Operating Point
The optimal batch size depends on your SLO:
- Real-time chat (latency target: under 50ms/tok): Batch size 32-64. Cost ~USD 0.45-0.78/M tokens.
- Near-real-time (latency target: under 100ms/tok): Batch size 128. Cost ~USD 0.33/M tokens.
- Batch processing (no latency constraint): Maximum batch size. Cost ~USD 0.28/M tokens.
The cost difference between real-time and batch processing is “only” 1.6-2.8x. Many operators choose the real-time operating point even for batch workloads because the throughput is still high enough and the system can also serve interactive requests.
For most production deployments, batch size 32-64 with continuous batching offers the best balance: roughly half of maximum throughput at around 40% of maximum per-token latency. This is where most production vLLM/SGLang deployments operate.
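Picking the operating point can be mechanized from a measured table like the one above. A sketch using the table's figures, where `avg_slo_ms` is a hypothetical average per-token latency budget:

```python
# (batch, tok/s, avg ms/tok, $/M tok) from the benchmark table above
OPERATING_POINTS = [
    (1, 34, 29, 20.12), (4, 128, 31, 5.35), (16, 480, 33, 1.43),
    (32, 880, 36, 0.78), (64, 1520, 42, 0.45),
    (128, 2100, 61, 0.33), (256, 2450, 104, 0.28),
]

def cheapest_within_slo(avg_slo_ms: float):
    """Cheapest measured operating point whose average latency fits the SLO."""
    feasible = [p for p in OPERATING_POINTS if p[2] <= avg_slo_ms]
    return min(feasible, key=lambda p: p[3]) if feasible else None

print(cheapest_within_slo(50))     # real-time chat
print(cheapest_within_slo(100))    # near-real-time
print(cheapest_within_slo(1e9))    # batch processing, no constraint
```

A production scheduler would use P99 rather than average latency and interpolate between measured points, but the structure — filter by SLO, then minimize cost — is the same.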
6. Build vs. Buy: Self-Hosting vs. API
The Break-Even Analysis
The most common economic question in LLM deployment is: should we self-host (run our own GPU infrastructure) or use an API (pay per token)?
The answer depends on volume. APIs have zero fixed cost but high marginal cost per token. Self-hosting has high fixed cost (GPU rental) but low marginal cost per token.
The two cost curves are:

$$\text{Cost}_{\text{API}} = p \cdot V \qquad \text{Cost}_{\text{self}} = c \cdot H + O$$

where $V$ is token volume, $p$ is API price per token, $c$ is GPU hourly cost, $H$ is hours of GPU usage, and $O$ is operations overhead (engineering time, monitoring, etc.).

The break-even point is where these are equal:

$$V^{*} = \frac{c \cdot H + O}{p}$$
Let us compute this for a concrete scenario: Llama 70B class model, comparing GPT-4o API vs. self-hosted on an H100.
API cost (GPT-4o): USD 10.00/M output tokens, USD 2.50/M input tokens. Assuming 50/50 input/output: USD 6.25/M tokens average.
Self-host cost (H100 on Lambda): USD 2.49/hr, achieving 1,500 tok/s at batch 64, i.e. 5.4M tokens per hour. Cost per million tokens: USD 2.49 / 5.4 = USD 0.46/M tokens. Plus engineering overhead: ~USD 5,000/month for a part-time ML engineer.
Build vs. Buy Break-Even Analysis
| Daily Volume | API Cost ($/month) | Self-Host ($/month) | Winner | Savings |
|---|---|---|---|---|
| 100K tok/day | USD 563 | USD 1,800 + USD 5,000 | API | API saves USD 6,237 |
| 1M tok/day | USD 5,625 | USD 1,800 + USD 5,000 | API | API saves USD 1,175 |
| 5M tok/day | USD 28,125 | USD 1,800 + USD 5,000 | Self-host | Self-host saves USD 21,325 |
| 10M tok/day | USD 56,250 | USD 1,800 + USD 5,000 | Self-host | Self-host saves USD 49,450 |
| 50M tok/day | USD 281,250 | USD 2,700 + USD 5,000 | Self-host | Self-host saves USD 273,550 |
The break-even is roughly at 3-5 million tokens per day. Below that, the API wins because the GPU sits partially idle and you are paying for engineering overhead that exceeds the API cost. Above that, self-hosting wins decisively — at 50M tokens/day, self-hosting saves USD 273,550/month.
The break-even calculation is extremely sensitive to the operations cost estimate. A well-run team might manage GPU infrastructure for USD 5K/month in engineering time. A less experienced team might spend USD 20K/month debugging CUDA errors, managing deployments, and handling incidents. If your operations cost is high, the break-even point shifts dramatically rightward — you might need 20-30M tokens/day to justify self-hosting.
Hidden Costs of Self-Hosting
The GPU hourly rate is only part of the self-hosting cost. Hidden costs include:
- Model optimization engineering. Getting from naive inference to optimized (quantized, batched, tuned) inference requires significant ML engineering effort. Budget 2-4 engineer-weeks for initial setup.
- Monitoring and alerting. GPU utilization, latency percentiles, error rates, KV cache pressure, token throughput — all need dashboards and alerts. This is ongoing operational work.
- Scaling and load balancing. As traffic grows, you need auto-scaling, request routing, and capacity planning. Cloud APIs handle this transparently.
- Model updates. When a new model version is released, self-hosting requires re-downloading, re-quantizing, re-benchmarking, and re-deploying. APIs update automatically.
- GPU availability risk. Cloud GPU capacity is not guaranteed. Spot instances can be preempted. On-demand instances may be unavailable during high-demand periods. APIs abstract away this risk.
When APIs Win Even at High Volume
There are scenarios where APIs are worth the premium even at volumes above the break-even:
- Multi-model deployments. If you use 5 different models for different tasks, self-hosting requires 5 separate GPU deployments. APIs let you switch between models per request.
- Rapid iteration. During product development, you may switch models frequently. Self-hosting each candidate model is expensive; API calls let you A/B test instantly.
- Compliance requirements. Some API providers offer SOC 2, HIPAA, and PCI-DSS compliance out of the box. Achieving the same compliance for self-hosted infrastructure is expensive.
- Burst capacity. If your workload has sharp peaks (10x normal traffic for 2 hours per day), APIs absorb the burst without provisioning idle GPUs for the remaining 22 hours. Self-hosting for peak capacity means paying for GPUs that sit idle 90% of the time.
The Hybrid Approach
The most sophisticated operators use a hybrid strategy: self-host a baseline capacity that handles steady-state traffic, and overflow to APIs during peaks. This captures the cost benefit of self-hosting for predictable volume while avoiding overprovisioning for spikes.
For example, if steady-state is 10M tokens/day and peaks reach 40M tokens/day:
- Self-host 1 H100 provisioned for ~13M tok/day of delivered traffic (well below its theoretical batch-64 peak, which real bursty workloads rarely sustain): USD 1,800/month
- Overflow ~5M tok/day average to API during peaks: ~USD 940/month (at USD 6.25/M blended)
- Total: USD 2,740/month
Compared to:
- Pure self-host for peak (4 H100s): USD 7,200/month (3 GPUs idle most of the time)
- Pure API: USD 18,750/month
The hybrid approach saves 62% vs. pure self-host and 85% vs. pure API for this workload pattern.
Self-host enough capacity to handle 80% of your traffic (the predictable baseline). Route the remaining 20% (burst traffic, long-tail models, experimental features) to APIs. This typically achieves 70-80% of the cost savings of full self-hosting with much less operational complexity.
7. The Distillation Option
Smaller Models at a Fraction of the Cost
Model distillation — training a smaller model to mimic a larger one — is the most underappreciated cost optimization in LLM serving. A distilled 7B model that achieves 85% of a 70B model’s quality runs at 10x the throughput on the same hardware, reducing cost per token by 10x.
The economics are compelling:
Distillation Cost Impact (H100, Batch Size 64)
| Model | Size | Throughput (tok/s) | Cost ($/M tok) | Quality (vs. 70B) |
|---|---|---|---|---|
| Llama 70B FP16 | 140 GB | 1,520 | USD 0.46 | 100% (baseline) |
| Llama 70B AWQ-4bit | 35 GB | 2,800 | USD 0.25 | 97% |
| Distilled 13B FP16 | 26 GB | 5,200 | USD 0.13 | 88% |
| Distilled 13B AWQ-4bit | 7 GB | 9,800 | USD 0.07 | 85% |
| Distilled 7B AWQ-4bit | 3.5 GB | 14,500 | USD 0.048 | 78% |
| Distilled 1.5B AWQ-4bit | 0.8 GB | 32,000 | USD 0.022 | 60% |
The distilled 13B model at 4-bit quantization achieves 85% of the 70B model’s quality at USD 0.07/M tokens — 6.6x cheaper than the 70B model and comparable quality for many tasks. For applications where “good enough” quality suffices (first-draft generation, simple Q&A, classification, extraction), this is a massive cost saving.
When Distillation Works
Distillation works best when:
- The task is well-defined. Distilling for “general chat” preserves less quality than distilling for “SQL generation” or “medical triage classification.” Specialized distillation retains 90-95% of teacher quality.
- The output format is constrained. JSON extraction, classification, structured summarization — these tasks have well-defined output spaces where smaller models can match larger ones closely.
- The input distribution is known. Distillation on data similar to production inputs works much better than distillation on generic data.
The highest-ROI cost optimization for most production LLM deployments: First, collect 10K-50K production inputs. Second, generate outputs with the best available model (teacher). Third, fine-tune a 7B-13B model on the teacher outputs. Fourth, quantize to INT4. Fifth, deploy. Expected cost reduction: 5-15x with 80-90% quality retention for the specific task.
The Cascade Pattern
A sophisticated deployment uses multiple model tiers:
- Tier 1 (cheap, fast): Distilled 1.5B model handles 60% of requests — simple queries, classification, extraction. Cost: USD 0.02/M tokens.
- Tier 2 (moderate): Distilled 7B model handles 30% of requests — moderate complexity, basic reasoning. Cost: USD 0.05/M tokens.
- Tier 3 (expensive, powerful): Full 70B model handles 10% of requests — complex reasoning, creative tasks. Cost: USD 0.46/M tokens.
A confidence-based router sends each request to the cheapest tier that can handle it. If the tier 1 model’s confidence is below a threshold, the request escalates to tier 2, and so on.
The blended cost: 0.6 × USD 0.02 + 0.3 × USD 0.05 + 0.1 × USD 0.46 = USD 0.073 per million tokens.
This is 6.3x cheaper than using the 70B model for everything, with minimal quality degradation (the hard queries still go to the best model).
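The blended-cost arithmetic generalizes to any tier mix. A sketch with the three tiers above (this assumes requests in each tier consume similar token counts, which is an idealization):

```python
# (share of requests, cost in USD per M tokens) for each tier
TIERS = [
    (0.60, 0.02),  # tier 1: distilled 1.5B
    (0.30, 0.05),  # tier 2: distilled 7B
    (0.10, 0.46),  # tier 3: full 70B
]

blended = sum(share * cost for share, cost in TIERS)
baseline = 0.46  # 70B-for-everything
print(f"blended ${blended:.3f}/M vs ${baseline:.2f}/M "
      f"-> {baseline / blended:.1f}x cheaper")
```

In practice escalated requests also pay for the cheaper tiers they passed through, so the real blended cost sits slightly above this figure; the routing threshold controls that overhead.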
Cascade Cost vs. Single-Model Deployment
8. Cost Projections: Where Prices Are Heading
Hardware Trajectory
The cost of inference is driven by hardware economics, and the trajectory is strongly downward:
NVIDIA Blackwell (B100/B200, 2024-2025). Blackwell more than doubles HBM bandwidth over Hopper (8 TB/s on B200 vs. 3.35 TB/s on H100) and adds FP4 tensor cores. The combination yields ~3-4x better inference throughput per GPU for decode-bound workloads. At similar pricing to H100 (USD 2-3/hr), this translates to 3-4x lower cost per token.
FP4 quantization. Blackwell’s native FP4 support enables 4-bit inference without the quality loss of integer quantization. Early benchmarks show FP4 achieving quality within 1% of FP8, at half the memory and double the compute throughput. This is a 2x improvement over the INT4/FP8 that H100 supports.
AMD MI300X. AMD’s MI300X offers 192 GB HBM3 (2.4x the H100’s 80 GB) at competitive pricing (USD 1.50-2.50/hr). The larger memory enables bigger batch sizes without KV cache pressure, improving throughput by 30-50% for memory-constrained workloads. AMD’s software ecosystem (ROCm) is maturing but still lags CUDA.
Projected Cost Per Million Tokens (Llama 70B Class, Optimized Serving)
| Year | Primary Hardware | Key Improvement | Est. Cost ($/M tok) | vs. 2023 |
|---|---|---|---|---|
| 2023 | A100 80GB | Baseline | USD 1.20 | 1.0x |
| 2024 | H100 80GB | 2x BW, FP8 | USD 0.45 | 2.7x cheaper |
| 2025 | B200 / MI300X | 4x BW, FP4 | USD 0.15 | 8x cheaper |
| 2026 (est.) | B300 / MI400 | 8x BW, FP4+ | USD 0.06 | 20x cheaper |
| 2027 (est.) | Next-gen | Algorithmic + HW | USD 0.02 | 60x cheaper |
Algorithmic Trajectory
Hardware improvements compound with algorithmic advances:
Mixture of Experts (MoE). Models like Mixtral 8x7B and DeepSeek V3 (671B total, ~37B active) achieve quality comparable to dense models of similar “active” size but with much faster inference (only the active parameters are read from memory). MoE reduces the effective model size by 3-8x for inference purposes.
Better attention mechanisms. Multi-Query Attention, Grouped-Query Attention, and sliding window attention all reduce KV cache size, enabling larger batch sizes. Future architectures may reduce attention cost further.
Structured pruning. Removing entire attention heads or FFN dimensions that contribute minimally to output quality. This reduces model size without the precision loss of quantization.
Speculative decoding improvements. As draft model quality improves and self-speculative architectures (Medusa, EAGLE) mature, speculative decoding will reduce per-token latency by 2-4x with minimal throughput overhead. This does not directly reduce cost per token in throughput-optimized deployments, but it allows meeting stricter latency SLOs without overprovisioning — an indirect cost saving.
KV cache compression. Techniques like quantized KV cache (FP8 or INT4 for cached keys and values), attention sink pruning (H2O), and sliding window eviction can reduce KV cache memory by 2-4x. This enables larger batch sizes on the same hardware, directly improving throughput and reducing cost per token.
The combined effect of hardware + algorithmic improvements suggests that the cost of a Llama-70B-quality inference will drop by roughly 10x every 2 years. By 2027, the cost per million tokens for a frontier-quality model may be under USD 0.02 — approaching the cost of a web search.
The Price Elasticity of LLM Usage
Understanding cost trajectories matters because LLM usage is highly price-elastic. Empirical data from API providers suggests that a 50% price reduction leads to a 3-4x increase in token volume. This is because cheaper tokens unlock use cases that were previously uneconomical:
- At USD 10/M tokens: only high-value, human-in-the-loop tasks justify LLM usage.
- At USD 1/M tokens: automated pipelines (RAG, extraction, classification) become viable.
- At USD 0.10/M tokens: agent loops (10-50 LLM calls per task) become affordable.
- At USD 0.01/M tokens: continuous LLM processing (every email, every document, every message) becomes practical.
Each 10x cost reduction opens an entirely new tier of applications, each consuming 10-100x more tokens than the previous tier. This is why total inference spending is growing even as unit costs fall — the demand curve is steep.
Token Volume Growth vs. Price Reduction (Industry Aggregate)
As inference gets cheaper, usage increases faster than cost decreases. Cheaper tokens enable new use cases (reasoning models, agent loops, RAG with massive context, continuous summarization) that consume 10-100x more tokens per task. Total spending on inference is likely to increase even as the unit cost plummets. The inference cost problem does not go away — it shapeshifts.
9. Putting It All Together: A Decision Framework
Step 1: Estimate Your Volume
- Under 100K tokens/day: Use APIs. Do not think about infrastructure.
- 100K-5M tokens/day: Use APIs but optimize prompts. Consider distillation for specific high-volume tasks.
- 5M-50M tokens/day: Self-hosting likely wins. Start with managed GPU providers (Lambda, CoreWeave, RunPod).
- Over 50M tokens/day: Definitely self-host. Consider on-premises for sustained workloads.
Step 2: Choose Your Quality-Cost Point
- Maximum quality needed: Use the largest model available. Accept the cost.
- “Good enough” quality for a defined task: Distill a 7B-13B model. 5-15x cheaper.
- Cascade: Route by difficulty. Best quality-cost tradeoff for mixed workloads.
Step 3: Optimize the Serving Stack
For self-hosting, the priority order of optimizations:
- Continuous batching (5-10x improvement). Use vLLM, SGLang, or TensorRT-LLM — not naive batch inference.
- Quantization (2-3x improvement). AWQ or GPTQ for GPU, GGUF for CPU. Start with INT4.
- Model selection (2-10x improvement). Use the smallest model that meets your quality bar. A distilled 13B is better economics than an undistilled 70B for most tasks.
- Hardware selection (2-5x improvement). Choose the right GPU for your workload. H100 for throughput, A100 for cost-efficiency, Apple Silicon for edge.
- Attention optimization (1.3-2x improvement). FlashAttention, PagedAttention, and KV cache compression.
Step 4: Monitor and Iterate
Track these metrics continuously:
- Cost per million tokens (the bottom line)
- GPU utilization (aim for over 70% during serving hours)
- Latency percentiles (P50, P95, P99 — ensure SLOs are met)
- Token volume (watch for growth that might cross a break-even threshold)
- Quality metrics (accuracy on evaluation set, user satisfaction — ensure optimizations have not degraded quality)
Summary: Cost Per Million Tokens Across Deployment Strategies
| Strategy | Setup Effort | Ongoing Effort | Cost Range ($/M tok) | Best For |
|---|---|---|---|---|
| API (frontier model) | Minutes | None | USD 2.00-15.00 | Prototyping, low volume |
| API (budget model) | Minutes | None | USD 0.15-0.60 | Medium volume, diverse tasks |
| Self-host (70B, optimized) | Weeks | Moderate | USD 0.25-0.50 | High volume, quality-critical |
| Self-host (distilled 7B) | Weeks + training | Moderate | USD 0.04-0.08 | High volume, defined task |
| Cascade (multi-tier) | Months | Significant | USD 0.05-0.10 | Very high volume, mixed tasks |
| On-prem (owned HW) | Months | Significant | USD 0.15-0.30 | Sustained ultra-high volume |
The gap between the most expensive strategy (USD 15/M tokens for frontier API) and the cheapest (USD 0.04/M tokens for self-hosted distilled model) is 375x. This is not a rounding error — it is the difference between a viable product and an unaffordable one. Understanding inference economics is not optional for anyone building on LLMs.
Step 5: Plan for Cost Evolution
The cost landscape changes rapidly. Decisions that are optimal today may be suboptimal in 6 months. Build your architecture for flexibility:
- Abstract the model layer. Use an internal API that can route to self-hosted or external models. When prices change or new models launch, you can switch without modifying application code.
- Re-evaluate quarterly. GPU pricing, model quality, and API pricing all shift frequently. A quarterly review of your cost model ensures you capture savings.
- Avoid long-term GPU commitments unless volume is stable. One-year reserved instances save 30-40% but lock you in. If a new GPU generation launches mid-commitment, you are stuck on outdated hardware. Prefer 3-month commitments or spot instances for variable workloads.
- Track cost per useful token. Not all generated tokens are useful. Reasoning traces, rejected speculative tokens, and retried generations all cost money but produce no user-facing value. The metric that matters is cost per useful output token delivered to the user.
As applications evolve, system prompts tend to grow — more instructions, more examples, more context. A system prompt that grew from 500 tokens to 5,000 tokens over 6 months adds roughly USD 0.00077 per request in prefill cost (at USD 0.17/M tokens). At 1 million requests per day, that is about USD 23,000 per month in hidden cost from prompt creep alone. Audit your prompts regularly.
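Prompt creep is easy to quantify. A sketch using the prefill price from Section 4 and illustrative volumes:

```python
PREFILL_USD_PER_M = 0.17       # self-host prefill cost from Section 4
extra_tokens = 5000 - 500      # prompt growth over 6 months
requests_per_day = 1_000_000

cost_per_request = extra_tokens * PREFILL_USD_PER_M / 1e6
monthly = cost_per_request * requests_per_day * 30
print(f"${cost_per_request:.6f}/request -> ${monthly:,.0f}/month")
```

The same arithmetic run quarterly against your actual prompt lengths and request volume makes creep visible before it becomes a line item.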
10. Conclusion
The economics of LLM inference are dominated by a few key relationships:
- Batching is king. The single most impactful optimization, providing 5-30x cost reduction by amortizing the memory bandwidth cost of weight loading across multiple requests.
- Output tokens cost 3-5x more than input tokens. Optimizing output length (through better prompting, constrained decoding, or distillation) has outsized cost impact.
- The build-vs-buy crossover is around 3-5M tokens per day. Below this, APIs win on total cost of ownership. Above this, self-hosting wins — and the margin widens rapidly with volume.
- Distillation is the most underappreciated lever. A task-specific 7B model at USD 0.05/M tokens can replace a general 70B model at USD 0.46/M tokens for the vast majority of production workloads.
- Costs are falling 10x every 2 years. Hardware improvements (HBM bandwidth, FP4 tensor cores) compound with algorithmic improvements (MoE, better attention, KV cache compression) to drive relentless cost reduction.
The cost of LLM inference is dropping rapidly, but the demands on inference are growing even faster. Reasoning models that generate 50,000 tokens per query, agent loops that make dozens of LLM calls per task, and always-on processing pipelines that consume tokens continuously — all of these push total spending upward even as unit costs plummet. The winners will be those who understand the math well enough to operate on the efficient frontier — squeezing maximum quality from every dollar of compute.