xAI trained Grok-3 on 100,000 H100 GPUs running in a single cluster — reportedly the largest unified training run to date. The scale is striking: roughly 99 exaFLOPS of peak FP16 compute, an InfiniBand fabric with aggregate bandwidth measured in petabits per second, and power and cooling infrastructure drawing an estimated 150 MW. Grok-2 already approached GPT-4-level performance; Grok-3 aims to leapfrog it by training on real-time X/Twitter data that closed labs cannot access. When compute scale is your main advantage, you build the biggest cluster and run it at high utilization for months on end.
What Is Known
Confirmed Details
xAI has published limited technical details about Grok compared to Meta (Llama) or DeepSeek. What is confirmed:
- Grok-1: Open-sourced in March 2024. 314B total parameters, MoE architecture with 8 experts and 2 active per token.
- Grok-2: Closed model, significant quality improvement over Grok-1. Believed to be larger and trained on more data.
- Grok-3: Released in early 2025. Believed to be the model trained on the full Colossus cluster.
- Colossus cluster: Confirmed 100,000 H100 GPUs in Memphis, Tennessee.
- X/Twitter data: Confirmed that xAI uses X platform data for training.
Grok-1 Architecture (Open-Sourced)
Grok-1 is the only version with published architecture details:
GROK_1_CONFIG = {
    "architecture": "MoE (Mixture of Experts)",
    "total_params": "314B",
    "num_experts": 8,
    "active_experts": 2,  # Top-2 routing
    "d_model": 6144,
    "num_layers": 64,
    "num_attention_heads": 48,
    "num_kv_heads": 8,  # GQA
    "head_dim": 128,
    "vocab_size": 131072,  # 128K
    "max_position_embeddings": 8192,
    "activation": "GELU",  # Not SwiGLU (unusual choice)
    "routing": "Top-2 softmax",
}
def grok1_param_breakdown():
    """
    Parameter breakdown for Grok-1 (from the open-source release).
    """
    d = 6144
    d_ff = 49152  # widening factor 8 (per-expert FFN hidden size)
    L = 64
    E = 8
    Hq = 48
    Hkv = 8
    hd = 128
    # Attention per layer: Q, K, V, O projections
    attn = d * Hq * hd + d * Hkv * hd + d * Hkv * hd + Hq * hd * d
    # Expert FFN per layer (each expert)
    # Grok-1 uses a standard FFN (not SwiGLU): 2 matrices per expert
    expert_ffn = 2 * d * d_ff
    all_experts = expert_ffn * E
    # Router
    router = d * E
    # Per layer
    per_layer = attn + all_experts + router
    # Total (plus token embeddings)
    total = per_layer * L + 131072 * d
    # Active per token
    active_experts_ffn = expert_ffn * 2  # Top-2
    active_per_layer = attn + active_experts_ffn
    active_total = active_per_layer * L
    return {
        "total_params_B": total / 1e9,
        "active_params_B": active_total / 1e9,
        "attention_per_layer_M": attn / 1e6,
        "expert_ffn_per_layer_M": all_experts / 1e6,
        "activation_ratio": active_total / total,
    }
Grok-1 Architecture vs Contemporary Models
| Parameter | Value | Comparison |
|---|---|---|
| Total parameters | 314B | Mixtral: 47B, DeepSeek V3: 671B |
| Active parameters | ~86B | Mixtral: 13B, DeepSeek V3: 37B |
| Experts | 8 | Mixtral: 8, DeepSeek V3: 256 |
| Active experts | 2 (top-2) | Same as Mixtral |
| d_model | 6144 | Llama 70B: 8192 |
| Layers | 64 | Llama 70B: 80 |
| Attention | GQA (48 Q, 8 KV) | Llama 70B: GQA (64 Q, 8 KV) |
| Activation | GELU | Unusual: most use SwiGLU |
| Context | 8192 | Short: others support 128K+ |
Grok-1 was open-sourced in March 2024 and represents xAI’s early work. Grok-2 and Grok-3 are significantly more capable. However, Grok-1 is the only version with published architecture details. The analysis below of Grok-2/3 is based on community inference and limited public statements.
Grok-1 Deep Dive
GELU Instead of SwiGLU
Grok-1’s use of GELU activation instead of SwiGLU is notable. By 2024, SwiGLU was nearly universal among frontier models. The implications:
def gelu_vs_swiglu_analysis():
    """
    Compare GELU and SwiGLU parameter/FLOP cost for the expert FFN.
    """
    d_model = 6144
    d_ff = 49152
    # GELU FFN: 2 matrices (up + down)
    gelu_params = 2 * d_model * d_ff
    gelu_flops_per_token = 2 * gelu_params  # two matmuls, 2 FLOPs per weight
    # SwiGLU FFN: 3 matrices (gate + up + down)
    # To match the GELU param count, d_ff shrinks to 2/3 * 49152 = 32768
    swiglu_d_ff = d_ff * 2 // 3
    swiglu_params = 3 * d_model * swiglu_d_ff
    swiglu_flops_per_token = 2 * swiglu_params
    return {
        "gelu_params_M": gelu_params / 1e6,
        "swiglu_params_M": swiglu_params / 1e6,
        "gelu_flops_per_token_M": gelu_flops_per_token / 1e6,
        "swiglu_flops_per_token_M": swiglu_flops_per_token / 1e6,
        "quality_difference": "SwiGLU typically 1-3% better perplexity",
    }
The GELU choice suggests Grok-1 was developed quickly and may have been based on an earlier architecture template. Grok-2 and Grok-3 likely switched to SwiGLU.
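To make the difference concrete, here is a minimal forward-pass sketch of both FFN variants in plain Python. The layer sizes are toy values, not Grok-1's, and the random weights are purely illustrative:

```python
import math
import random

def gelu(x):
    # Exact GELU via the Gaussian CDF (erf form)
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W: out_dim x in_dim (list of rows), x: in_dim vector
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def ffn_gelu(x, w_up, w_down):
    # Grok-1-style FFN: down(GELU(up(x))), two weight matrices
    return matvec(w_down, [gelu(h) for h in matvec(w_up, x)])

def ffn_swiglu(x, w_gate, w_up, w_down):
    # SwiGLU FFN: down(SiLU(gate(x)) * up(x)), three weight matrices
    gate = [silu(h) for h in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

random.seed(0)
d_model, d_ff = 4, 6  # toy sizes

def rand_mat(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

x = [random.gauss(0.0, 1.0) for _ in range(d_model)]
y_gelu = ffn_gelu(x, rand_mat(d_ff, d_model), rand_mat(d_model, d_ff))
# SwiGLU at matched parameter count: hidden size shrinks to 2/3 * d_ff
d_ff_sw = 2 * d_ff // 3
y_swiglu = ffn_swiglu(x, rand_mat(d_ff_sw, d_model),
                      rand_mat(d_ff_sw, d_model),
                      rand_mat(d_model, d_ff_sw))
```

The 2/3 shrink is exactly the adjustment the parameter analysis above applies: SwiGLU's extra gate matrix is paid for by a narrower hidden dimension.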
MoE Configuration Analysis
Grok-1 uses a classic Mixtral-style MoE: 8 experts, top-2 routing. This is conservative compared to DeepSeek V3’s 256 fine-grained experts.
def grok1_moe_efficiency():
    """
    Analyze Grok-1's MoE efficiency vs alternatives.
    """
    configs = {
        "Grok-1": {
            "total_B": 314,
            "active_B": 86,
            "experts": 8,
            "top_k": 2,
            "activation_ratio": 86 / 314,  # 27.4%
        },
        "Mixtral 8x7B": {
            "total_B": 47,
            "active_B": 13,
            "experts": 8,
            "top_k": 2,
            "activation_ratio": 13 / 47,  # 27.7%
        },
        "DeepSeek V3": {
            "total_B": 671,
            "active_B": 37,
            "experts": 256,
            "top_k": 8,
            "activation_ratio": 37 / 671,  # 5.5%
        },
    }
    # Grok-1 and Mixtral share the same activation ratio (~27%).
    # DeepSeek V3 achieves much better parameter efficiency (5.5%).
    # Grok-1 activates 86B params per token, more than many dense models.
    return configs
Parameter Efficiency: Active vs Total (chart: activation ratio, %)
The Colossus Cluster
Hardware Scale
Colossus is the largest known single-site GPU cluster:
COLOSSUS_SPECS = {
    "gpus": 100000,  # H100 SXM5
    "gpu_memory_per_gpu_gb": 80,
    "total_gpu_memory_pb": 8.0,  # 8 petabytes
    "peak_fp16_pflops": 98900,  # ~98.9 exaFLOPS FP16
    "peak_fp8_pflops": 197800,  # ~197.8 exaFLOPS FP8
    "interconnect": "InfiniBand NDR (400 Gbps)",
    "location": "Memphis, Tennessee",
    "power": "~150 MW estimated",
    "cost": "~$4-6B estimated (GPUs + infrastructure)",
}
def colossus_training_capacity():
    """
    What Colossus can train in terms of model scale.
    """
    # FP16 training throughput
    gpu_count = 100000
    fp16_tflops_per_gpu = 989  # H100 peak
    # Assume 40% MFU (realistic for large-scale training)
    mfu = 0.40
    effective_tflops = gpu_count * fp16_tflops_per_gpu * mfu
    # = 39,560,000 TFLOPS = 39.56 exaFLOPS effective
    # Training a 1T-parameter model on 15T tokens
    # FLOPs = 6 * N * D (standard approximation, forward + backward)
    model_params = 1e12
    tokens = 15e12
    total_flops = 6 * model_params * tokens  # 9e25 FLOPs
    training_time_seconds = total_flops / (effective_tflops * 1e12)
    training_time_days = training_time_seconds / 86400
    return {
        "effective_exaflops": effective_tflops / 1e6,
        "1T_model_15T_tokens_days": training_time_days,
        "cost_per_day_usd": gpu_count * 2.0 * 24,  # $2/GPU-hour
    }
# Result: ~26 days to train a 1T parameter model on 15T tokens
# At ~$4.8M/day, total training cost: ~$125M
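As a sanity check, the headline numbers above can be reproduced in a few lines, under the same assumptions (40% MFU, $2 per GPU-hour, the 6·N·D compute approximation):

```python
GPUS = 100_000
PEAK_FLOPS = 989e12      # H100 FP16 peak, per GPU
MFU = 0.40               # assumed utilization
effective_flops = GPUS * PEAK_FLOPS * MFU   # FLOPs per second

train_flops = 6 * 1e12 * 15e12              # 1T params x 15T tokens
days = train_flops / effective_flops / 86_400   # ~26.3 days
cost_usd = days * GPUS * 2.0 * 24               # ~$126M
```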
GPU Cluster Comparison (Frontier Labs)
| Lab | Cluster | GPUs | Est. Peak FP16 PFLOPS | Notes |
|---|---|---|---|---|
| xAI | Colossus | 100K+ H100 | 99,000 | Largest single cluster |
| Meta | Multiple clusters | 600K+ H100 | 593,000 | Distributed across data centers |
| Google | TPU v5p pods | ~50K TPU v5p | ~23,000 (BF16) | Custom silicon |
| Microsoft/OpenAI | Azure clusters | ~100K+ H100 | ~99,000 | Shared with Azure |
| DeepSeek | Custom cluster | ~10K+ H800 | ~8,000 | Export-restricted hardware |
Single-Cluster Advantage
Colossus being a single cluster (not distributed across data centers) provides a significant advantage for large-model training:
def single_cluster_advantage():
    """
    Why a single large cluster beats multiple smaller clusters.
    """
    advantages = {
        "all_reduce_latency": {
            "single_cluster": "~50 us (InfiniBand within a building)",
            "multi_datacenter": "~10-50 ms (WAN latency)",
            "speedup": "200-1000x lower latency",
        },
        "bandwidth": {
            "single_cluster": "400 Gbps (InfiniBand NDR) per link",
            "multi_datacenter": "~100 Gbps (WAN) shared",
            "speedup": "4x+ per link, much more in aggregate",
        },
        "pipeline_parallelism": {
            "single_cluster": "All stages connected via IB",
            "multi_datacenter": "Cross-DC stages add ms-level latency per microbatch",
            "impact": "2-5x better pipeline efficiency",
        },
        "expert_parallelism": {
            "single_cluster": "All-to-all within the IB fabric",
            "multi_datacenter": "All-to-all across WAN is impractical",
            "impact": "MoE training at scale requires a single cluster",
        },
    }
    return advantages
Having 100K H100s in a single cluster with InfiniBand interconnect enables training configurations that are impossible with distributed clusters. In particular, MoE models with expert parallelism require low-latency all-to-all communication that breaks down over WAN connections. Colossus can train much larger MoE models with more experts than labs limited to smaller individual clusters, even if those labs have more total GPUs.
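A rough point-to-point cost model (fixed latency plus bytes over bandwidth) makes the gap concrete for a single pipeline-parallel activation transfer. The link speeds, latencies, and tensor shape below are illustrative assumptions, not measured Colossus figures:

```python
def transfer_seconds(message_bytes, bytes_per_second, latency_seconds):
    # One point-to-point send: fixed latency + serialization time
    return latency_seconds + message_bytes / bytes_per_second

# One microbatch of activations between pipeline stages:
# 4096 tokens x 8192 hidden x 2 bytes (BF16) ~= 67 MB
activation_bytes = 4096 * 8192 * 2

# Inside one InfiniBand NDR cluster: 400 Gbps links, ~50 us latency
ib_time = transfer_seconds(activation_bytes, 400e9 / 8, 50e-6)
# Across data centers: ~100 Gbps shared WAN, ~20 ms latency
wan_time = transfer_seconds(activation_bytes, 100e9 / 8, 20e-3)
ratio = wan_time / ib_time  # more than an order of magnitude per hop
```

And this per-hop penalty is paid on every microbatch; for MoE all-to-all, where each expert-parallel step involves many small messages, the fixed WAN latency dominates even more heavily.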
Inferred Grok-3 Architecture
What We Can Infer
Based on Colossus’s capability, Grok-1’s architecture, and competitive benchmarks, Grok-3 likely:
GROK_3_INFERRED_CONFIG = {
    "architecture": "MoE (fine-grained, likely inspired by DeepSeek V3)",
    "total_params": "1T-2T (estimated)",
    "active_params": "100B-200B per token",
    "num_experts": "64-256 (fine-grained)",
    "active_experts": "8-16",
    "d_model": "8192-12288",
    "num_layers": "80-128",
    "attention": "GQA or MLA",
    "activation": "SwiGLU (likely upgraded from GELU)",
    "context_length": "128K-1M",
    "training_tokens": "20T+ (given Colossus scale)",
    "training_precision": "FP8 or BF16",
}
Scale Estimation
Given Colossus can train a 1T model on 15T tokens in ~26 days, and xAI had months of training time available:
def estimate_grok3_training():
    """
    Estimate Grok-3 training parameters based on Colossus capability.
    """
    # Colossus: 100K H100, ~40% MFU
    effective_tflops = 100000 * 989 * 0.40  # TFLOPS
    # Assume 3 months of training (90 days)
    training_seconds = 90 * 86400
    total_flops = effective_tflops * 1e12 * training_seconds
    # = 39.56e6 * 1e12 * 7.776e6 = ~3.08e26 FLOPs
    # What model/data combinations are feasible?
    scenarios = {
        "1T params, 50T tokens": {
            "flops": 6 * 1e12 * 50e12,  # 3e26
            "feasible": True,
        },
        "2T params, 25T tokens": {
            "flops": 6 * 2e12 * 25e12,  # 3e26
            "feasible": True,
        },
        "500B params, 100T tokens": {
            "flops": 6 * 500e9 * 100e12,  # 3e26
            "feasible": True,
        },
    }
    return {
        "total_budget_flops": total_flops,
        "scenarios": scenarios,
    }
Feasible Grok-3 Configurations (90-Day Training on Colossus) (chart: % of 90-day Colossus budget used)
Real-Time Information Integration
X/Twitter Data Advantage
xAI has exclusive access to X/Twitter data — a massive stream of real-time human-generated text covering every topic, language, and perspective. This is a genuine competitive advantage.
def x_data_analysis():
    """
    Analyze the X/Twitter data advantage.
    """
    x_data = {
        "daily_posts": "~500M posts/day",
        "daily_tokens": "~50B tokens/day (estimated)",
        "annual_tokens": "~18T tokens/year",
        "unique_properties": [
            "Real-time (no crawl delay)",
            "Conversational (not just articles)",
            "Multilingual",
            "Covers breaking events as they happen",
            "Includes expert discussions and debates",
            "Contains code snippets and technical discussions",
        ],
        "challenges": [
            "High noise ratio (spam, low-quality posts)",
            "Short texts (posts are brief)",
            "Bias toward certain demographics and topics",
            "Offensive content requires careful filtering",
        ],
    }
    return x_data
RAG-Style Integration
Grok’s real-time information access is likely implemented through Retrieval-Augmented Generation (RAG) rather than continuous retraining:
class GrokRealtimeRAG:
    """
    Hypothesized real-time information integration for Grok.
    """

    def __init__(self, model, x_index):
        self.model = model
        self.x_index = x_index  # Real-time index of X posts

    def answer_with_realtime(self, query):
        """
        Augment model generation with real-time X data.
        """
        # Step 1: Retrieve relevant recent posts
        recent_posts = self.x_index.search(
            query=query,
            max_results=50,
            recency_hours=24,
            quality_filter=True,
        )
        # Step 2: Build augmented context
        context = self._format_retrieved_posts(recent_posts)
        # Step 3: Generate with the augmented context
        augmented_prompt = f"""
Recent information from X:
{context}

User question: {query}

Based on the above real-time information and your training knowledge,
provide an accurate and up-to-date answer.
"""
        response = self.model.generate(augmented_prompt)
        return response, recent_posts

    def _format_retrieved_posts(self, posts):
        """Format retrieved posts for model context."""
        formatted = []
        for post in posts:
            formatted.append(
                f"[@{post['author']} ({post['timestamp']})]: {post['text']}"
            )
        return "\n".join(formatted)
There are two approaches to keeping a model up-to-date: (1) continuous pre-training on new data, and (2) retrieval-augmented generation at inference time. Continuous pre-training is expensive and risks catastrophic forgetting. RAG is cheaper and more targeted but requires the model to correctly integrate retrieved information with its parametric knowledge. Grok likely uses RAG for real-time information, with periodic re-training for deeper knowledge updates.
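The retrieval side of this trade-off is cheap to illustrate: rank candidates by relevance damped with an exponential recency decay, so fresh posts surface without any weight updates. A toy sketch — the scoring scheme and half-life are hypothetical, not xAI's:

```python
def score(relevance, age_hours, half_life_hours=6.0):
    # Relevance damped by an exponential recency decay
    return relevance * 0.5 ** (age_hours / half_life_hours)

posts = [
    {"text": "breaking: result just announced", "relevance": 0.7, "age_hours": 1},
    {"text": "thorough thread from last month", "relevance": 0.9, "age_hours": 720},
]
ranked = sorted(posts, key=lambda p: score(p["relevance"], p["age_hours"]),
                reverse=True)
# The fresher post wins despite its lower raw relevance score
```

Continuous pre-training, by contrast, would require a full training pass to surface the same day-old information.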
Grok Benchmark Performance
Available Results
Grok Performance on Key Benchmarks
| Benchmark | Grok-2 | GPT-4o | Claude 3.5 Sonnet | DeepSeek V3 |
|---|---|---|---|---|
| MMLU | 87.5 | 88.7 | 88.3 | 88.5 |
| HumanEval | 88.4 | 90.2 | 92.0 | 92.7 |
| MATH | 76.1 | 76.6 | 78.3 | 90.2 |
| GPQA | 56.0 | 53.6 | 59.4 | 59.1 |
| Arena ELO | ~1260 | ~1280 | ~1270 | ~1290 |
Grok-2 performs at or near the frontier on most benchmarks but does not lead any single category. Grok-3 is expected to improve significantly given the Colossus training scale.
xAI’s Scaling Philosophy
Compute-First Approach
xAI’s strategy differs from other labs:
def xai_vs_others():
    """
    Compare xAI's approach to other frontier labs.
    """
    philosophies = {
        "xAI (Grok)": {
            "core_bet": "Scale compute aggressively",
            "hardware_strategy": "Build the largest possible single cluster",
            "data_strategy": "Leverage X/Twitter for unique real-time data",
            "architecture_strategy": "Follow proven designs, scale them",
            "alignment_strategy": "Less emphasis on safety, more on capabilities",
            "release_strategy": "Partially open (Grok-1), mostly closed",
        },
        "Anthropic (Claude)": {
            "core_bet": "Alignment is the bottleneck",
            "hardware_strategy": "Cloud (AWS partnership)",
            "data_strategy": "Standard web + high-quality curation",
            "architecture_strategy": "Dense, standard, focus on alignment methods",
            "alignment_strategy": "Constitutional AI, extensive RLHF",
            "release_strategy": "Closed (API only)",
        },
        "DeepSeek": {
            "core_bet": "Efficiency is the bottleneck",
            "hardware_strategy": "Limited hardware, maximize efficiency",
            "data_strategy": "Standard web + synthetic",
            "architecture_strategy": "Innovate on architecture (MoE, MLA, FP8)",
            "alignment_strategy": "GRPO, moderate",
            "release_strategy": "Open weights + detailed technical reports",
        },
        "Meta (Llama)": {
            "core_bet": "Open-source ecosystem is a competitive moat",
            "hardware_strategy": "Massive GPU fleet (600K+ H100)",
            "data_strategy": "15T+ tokens, broad coverage",
            "architecture_strategy": "Simple dense architecture, over-train for inference efficiency",
            "alignment_strategy": "DPO, safety SFT",
            "release_strategy": "Fully open weights",
        },
    }
    return philosophies
The Brute-Force Argument
xAI’s bet is that raw scale — more GPUs, more data, more training time — can compensate for less architectural innovation:
def scale_vs_efficiency():
    """
    When does raw scale beat architectural innovation?
    """
    # DeepSeek V3: 671B params, 14.8T tokens, $5.6M training cost
    # Grok-3 (est): 1T+ params, 50T+ tokens, $100M+ training cost
    #
    # If Grok-3 trains with 20x more compute at constant architecture,
    # scaling laws predict:
    #   loss ratio = (compute_ratio)^(-0.05)
    #   20x compute -> 20^(-0.05) = 0.86 -> ~14% lower loss
    #
    # DeepSeek V3 achieved better results through efficiency:
    #   - MoE (train more params for the same FLOPs)
    #   - FP8 (double throughput)
    #   - DualPipe (eliminate pipeline bubbles)
    #   Combined: ~18x more effective than naive training
    #
    # So 20x brute-force compute vs 18x efficiency improvement:
    # these are roughly comparable.
    # But xAI ALSO has architectural innovations (just less published).
    analysis = {
        "brute_force_advantage": "20x more compute",
        "efficiency_advantage": "18x from MoE + FP8 + DualPipe",
        "conclusion": "At equivalent scale, efficiency wins. "
                      "But xAI has BOTH scale AND (presumably) some efficiency.",
    }
    return analysis
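The scaling-law step in that comparison can be written as a standalone check, using the same rough -0.05 compute exponent assumed above:

```python
def loss_ratio(compute_multiplier, exponent=-0.05):
    # Predicted final-loss ratio when compute scales by `compute_multiplier`
    return compute_multiplier ** exponent

improvement_20x = 1.0 - loss_ratio(20)    # ~0.14, i.e. ~14% lower loss
improvement_400x = 1.0 - loss_ratio(400)  # ~0.26: strongly diminishing returns
```

Going from 20x to 400x compute less than doubles the predicted loss improvement, which is the diminishing-returns argument in miniature.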
Training Compute Budget Comparison (Estimated) (chart: relative training compute, arbitrary units)
Grok-1 Open Source Analysis
Code Structure
When xAI open-sourced Grok-1, the community analyzed the architecture:
import torch
import torch.nn as nn

class Grok1MoELayer(nn.Module):
    """
    Grok-1 MoE layer (reconstructed from the open-source release).
    """

    def __init__(self, d_model=6144, d_ff=49152, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Router
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Experts: standard GELU FFN (not SwiGLU)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff, bias=False),
                nn.GELU(),
                nn.Linear(d_ff, d_model, bias=False),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, d_model)
        logits = self.gate(x)
        probs = torch.softmax(logits, dim=-1)
        top_k_probs, top_k_idx = probs.topk(self.top_k, dim=-1)
        weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        output = torch.zeros_like(x)
        # Reference (unoptimized) dispatch: route each token to its top-k experts
        for k in range(self.top_k):
            for e in range(self.num_experts):
                mask = top_k_idx[:, k] == e
                if mask.any():
                    expert_out = self.experts[e](x[mask])
                    output[mask] += weights[mask, k:k + 1] * expert_out
        return output
Community Observations
When Grok-1 was released, the community noted several architectural choices:
GROK1_COMMUNITY_OBSERVATIONS = {
    "gelu_not_swiglu": {
        "observation": "Uses GELU instead of the near-universal SwiGLU",
        "interpretation": "Likely developed early, before SwiGLU became standard",
    },
    "8_experts_only": {
        "observation": "Only 8 experts with top-2 (same as Mixtral)",
        "interpretation": "Conservative MoE design, not fine-grained like DeepSeek",
    },
    "no_shared_expert": {
        "observation": "No shared/always-active expert",
        "interpretation": "Simpler architecture, shared experts not yet adopted",
    },
    "short_context": {
        "observation": "Only 8K context (extended later)",
        "interpretation": "Initial training focused on short sequences",
    },
    "large_ffn_dim": {
        "observation": "d_ff=49152 for d_model=6144 (8x widening factor)",
        "interpretation": "High FFN/attention ratio, prioritizing capacity",
    },
}
What Grok-3 Likely Changed
Architectural Upgrades
Based on the competitive landscape and Colossus’s capabilities:
def likely_grok3_improvements():
    """
    Likely improvements from Grok-1 to Grok-3.
    """
    improvements = {
        "swiglu_activation": {
            "from": "GELU",
            "to": "SwiGLU",
            "reason": "Universal consensus, 1-3% quality improvement",
            "confidence": "Very high",
        },
        "fine_grained_moe": {
            "from": "8 experts, top-2",
            "to": "64-256 experts, top-8 to top-16",
            "reason": "DeepSeek V3 showed fine-grained MoE is markedly better",
            "confidence": "High",
        },
        "extended_context": {
            "from": "8K",
            "to": "128K-1M",
            "reason": "Competitive requirement",
            "confidence": "Very high",
        },
        "mla_or_advanced_attention": {
            "from": "Standard GQA",
            "to": "MLA or GQA with larger groups",
            "reason": "KV cache reduction for long context",
            "confidence": "Medium",
        },
        "fp8_training": {
            "from": "BF16 (presumed)",
            "to": "FP8 for experts",
            "reason": "2x throughput on H100, well-proven by DeepSeek V3",
            "confidence": "High",
        },
    }
    return improvements
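The "KV cache reduction" motivation is easy to quantify. A small calculator, using Grok-1's published GQA shape and a hypothetical 128K context (the long-context figure is an assumption from the inferred table above):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Per sequence: K and V (factor of 2) x layers x kv_heads x head_dim x tokens
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Grok-1 GQA shape: 64 layers, 8 KV heads, head_dim 128, BF16 cache
at_8k = kv_cache_bytes(64, 8, 128, 8192)      # 2 GiB per sequence
at_128k = kv_cache_bytes(64, 8, 128, 131072)  # 32 GiB per sequence
```

At 128K tokens the cache reaches ~32 GiB per sequence, which is exactly the pressure that MLA-style latent compression or larger GQA groups would relieve.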
The Data Moat
X/Twitter as Training Data
def x_data_competitive_analysis():
    """
    Analyze X/Twitter data as a competitive advantage.
    """
    advantages = {
        "volume": {
            "description": "~18T tokens/year of fresh data",
            "comparison": "Comparable to the entire Llama 3 training set, annually",
        },
        "real_time": {
            "description": "Data available within seconds of creation",
            "comparison": "Common Crawl has months of latency",
        },
        "conversational": {
            "description": "Natural dialogue, debates, Q&A threads",
            "comparison": "Web crawl data is mostly articles, not conversations",
        },
        "diverse_expertise": {
            "description": "Posts from domain experts, scientists, engineers",
            "comparison": "More diverse perspectives than curated datasets",
        },
    }
    challenges = {
        "noise": "80%+ of posts are too low-quality for training",
        "length": "Most posts are very short (under 280 characters)",
        "bias": "User demographics skew toward certain groups",
        "toxicity": "Significant amount of toxic content to filter",
        "legal": "Copyright and data-usage concerns",
    }
    return advantages, challenges
X/Twitter data provides something no other data source offers: real-time human conversation about every conceivable topic. While most of it is noise, the signal-to-noise ratio after aggressive filtering yields high-value training data for conversational AI, current events understanding, and multi-perspective reasoning. No other frontier lab has equivalent access to this data type at this scale.
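A sketch of what "aggressive filtering" could look like as cheap heuristic passes; the thresholds and rules here are illustrative, not xAI's actual pipeline:

```python
def keep_post(text, min_chars=40, max_link_ratio=0.3):
    # Cheap heuristic filters: length, link spam, heavy repetition
    if len(text) < min_chars:
        return False                       # too short to teach anything
    words = text.split()
    links = sum(w.startswith("http") for w in words)
    if words and links / len(words) > max_link_ratio:
        return False                       # likely link spam
    if len(set(words)) < len(words) // 2:
        return False                       # heavy repetition
    return True

posts = [
    "gm",
    "http://x.co http://y.co http://z.co win free stuff",
    "Interesting result: the new attention variant cuts KV cache size by 4x "
    "with no measurable quality loss on long-context benchmarks.",
]
kept = [p for p in posts if keep_post(p)]
```

Real pipelines would layer classifier-based quality and toxicity models on top of heuristics like these, but even crude rules remove the bulk of the noise.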
Grok’s Distinctive Features
Personality and Style
Grok is differentiated by its conversational style:
def grok_style_analysis():
    """
    How Grok differs in interaction style from other models.
    """
    style = {
        "humor": "Encouraged: Grok is trained to use humor and wit",
        "directness": "Less hedging than Claude or ChatGPT",
        "controversial_topics": "More willing to engage with edgy topics",
        "personality": "Modeled after The Hitchhiker's Guide to the Galaxy",
        "alignment_philosophy": "Less restrictive than Anthropic or OpenAI",
    }
    return style
Technical Implications
The stylistic differences reflect different RLHF training choices:
def alignment_comparison():
    """
    Different alignment approaches produce different behaviors.
    """
    approaches = {
        "Claude": {
            "refusal_rate": "High for borderline content",
            "uncertainty_expression": "Frequent and calibrated",
            "humor": "Subtle, mostly absent",
            "training_approach": "Constitutional AI with explicit safety principles",
        },
        "GPT-4": {
            "refusal_rate": "Moderate to high",
            "uncertainty_expression": "Moderate",
            "humor": "Occasional",
            "training_approach": "RLHF with extensive safety labeling",
        },
        "Grok": {
            "refusal_rate": "Lower than Claude/GPT-4",
            "uncertainty_expression": "Less frequent",
            "humor": "Frequent, encouraged",
            "training_approach": "RLHF with a less restrictive reward model",
        },
    }
    return approaches
Summary
Grok and xAI represent the brute-force scaling approach to frontier AI:
- Grok-1 (open-sourced): 314B MoE with 8 experts, top-2 routing, GELU activation. A solid but architecturally conservative starting point.
- Colossus cluster: 100K+ H100 GPUs in a single site, the largest known training cluster. Enables training configurations impossible on distributed clusters.
- Scale bet: xAI bets that massive compute can compensate for less architectural innovation. With 20x more compute than DeepSeek V3, even moderate efficiency still produces a frontier model.
- X/Twitter data: Unique access to real-time conversational data at massive scale, enabling RAG-based real-time information integration.
- Grok-3 (inferred): Likely 1T+ parameters with fine-grained MoE, SwiGLU, extended context, and FP8 training — incorporating lessons from DeepSeek V3 and others.
- Style differentiation: Less restrictive alignment, more humor, more willingness to engage with edgy topics.
The key question is whether xAI’s compute advantage translates to sustained quality leadership, or whether efficiency-focused labs like DeepSeek can match or exceed Grok’s quality at a fraction of the cost. The scaling laws suggest diminishing returns from pure compute, but xAI’s unique data assets (X/Twitter) provide a durable advantage that no amount of efficiency can replicate.