Gemini was pretrained on text, images, audio, and video simultaneously from day one. Text tokens and image tokens flow through the same transformer layers with shared weights, unlike late-fusion models that train a language model first and add vision afterward. The result: Gemini can answer “What sound does this animal make?” by reasoning jointly over image and audio tokens in a way that LLaVA’s bolt-on vision encoder cannot. Early fusion costs more to train, but it unlocks cross-modal reasoning that adapter-based multimodal models struggle to match.
Early Fusion vs Late Fusion
The Architecture Difference
Late-fusion models (LLaVA, Llama Vision, Qwen-VL) train a language model and a vision encoder separately, then connect them with a projection layer:
```python
class LateFusionVLM:
    """
    Late fusion: separate encoders, connected by projection.
    LLaVA, Llama Vision, etc.
    """
    def __init__(self):
        # Pre-trained separately
        self.vision_encoder = CLIP_ViT()         # Frozen or fine-tuned
        self.language_model = LlamaModel()       # Pre-trained LLM
        self.projection = nn.Linear(1024, 4096)  # Learned connector

    def forward(self, text_tokens, image):
        # Encode image separately
        image_features = self.vision_encoder(image)     # [B, N_patches, 1024]
        # Project to LLM dimension
        image_tokens = self.projection(image_features)  # [B, N_patches, 4096]
        # Concatenate with text tokens
        text_embeddings = self.language_model.embed(text_tokens)
        combined = torch.cat([image_tokens, text_embeddings], dim=1)
        # Process through LLM
        output = self.language_model.transformer(combined)
        return output
```
Early-fusion models (Gemini) tokenize all modalities into a shared token space and train a single transformer on the interleaved sequence:
```python
class EarlyFusionMultimodal:
    """
    Early fusion: all modalities share the same transformer.
    Gemini architecture (simplified).
    """
    def __init__(self):
        # Modality-specific tokenizers (but shared transformer)
        self.text_tokenizer = BPETokenizer()
        self.image_tokenizer = ImageTokenizer()  # Patches to tokens
        self.audio_tokenizer = AudioTokenizer()  # Spectrograms to tokens
        self.video_tokenizer = VideoTokenizer()  # Frames to tokens
        # Single unified transformer
        self.transformer = UnifiedTransformer(d_model=8192, num_layers=128)
        # Modality-specific de-tokenizers for generation
        self.text_head = nn.Linear(8192, vocab_size)
        self.image_head = ImageDecoder()

    def forward(self, multimodal_input):
        # Tokenize each modality
        tokens = []
        for modality, data in multimodal_input:
            if modality == "text":
                tokens.extend(self.text_tokenizer(data))
            elif modality == "image":
                tokens.extend(self.image_tokenizer(data))
            elif modality == "audio":
                tokens.extend(self.audio_tokenizer(data))
            elif modality == "video":
                tokens.extend(self.video_tokenizer(data))
        # Process all tokens through the same transformer
        token_embeddings = self.embed(tokens)  # Shared embedding space
        hidden_states = self.transformer(token_embeddings)
        return hidden_states
```
Why Early Fusion Matters
The critical difference: in early fusion, image tokens attend to text tokens and vice versa at every layer. In late fusion, the vision encoder processes the image independently before the language model ever sees it.
This means early fusion can learn:
- Cross-modal attention patterns (e.g., “the red car” attending directly to the red pixels)
- Modality-aware position encoding (understanding that image tokens represent spatial layout)
- Joint representations from the first layer
Late fusion can only learn these through the projection layer, which is a bottleneck.
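A toy numpy sketch (illustrative sizes only, nothing here is Gemini's actual code) makes the difference concrete: once image and text tokens live in one sequence, a single self-attention layer already computes a score for every text-image pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
text_tokens = rng.normal(size=(3, d))   # 3 text tokens
image_tokens = rng.normal(size=(5, d))  # 5 image-patch tokens

# Early fusion: one sequence, one attention matrix over all 8 tokens.
x = np.concatenate([text_tokens, image_tokens], axis=0)  # [8, d]
scores = x @ x.T / np.sqrt(d)                            # [8, 8]
scores -= scores.max(axis=-1, keepdims=True)             # stable softmax
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)

# The off-diagonal block is direct text-to-image attention, present
# from the very first layer; late fusion has no such block until the
# projected image features reach the LLM.
cross_block = attn[:3, 3:]  # text -> image weights
print(cross_block.shape)    # (3, 5)
```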
Early Fusion vs Late Fusion Comparison
| Feature | Early Fusion (Gemini) | Late Fusion (LLaVA/Llama Vision) |
|---|---|---|
| Cross-modal attention | Every layer | Only after projection |
| Training cost | High (train from scratch) | Low (adapt pre-trained models) |
| Modality interaction depth | Deep (128 layers) | Shallow (1 projection layer) |
| Text-only performance | Strong (multimodal training helps) | Depends on LLM base |
| Adding new modalities | Requires retraining | Add new encoder + projection |
| Generation capability | Can generate any modality | Usually text-only output |
Image Tokenization
Converting Pixels to Tokens
Gemini converts images into discrete tokens that can be interleaved with text tokens. The process:
```python
class ImageTokenizer:
    """
    Convert image to a sequence of tokens for the transformer.
    Based on ViT-style patch embedding.
    """
    def __init__(
        self,
        image_size=896,   # Input resolution
        patch_size=14,    # Each patch becomes one token
        d_model=8192,
        max_tokens=4096,  # Max tokens per image
    ):
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2  # 4096 patches
        # Patch embedding: project each patch to d_model
        self.patch_embed = nn.Conv2d(
            3, d_model,
            kernel_size=patch_size,
            stride=patch_size,
        )
        # Learnable 2D position embedding
        self.pos_embed = nn.Parameter(
            torch.randn(1, self.num_patches, d_model)
        )
        # Optional: reduce token count via pooling
        self.token_reduction = nn.AdaptiveAvgPool1d(max_tokens // 4)

    def tokenize(self, image):
        """
        image: [B, 3, H, W]
        returns: [B, N_tokens, d_model]
        """
        # Extract patches
        patches = self.patch_embed(image)  # [B, d_model, H/P, W/P]
        B, D, H, W = patches.shape
        patches = patches.reshape(B, D, -1).permute(0, 2, 1)  # [B, N, D]
        # Add position embedding
        patches = patches + self.pos_embed
        return patches
```
Dynamic Resolution
Gemini supports variable-resolution images. Rather than resizing all images to a fixed size, it uses a dynamic number of tokens proportional to the image resolution:
```python
def compute_image_tokens(width, height, patch_size=14, max_tokens=4096):
    """
    Compute number of tokens for a given image resolution.
    """
    patches_w = width // patch_size
    patches_h = height // patch_size
    total_patches = patches_w * patches_h
    if total_patches > max_tokens:
        # Downsample to fit within token budget
        scale = (max_tokens / total_patches) ** 0.5
        patches_w = int(patches_w * scale)
        patches_h = int(patches_h * scale)
        total_patches = patches_w * patches_h
    return total_patches, patches_w, patches_h

# Examples
# 224x224 image:   256 tokens (compact)
# 896x896 image:   4096 tokens (detailed)
# 1792x1792 image: 4096 tokens (capped, downsampled)
```
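Running the budget logic on those example resolutions confirms the numbers; the function is repeated here so the snippet is self-contained.

```python
def compute_image_tokens(width, height, patch_size=14, max_tokens=4096):
    # Same logic as above, repeated so this example runs standalone.
    patches_w = width // patch_size
    patches_h = height // patch_size
    total = patches_w * patches_h
    if total > max_tokens:
        scale = (max_tokens / total) ** 0.5  # shrink both axes equally
        patches_w = int(patches_w * scale)
        patches_h = int(patches_h * scale)
        total = patches_w * patches_h
    return total, patches_w, patches_h

print(compute_image_tokens(224, 224))    # (256, 16, 16)
print(compute_image_tokens(896, 896))    # (4096, 64, 64)
print(compute_image_tokens(1792, 1792))  # (4096, 64, 64) -- capped at the budget
```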
Audio and Video Tokenization
Audio
Gemini processes audio as a spectrogram, divided into time-frequency patches:
```python
class AudioTokenizer:
    """
    Convert audio waveform to tokens via spectrogram patches.
    """
    def __init__(
        self,
        sample_rate=16000,
        n_fft=1024,
        hop_length=320,  # 20ms hop
        n_mels=128,
        patch_time=8,    # 8 spectrogram frames per patch
        patch_freq=16,   # 16 mel bins per patch
        d_model=8192,
    ):
        self.mel_spec = MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            n_mels=n_mels,
        )
        self.patch_time = patch_time
        self.patch_freq = patch_freq
        # Project patches to d_model
        patch_dim = patch_time * patch_freq
        self.patch_proj = nn.Linear(patch_dim, d_model)

    def tokenize(self, waveform):
        """
        waveform: [B, num_samples]
        returns: [B, N_tokens, d_model]
        """
        # Compute mel spectrogram
        spec = self.mel_spec(waveform)  # [B, n_mels, T_frames]
        # Reshape into patches
        B, F, T = spec.shape
        # Pad to align with patch sizes
        T_padded = ((T + self.patch_time - 1) // self.patch_time) * self.patch_time
        F_padded = ((F + self.patch_freq - 1) // self.patch_freq) * self.patch_freq
        spec = torch.nn.functional.pad(spec, (0, T_padded - T, 0, F_padded - F))
        # Extract patches
        patches = spec.unfold(1, self.patch_freq, self.patch_freq)   # freq patches
        patches = patches.unfold(2, self.patch_time, self.patch_time)  # time patches
        # Shape: [B, F/pf, T/pt, pf, pt]
        patches = patches.reshape(B, -1, self.patch_freq * self.patch_time)
        # Project to d_model
        tokens = self.patch_proj(patches)
        return tokens
```
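With the illustrative hyperparameters in the sketch above (assumptions for this post, not Gemini's published values), the raw token rate works out as follows; a production system would likely compress further before reaching the budget figures quoted later.

```python
# Token-rate arithmetic for the illustrative AudioTokenizer hyperparameters.
sample_rate = 16000
hop_length = 320          # 20 ms per spectrogram frame
n_mels = 128
patch_time, patch_freq = 8, 16

frames_per_second = sample_rate / hop_length              # 50.0 frames/s
time_patches_per_second = frames_per_second / patch_time  # 6.25 patch columns/s
freq_patches = n_mels / patch_freq                        # 8 patch rows
tokens_per_second = time_patches_per_second * freq_patches
print(tokens_per_second)  # 50.0
```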
Video
Video tokenization decomposes frames temporally and spatially:
```python
class VideoTokenizer:
    """
    Tokenize video: sample frames, tokenize each, add temporal position.
    """
    def __init__(
        self,
        fps_sample=1,          # Sample 1 frame per second
        image_tokenizer=None,
        max_frames=300,        # 5 minute video at 1fps
        d_model=8192,
    ):
        self.fps_sample = fps_sample
        self.image_tokenizer = image_tokenizer or ImageTokenizer()
        self.max_frames = max_frames
        # Temporal position embedding
        self.temporal_embed = nn.Embedding(max_frames, d_model)

    def tokenize(self, video_frames):
        """
        video_frames: [B, T, 3, H, W]
        returns: [B, T * tokens_per_frame, d_model]
        """
        B, T, C, H, W = video_frames.shape
        T = min(T, self.max_frames)
        all_tokens = []
        for t in range(T):
            frame = video_frames[:, t]  # [B, 3, H, W]
            frame_tokens = self.image_tokenizer.tokenize(frame)  # [B, N, D]
            # Add temporal position
            temporal_pos = self.temporal_embed(
                torch.tensor([t], device=frame.device)
            ).unsqueeze(1)  # [1, 1, D]
            frame_tokens = frame_tokens + temporal_pos
            all_tokens.append(frame_tokens)
        tokens = torch.cat(all_tokens, dim=1)  # [B, T*N, D]
        return tokens
```
Token Budget
Token Budget per Modality in Gemini
| Modality | Tokens per Unit | Example | Context Budget |
|---|---|---|---|
| Text | 1 token per subword | 1000 words = ~750 tokens | Standard |
| Image (low res) | ~256 tokens | 224x224 image | Compact |
| Image (high res) | ~4096 tokens | 896x896 image | Detailed |
| Audio (per second) | ~25 tokens | 1 minute = ~1500 tokens | Efficient |
| Video (per second) | ~280 tokens | 1 minute = ~16,800 tokens | Expensive |
| Video (1 hour) | ~1M tokens | Fills context window | Maximum |
The 1M Token Context Window
How Google Achieved 1M Tokens
Gemini 1.5 Pro supports a 1 million token context window — 8x longer than Llama 3.1’s 128K. This requires solving three problems:
- Attention complexity: Standard attention is quadratic, O(n²), in sequence length. At 1M tokens, that is on the order of 10^12 attention scores per layer per head.
- KV cache: At 1M tokens, standard KV cache exceeds available GPU memory.
- Position encoding: RoPE and other position encodings must generalize to positions never seen during training.
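Back-of-the-envelope arithmetic makes the first two problems concrete. The dimensions below are hypothetical (Gemini's layer count, head count, and head size are not public); the point is the order of magnitude.

```python
# Hypothetical model dimensions -- Gemini's real ones are not disclosed.
seq_len = 1_000_000
num_layers = 128
num_kv_heads = 16
head_dim = 128
bytes_per_elem = 2  # bf16

# Problem 1: attention scores, one per token pair, per layer per head.
pairs = seq_len ** 2
print(f"{pairs:.1e}")  # 1.0e+12

# Problem 2: KV cache = 2 (K and V) x layers x kv_heads x head_dim x seq x bytes.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_bytes / 1e12:.2f} TB")  # 1.05 TB -- far beyond one accelerator's 80-95 GB HBM
```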
Sparse Attention
Google has not published the exact attention mechanism, but based on their research (Infini-attention, Ring Attention), the likely approach:
```python
class EfficientLongContextAttention(nn.Module):
    """
    Hypothesized attention mechanism for 1M context.
    Combines local attention, global attention, and compressed memory.
    """
    def __init__(
        self,
        d_model,
        num_heads,
        local_window=4096,  # Full attention within window
        global_stride=64,   # Global tokens every 64 positions
        memory_size=2048,   # Compressed memory slots
    ):
        super().__init__()
        self.local_window = local_window
        self.global_stride = global_stride
        self.memory_size = memory_size
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # Compressed memory
        self.memory_keys = nn.Parameter(
            torch.randn(memory_size, d_model // num_heads)
        )
        self.memory_values = nn.Parameter(
            torch.randn(memory_size, d_model // num_heads)
        )

    def forward(self, x, position_ids):
        B, T, D = x.shape
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        # Local attention: each token attends to nearby tokens
        local_output = self._local_attention(q, k, v, self.local_window)
        # Global attention: attend to evenly spaced global tokens
        global_indices = torch.arange(0, T, self.global_stride)
        global_k = k[:, global_indices]
        global_v = v[:, global_indices]
        global_output = self._cross_attention(q, global_k, global_v)
        # Memory attention: attend to compressed memory
        memory_output = self._memory_attention(q)
        # Combine via gating
        output = local_output + global_output + memory_output
        return self.o_proj(output)

    def _local_attention(self, q, k, v, window):
        """Sliding window attention."""
        # Standard attention but masked to local window
        # Implementation uses block-sparse attention
        pass

    def _cross_attention(self, q, k, v):
        """Cross-attention to global/memory tokens."""
        pass

    def _memory_attention(self, q):
        """Attend to compressed memory slots."""
        pass
```
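The local-attention component can be made concrete with a minimal numpy implementation of causal sliding-window masking. This naive version still materializes the full score matrix, so it demonstrates the attention pattern, not the memory savings of a real block-sparse kernel.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Causal sliding-window attention: token i attends to tokens (i - window, i]."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # [T, T]
    idx = np.arange(T)
    # Causal mask AND local-window mask.
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
out = sliding_window_attention(x, x, x, window=4)
print(out.shape)  # (10, 8)
```

Token 0 can only attend to itself, so its output equals its own value vector; every later token mixes at most `window` values.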
Infini-Attention (Google Research)
Google published Infini-attention, which combines standard attention with a compressive memory that stores summaries of distant context:
```python
class InfiniAttention(nn.Module):
    """
    Infini-attention: local attention + compressive memory.
    From 'Leave No Context Behind' (Munkhdalai et al., 2024).
    """
    def __init__(self, d_model, num_heads, segment_size=4096):
        super().__init__()
        self.segment_size = segment_size
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # Standard attention projections
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # Compressive memory (per head): stores key-value associations
        # via linear attention. Populated on the first segment.
        self.memory_value = None  # Updated during forward
        self.normalizer = None
        # Gating between local attention and memory
        self.gate = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x):
        B, T, D = x.shape
        num_segments = (T + self.segment_size - 1) // self.segment_size
        outputs = []
        for seg in range(num_segments):
            start = seg * self.segment_size
            end = min(start + self.segment_size, T)
            segment = x[:, start:end]
            # Standard local attention within segment
            qkv = self.qkv_proj(segment)
            q, k, v = qkv.chunk(3, dim=-1)
            q = q.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
            local_attn = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
            local_attn = torch.softmax(local_attn, dim=-1)
            local_out = torch.matmul(local_attn, v)
            # Memory retrieval (from all previous segments)
            if self.memory_value is not None:
                sigma_q = torch.nn.functional.elu(q) + 1
                memory_out = torch.matmul(sigma_q, self.memory_value)
                memory_norm = torch.matmul(sigma_q, self.normalizer.unsqueeze(-1))
                memory_out = memory_out / (memory_norm + 1e-6)
            else:
                memory_out = torch.zeros_like(local_out)
            # Gate between local and memory
            gate = torch.sigmoid(self.gate).view(1, self.num_heads, 1, 1)
            combined = gate * memory_out + (1 - gate) * local_out
            outputs.append(combined)
            # Update memory with current segment's KV
            sigma_k = torch.nn.functional.elu(k) + 1
            if self.memory_value is None:
                self.memory_value = torch.matmul(sigma_k.transpose(-2, -1), v)
                self.normalizer = sigma_k.sum(dim=-2)
            else:
                self.memory_value += torch.matmul(sigma_k.transpose(-2, -1), v)
                self.normalizer += sigma_k.sum(dim=-2)
        output = torch.cat(outputs, dim=2)
        output = output.transpose(1, 2).reshape(B, T, D)
        return self.o_proj(output)
```
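The compressive-memory math can be seen in isolation: the memory matrix accumulates sigma(K)^T V across segments, a normalizer accumulates the column sums of sigma(K), and a query retrieves sigma(q) M / (sigma(q) z), where sigma is ELU + 1 as in the paper. A small numpy sketch (toy sizes, assumed shapes):

```python
import numpy as np

def sigma(x):
    # ELU(x) + 1: equals x + 1 for x > 0, exp(x) otherwise. Keeps features positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
d = 4
M = np.zeros((d, d))  # compressive memory: accumulated key-value associations
z = np.zeros(d)       # normalizer: accumulated key feature sums

for _ in range(3):  # three "past segments"
    K = rng.normal(size=(16, d))
    V = rng.normal(size=(16, d))
    M += sigma(K).T @ V        # constant-size update, regardless of segment count
    z += sigma(K).sum(axis=0)

# Retrieval for a single query vector: O(d^2), independent of context length.
q = rng.normal(size=(d,))
retrieved = (sigma(q) @ M) / (sigma(q) @ z + 1e-6)
print(retrieved.shape)  # (4,)
```

The key property: memory size is fixed at d x d per head, so arbitrarily long context costs constant memory, at the price of lossy compression.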
Context Length Comparison Across Frontier Models (chart: max context tokens)

TPU-Optimized Training
Why Google Uses TPUs
Google trains Gemini on TPU v5p pods, not NVIDIA GPUs. TPUs offer several advantages for Google’s specific requirements:
```python
def tpu_vs_gpu_comparison():
    """
    Compare TPU v5p to H100 for large-scale training.
    """
    hardware = {
        "TPU v5p": {
            "bf16_tflops": 459,
            "hbm_gb": 95,
            "hbm_bandwidth_tbps": 4.8,
            "interconnect": "ICI (inter-chip interconnect)",
            "interconnect_bw_gbps": 4800,  # Per chip, within pod
            "max_pod_chips": 8960,
            "cost_advantage": "Google internal, no markup",
        },
        "H100 SXM": {
            "bf16_tflops": 989,
            "hbm_gb": 80,
            "hbm_bandwidth_tbps": 3.35,
            "interconnect": "NVLink + InfiniBand",
            "interconnect_bw_gbps": 900,  # NVLink within node
            "max_cluster": "~100K GPUs",
            "cost_advantage": "Available from cloud providers",
        },
    }
    return hardware
```
TPU v5p vs H100 Comparison
| Feature | TPU v5p | H100 SXM | Advantage |
|---|---|---|---|
| BF16 TFLOPS | 459 | 989 | H100 (2.2x raw compute) |
| HBM capacity | 95 GB | 80 GB | TPU (19% more) |
| HBM bandwidth | 4.8 TB/s | 3.35 TB/s | TPU (43% more) |
| Intra-pod interconnect | 4.8 TB/s ICI | 0.9 TB/s NVLink | TPU (5.3x faster) |
| Pod/cluster scale | 8960 chips | ~16K GPUs (practical) | Comparable |
| All-reduce latency | ~1 us (ICI) | ~5 us (NVLink) | TPU (5x lower) |
The TPU advantage for Gemini training is interconnect bandwidth. Multimodal training with very long sequences requires frequent communication (all-reduce for data parallelism, all-to-all for expert parallelism if MoE is used). TPU ICI provides 5.3x more bandwidth than NVLink, which matters enormously at 1M token context lengths where communication volume is proportional to sequence length.
At 1M token context, the attention computation produces intermediate tensors whose size scales as O(n²) with naive attention (or O(n) with memory-efficient attention). These tensors must be communicated across chips during sequence parallelism. TPU ICI’s 4.8 TB/s bandwidth makes this feasible; NVLink’s 0.9 TB/s would create a severe bottleneck. This is a key reason Gemini can support 1M context while GPU-trained models are typically limited to 128K-200K.
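A rough sense of scale, using hypothetical dimensions (Gemini's real d_model and parallelism layout are not public): moving one layer's activations for a 1M-token sequence is a multi-gigabyte transfer, and the interconnect bandwidth sets how long that takes.

```python
# Hypothetical numbers for illustration only.
seq_len = 1_000_000
d_model = 8192
bytes_per_elem = 2  # bf16

activation_bytes = seq_len * d_model * bytes_per_elem
print(f"{activation_bytes / 1e9:.1f} GB per layer")  # 16.4 GB per layer

for name, bw_tbps in [("TPU ICI", 4.8), ("NVLink", 0.9)]:
    t_ms = activation_bytes / (bw_tbps * 1e12) * 1e3
    print(f"{name}: {t_ms:.1f} ms per layer")  # ICI: 3.4 ms, NVLink: 18.2 ms
```

Multiplied over a hundred-plus layers per step, the roughly 5x gap compounds into a first-order training-throughput difference.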
Multimodal Generation
Generating Images
Unlike most VLMs which can only output text, Gemini can generate images, audio, and video. The generation approach:
```python
class MultimodalGenerator:
    """
    Generate multiple modalities from the unified transformer.
    """
    def __init__(self, transformer, text_head, image_decoder, audio_decoder):
        self.transformer = transformer
        self.text_head = text_head
        self.image_decoder = image_decoder
        self.audio_decoder = audio_decoder

    def generate(self, prompt_tokens, target_modality="text"):
        """
        Generate output in the specified modality.
        """
        # Run transformer on prompt
        hidden_states = self.transformer(prompt_tokens)
        if target_modality == "text":
            return self._generate_text(hidden_states)
        elif target_modality == "image":
            return self._generate_image(hidden_states)
        elif target_modality == "audio":
            return self._generate_audio(hidden_states)

    def _generate_text(self, hidden_states):
        """Standard autoregressive text generation."""
        logits = self.text_head(hidden_states[:, -1:])
        return logits.argmax(dim=-1)

    def _generate_image(self, hidden_states):
        """
        Generate image tokens autoregressively,
        then decode to pixels.
        """
        image_tokens = []
        context = hidden_states
        for _ in range(4096):  # Max image tokens
            # Predict next image token
            logits = self.image_decoder.token_head(context[:, -1:])
            next_token = logits.argmax(dim=-1)
            image_tokens.append(next_token)
            # Update context
            token_embed = self.image_decoder.embed(next_token)
            context = torch.cat([context, token_embed], dim=1)
        # Decode discrete tokens to pixels
        image = self.image_decoder.detokenize(torch.cat(image_tokens, dim=1))
        return image
```
Interleaved Multimodal Output
Gemini can produce interleaved text and images in a single response — for example, generating a step-by-step instruction with diagrams:
```python
def interleaved_generation_example():
    """
    Example of interleaved multimodal generation.
    """
    output_sequence = [
        ("text", "Step 1: Draw a circle with radius 5cm."),
        ("image", "[generated diagram of circle]"),
        ("text", "Step 2: Draw a tangent line from point P."),
        ("image", "[generated diagram with tangent]"),
        ("text", "The tangent line is perpendicular to the radius at the point of tangency."),
    ]
    # The transformer generates this as a single autoregressive sequence,
    # with special tokens marking modality transitions.
    return output_sequence
```
Architecture Variants
Gemini Model Family
```python
GEMINI_FAMILY = {
    "Gemini 1.5 Flash": {
        "params": "Unknown (rumored ~30B)",
        "architecture": "Dense or light MoE",
        "context": 1_000_000,
        "modalities": ["text", "image", "audio", "video"],
        "purpose": "Fast, cost-effective inference",
    },
    "Gemini 1.5 Pro": {
        "params": "Unknown (rumored MoE, 100B+ active)",
        "architecture": "Likely MoE",
        "context": 1_000_000,
        "modalities": ["text", "image", "audio", "video"],
        "purpose": "Balanced quality and speed",
    },
    "Gemini 2.0 Flash": {
        "params": "Unknown",
        "architecture": "Likely MoE with efficiency improvements",
        "context": 1_000_000,
        "modalities": ["text", "image", "audio", "video", "tool use"],
        "purpose": "Agentic applications",
    },
    "Gemini Ultra / 2.0 Pro": {
        "params": "Unknown (rumored MoE, 200B+ active)",
        "architecture": "MoE",
        "context": 1_000_000,
        "modalities": ["text", "image", "audio", "video"],
        "purpose": "Maximum quality",
    },
}
```
Google has published minimal architectural details for Gemini compared to Meta (Llama) or DeepSeek. The original Gemini 1.0 technical report describes early fusion and TPU training but omits specifics like layer count, d_model, and attention mechanism details. The analysis in this post is partially inferred from Google’s research publications and community analysis.
Benchmark Performance
Multimodal Benchmarks
Multimodal Benchmark Comparison
| Benchmark | Gemini 1.5 Pro | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| MMMU (multimodal understanding) | 62.2 | 69.1 | 68.3 |
| MathVista (visual math) | 63.9 | 63.8 | 67.7 |
| DocVQA (document understanding) | 93.1 | 92.8 | 95.2 |
| Video-MME (video understanding) | 75.0 | 71.9 | N/A |
| FLEURS (speech recognition) | 7.0 WER | N/A | N/A |
| Needle-in-Haystack (1M ctx) | 99.7% | N/A (128K max) | N/A (200K max) |
Long-Context Performance
Gemini’s 1M context enables use cases impossible for other models:
```python
def long_context_use_cases():
    """
    Applications enabled by 1M token context.
    """
    cases = {
        "entire_codebase": {
            "description": "Load an entire codebase (100K+ LOC) in one context",
            "tokens_needed": "~300K-500K",
            "gemini_capable": True,
            "gpt4o_capable": False,
        },
        "hour_long_video": {
            "description": "Process a full 1-hour video with transcription",
            "tokens_needed": "~1M (video frames + audio + text)",
            "gemini_capable": True,
            "gpt4o_capable": False,
        },
        "book_analysis": {
            "description": "Analyze a full novel (200-300 pages)",
            "tokens_needed": "~100K-150K",
            "gemini_capable": True,
            "gpt4o_capable": True,  # Fits in 128K
        },
        "multi_document_qa": {
            "description": "QA over hundreds of documents simultaneously",
            "tokens_needed": "~500K-1M",
            "gemini_capable": True,
            "gpt4o_capable": False,
        },
    }
    return cases
```
Needle-in-Haystack Accuracy at Different Context Lengths (chart: retrieval accuracy, %)

Late Fusion vs Early Fusion: Detailed Analysis
Where Early Fusion Wins
```python
def early_fusion_advantages():
    """
    Tasks where early fusion significantly outperforms late fusion.
    """
    advantages = {
        "spatial_reasoning": {
            "task": "Where is the red ball relative to the blue box?",
            "early_fusion": "Text tokens attend to spatial positions of visual objects",
            "late_fusion": "Vision encoder must encode spatial relationships before LLM sees them",
            "gap": "~10% accuracy difference",
        },
        "fine_grained_ocr": {
            "task": "Read small text in a complex document layout",
            "early_fusion": "Every image patch token is available at full resolution",
            "late_fusion": "Vision encoder may lose fine details in its fixed-size output",
            "gap": "~5-15% accuracy on small text",
        },
        "audio_visual_grounding": {
            "task": "Which person in the video is speaking?",
            "early_fusion": "Audio and video tokens attend to each other at every layer",
            "late_fusion": "Typically cannot do audio-visual correlation",
            "gap": "Qualitative difference in capability",
        },
        "multimodal_generation": {
            "task": "Generate an image matching a text description",
            "early_fusion": "Can output image tokens directly",
            "late_fusion": "Cannot generate images (text output only)",
            "gap": "Capability gap (late fusion cannot do this at all)",
        },
    }
    return advantages
```
Where Late Fusion Wins
```python
def late_fusion_advantages():
    """
    Tasks and dimensions where late fusion is preferable.
    """
    advantages = {
        "training_efficiency": {
            "description": "Reuse pre-trained LLM and vision encoder",
            "early_fusion_cost": "Train entire model from scratch on multimodal data",
            "late_fusion_cost": "Only train projection layer + fine-tune",
            "gap": "10-100x less training compute",
        },
        "text_only_performance": {
            "description": "Pure text tasks (no visual input)",
            "early_fusion": "May allocate capacity to vision features unnecessarily",
            "late_fusion": "LLM backbone optimized purely for text",
            "gap": "Marginal, but late fusion can be slightly better",
        },
        "modularity": {
            "description": "Adding new modalities or upgrading components",
            "early_fusion": "Must retrain the entire model",
            "late_fusion": "Swap out vision encoder or LLM independently",
            "gap": "Significant engineering advantage for late fusion",
        },
    }
    return advantages
```
Inference Cost
The Multimodal Tax
Processing multimodal inputs is more expensive than text-only:
```python
def multimodal_inference_cost():
    """
    Compare cost of processing different input types.
    """
    costs = {
        "text_1000_words": {
            "tokens": 750,
            "relative_cost": 1.0,
        },
        "single_image_lowres": {
            "tokens": 256,
            "relative_cost": 0.34,
        },
        "single_image_highres": {
            "tokens": 4096,
            "relative_cost": 5.5,
        },
        "1_minute_audio": {
            "tokens": 1500,
            "relative_cost": 2.0,
        },
        "1_minute_video": {
            "tokens": 16800,
            "relative_cost": 22.4,
        },
        "1_hour_video": {
            "tokens": 1000000,
            "relative_cost": 1333.0,
        },
    }
    return costs
```
Inference Cost by Input Type (Relative to 1K Words of Text)
| Input | Tokens | Relative Cost | Notes |
|---|---|---|---|
| 1K words text | ~750 | 1.0x | Baseline |
| 1 low-res image | ~256 | 0.34x | Cheaper than text |
| 1 high-res image | ~4096 | 5.5x | 5.5x text cost |
| 1 min audio | ~1500 | 2.0x | Moderate |
| 1 min video | ~16,800 | 22x | Expensive |
| 1 hour video | ~1M | 1333x | Fills context window |
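The relative-cost column is simply each input's token count divided by the 750-token text baseline, which can be checked directly:

```python
baseline = 750  # tokens for ~1K words of text
inputs = {
    "1 low-res image": 256,
    "1 high-res image": 4096,
    "1 min audio": 1500,
    "1 min video": 16_800,
    "1 hour video": 1_000_000,
}
relative_cost = {name: tokens / baseline for name, tokens in inputs.items()}
for name, cost in relative_cost.items():
    print(f"{name}: {cost:.2f}x")
```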
What Is Not Known
Google has disclosed less about Gemini’s architecture than any other frontier lab. Key unknowns:
- Exact parameter count: Not disclosed. Community estimates range from 30B (Flash) to 1.5T+ (Ultra).
- MoE or dense: Widely believed to be MoE but not confirmed for all variants.
- Exact attention mechanism: Whether they use standard attention, sparse attention, linear attention, or a hybrid at 1M context.
- Training data composition: No public details on the training corpus.
- Number of training tokens: Not disclosed.
- Post-training methodology: Known to use RLHF but specifics are proprietary.
This lack of transparency limits reproducibility but does not diminish the technical achievement. The 1M context window with maintained retrieval accuracy and the native multimodal generation are genuinely novel capabilities.
Summary
Gemini represents a different philosophy than Llama or DeepSeek: build a single unified model for all modalities from the ground up.
- Early fusion: All modalities tokenized and processed by the same transformer, enabling deep cross-modal reasoning.
- 1M token context: 8x longer than competitors, enabled by TPU interconnect bandwidth and (likely) sparse/compressive attention.
- Multimodal generation: Can output images, audio, and video, not just text.
- TPU-native: Training optimized for Google’s custom hardware, leveraging ICI bandwidth advantages.
- Limited transparency: Architecture details are largely proprietary, unlike Llama (fully open) or DeepSeek (detailed technical reports).
The early fusion approach is more expensive to train but produces a qualitatively different model: one that can reason across modalities at every layer of the transformer, not just at the input projection. For applications involving video, audio, and complex multimodal reasoning, Gemini’s architecture provides capabilities that late-fusion models fundamentally cannot match.