A single A100-80GB or H100-80GB costs $2-4 per hour on cloud providers. Running a 7B-parameter model for inference uses about 14 GB of memory and roughly 30% of the GPU's compute; the remaining 66 GB and 70% of compute sit idle, burning money. Time-sharing (running multiple models sequentially on the same GPU) introduces context-switch overhead and prevents quality-of-service guarantees. MPS (Multi-Process Service) allows concurrent execution but provides no memory or fault isolation: one process can crash the GPU for all processes.
Multi-Instance GPU (MIG) solves this by physically partitioning a single GPU into up to 7 isolated instances. Each instance has dedicated compute (SMs), dedicated memory (HBM slices), and dedicated memory bandwidth (L2 cache and memory controllers). One instance crashing or running poorly cannot affect another. The hardware guarantees are the same as having separate physical GPUs — but from a single piece of silicon.
This post covers the MIG architecture, the partition configurations, the practical deployment workflow, performance characteristics, and limitations.
MIG Architecture
How MIG Partitions the GPU
MIG divides the GPU’s resources along three axes:
- Compute: SMs are assigned to instances in groups. Each instance gets a fixed number of SMs and cannot use SMs assigned to other instances.
- Memory: HBM is partitioned into equal-sized slices. Each instance gets dedicated memory slices with guaranteed bandwidth.
- L2 Cache + Memory Controllers: Each memory slice has associated L2 cache and memory controllers. An instance's L2 cache only caches that instance's memory.
def mig_architecture_overview():
    """Print the MIG partitioning hierarchy."""
    # A100 80GB: 108 SMs, 80 GB HBM2e, 40 MB L2
    # H100 80GB: 132 SMs, 80 GB HBM3, 50 MB L2
    #
    # The GPU is divided into GPU Instances (GIs). Each GI contains:
    #   - a fixed number of SMs (compute)
    #   - a fixed number of memory slices (memory + bandwidth)
    #   - the associated L2 cache partitions
    #
    # Within a GI, you can create Compute Instances (CIs).
    # A CI is a further subdivision of compute within a GI:
    # CIs share the GI's memory but have dedicated SMs.
    print("=== MIG Hierarchy ===")
    print("Physical GPU")
    print(" -> GPU Instance (GI): dedicated compute + memory")
    print("    -> Compute Instance (CI): dedicated compute within a GI")
    print()
    print("Each GI has:")
    print("  - Dedicated SMs (cannot be shared)")
    print("  - Dedicated memory slices (cannot be shared)")
    print("  - Dedicated L2 cache + memory controllers")
    print("  - Dedicated video decoders/encoders (if applicable)")
    print()
    print("Each CI has:")
    print("  - Dedicated SMs within the parent GI")
    print("  - Shared memory/bandwidth with the parent GI")

mig_architecture_overview()
Partition Configurations
A100-80GB MIG Partition Configurations
| Profile | GPU Instances | SMs per Instance | Memory per Instance | L2 per Instance |
|---|---|---|---|---|
| 1g.10gb | 7 | 14 | 10 GB | 5 MB |
| 2g.20gb | 3 | 28 | 20 GB | 10 MB |
| 3g.40gb | 2 | 42 | 40 GB | 20 MB |
| 4g.40gb | 1 | 56 | 40 GB | 20 MB |
| 7g.80gb | 1 | 98 | 80 GB | 40 MB |
H100-80GB MIG Partition Configurations
| Profile | GPU Instances | SMs per Instance | Memory per Instance | L2 per Instance |
|---|---|---|---|---|
| 1g.10gb | 7 | 16-18 | 10 GB | ~7 MB |
| 1g.20gb | 4 | 16-18 | 20 GB | ~12 MB |
| 2g.20gb | 3 | 32-36 | 20 GB | ~16 MB |
| 3g.40gb | 2 | 50-54 | 40 GB | ~25 MB |
| 4g.40gb | 1 | 68-72 | 40 GB | ~25 MB |
| 7g.80gb | 1 | 114-120 | 80 GB | ~50 MB |
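The profile names encode slice counts: `Ng.Mgb` means N compute slices and enough memory slices for M GB. A quick sanity check of the A100 numbers (a sketch; `a100_profile` is an illustrative helper, and the slice sizes come from the table above — 7 compute slices of 14 SMs, 8 memory slices of 10 GB):

```python
# Sanity-check the A100-80GB profile arithmetic from the tables above.
COMPUTE_SLICES = 7   # 14 SMs each, 98 SMs usable under MIG
MEMORY_SLICES = 8    # 10 GB each, 80 GB total
SMS_PER_SLICE = 14
GB_PER_SLICE = 10

def a100_profile(compute_slices, memory_slices):
    """Return (SMs, memory GB) for a profile with the given slice counts."""
    assert 1 <= compute_slices <= COMPUTE_SLICES
    assert 1 <= memory_slices <= MEMORY_SLICES
    return compute_slices * SMS_PER_SLICE, memory_slices * GB_PER_SLICE

print(a100_profile(1, 1))  # 1g.10gb -> (14, 10)
print(a100_profile(3, 4))  # 3g.40gb -> (42, 40)
print(a100_profile(7, 8))  # 7g.80gb -> (98, 80)
```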
Creating and Managing MIG Instances
Command-Line Workflow
# Step 1: Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1
# Note: This may require rebooting or resetting the GPU driver
# Verify MIG is enabled
nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv
# Expected output: mig.mode.current
# Enabled
# Step 2: List available GPU Instance profiles
nvidia-smi mig -i 0 -lgip
# Output shows available profiles and how many can be created
# Step 3: Create GPU Instances
# Create 7x 1g.10gb instances (maximum subdivision)
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19
# Profile ID 19 = 1g.10gb on A100
# Or create 2x 3g.40gb instances
nvidia-smi mig -i 0 -cgi 9,9
# Profile ID 9 = 3g.40gb on A100
# Or create a mixed configuration: 1x 3g.40gb + 2x 2g.20gb
nvidia-smi mig -i 0 -cgi 9,14,14
# Step 4: Create Compute Instances within each GPU Instance
# List GPU Instances
nvidia-smi mig -i 0 -lgi
# Create a CI in each GI (using the default profile)
nvidia-smi mig -i 0 -gi 0 -cci
nvidia-smi mig -i 0 -gi 1 -cci
nvidia-smi mig -i 0 -gi 2 -cci
# Step 5: Verify the configuration
nvidia-smi
# Shows each MIG instance as a separate GPU device
Programmatic Instance Management
import subprocess

class MIGManager:
    """Manage MIG instances programmatically (wraps nvidia-smi)."""

    def __init__(self, gpu_index=0):
        self.gpu_index = gpu_index

    def enable_mig(self):
        """Enable MIG mode on the GPU."""
        result = subprocess.run(
            ['sudo', 'nvidia-smi', '-i', str(self.gpu_index),
             '-mig', '1'],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"Failed to enable MIG: {result.stderr}")
        print("MIG enabled. GPU reset may be required.")

    def disable_mig(self):
        """Disable MIG mode (destroys all instances first)."""
        self.destroy_all()
        result = subprocess.run(
            ['sudo', 'nvidia-smi', '-i', str(self.gpu_index),
             '-mig', '0'],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"Failed to disable MIG: {result.stderr}")
        print("MIG disabled.")

    def list_profiles(self):
        """List available GPU Instance profiles."""
        result = subprocess.run(
            ['nvidia-smi', 'mig', '-i', str(self.gpu_index), '-lgip'],
            capture_output=True, text=True
        )
        print(result.stdout)
        return result.stdout

    def create_instances(self, profile_ids):
        """Create GPU Instances with the given profile IDs.

        Args:
            profile_ids: List of profile IDs (e.g., [9, 14, 14]
                for 1x 3g.40gb + 2x 2g.20gb on A100).
        """
        profile_str = ','.join(str(p) for p in profile_ids)
        result = subprocess.run(
            ['sudo', 'nvidia-smi', 'mig', '-i', str(self.gpu_index),
             '-cgi', profile_str],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"Failed to create GIs: {result.stderr}")
        gi_result = subprocess.run(
            ['nvidia-smi', 'mig', '-i', str(self.gpu_index), '-lgi'],
            capture_output=True, text=True
        )
        print(f"Created GPU Instances: {profile_str}")
        print(gi_result.stdout)

    def create_compute_instances(self):
        """Create default compute instances in all GPU Instances."""
        result = subprocess.run(
            ['nvidia-smi', 'mig', '-i', str(self.gpu_index), '-lgi'],
            capture_output=True, text=True
        )
        # Parse GI IDs from the listing and create a default CI in each
        for line in result.stdout.strip().split('\n'):
            parts = line.split()
            if len(parts) >= 2 and parts[0].isdigit():
                gi_id = parts[0]
                subprocess.run(
                    ['sudo', 'nvidia-smi', 'mig', '-i',
                     str(self.gpu_index), '-gi', gi_id, '-cci'],
                    capture_output=True, text=True
                )
                print(f"Created CI in GI {gi_id}")

    def destroy_all(self):
        """Destroy all MIG instances (CIs must go before GIs)."""
        subprocess.run(
            ['sudo', 'nvidia-smi', 'mig', '-i', str(self.gpu_index),
             '-dci'],
            capture_output=True, text=True
        )
        subprocess.run(
            ['sudo', 'nvidia-smi', 'mig', '-i', str(self.gpu_index),
             '-dgi'],
            capture_output=True, text=True
        )
        print("All MIG instances destroyed.")

    def list_devices(self):
        """List all devices (including MIG instances) visible to the driver."""
        result = subprocess.run(['nvidia-smi', '-L'],
                                capture_output=True, text=True)
        print(result.stdout)
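On a MIG-enabled node, `nvidia-smi -L` lists each instance with a `MIG-...` UUID that you feed into CUDA_VISIBLE_DEVICES or Docker. A small parser for that output (a sketch; `parse_mig_uuids` is an illustrative helper, and the sample text mirrors the R470+ format, which can vary across driver versions):

```python
import re

def parse_mig_uuids(nvidia_smi_l_output):
    """Extract MIG device UUIDs from `nvidia-smi -L` output."""
    # Instance lines look like:
    #   MIG 1g.10gb Device 0: (UUID: MIG-xxxxxxxx-...)
    return re.findall(r'\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)', nvidia_smi_l_output)

# Fabricated sample output for illustration (UUIDs are placeholders).
sample = """\
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 1g.10gb Device 0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
  MIG 1g.10gb Device 1: (UUID: MIG-ffffffff-0000-1111-2222-333333333333)
"""
print(parse_mig_uuids(sample))  # two MIG UUIDs; the parent GPU UUID is skipped
```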
Deploying Models on MIG Instances
CUDA Device Selection
Each MIG instance appears as a separate CUDA device. Use the CUDA_VISIBLE_DEVICES environment variable to target a specific instance.
import os
import torch

def deploy_model_on_mig(mig_uuid, model_path):
    """Deploy a model on a specific MIG instance.

    MIG instances have UUIDs like:
        MIG-GPU-<uuid>/7/0  (GPU Instance 7, Compute Instance 0)
    or, on newer drivers:
        MIG-<uuid>
    """
    # Restrict CUDA to the target MIG instance. This must happen
    # before the CUDA context is created.
    os.environ['CUDA_VISIBLE_DEVICES'] = mig_uuid
    # torch.cuda.device(0) now refers to this MIG instance
    device = torch.device('cuda:0')
    print(f"Using MIG device: {torch.cuda.get_device_name(0)}")
    props = torch.cuda.get_device_properties(0)
    print(f"Memory: {props.total_memory / 1e9:.1f} GB")
    # Load model
    model = torch.load(model_path, map_location=device)
    model.eval()
    return model
def multi_model_mig_deployment():
    """Deploy multiple models on different MIG instances."""
    # Example: A100-80GB with 7x 1g.10gb instances.
    # Deploy 7 different small models, one per instance.
    mig_devices = [
        "MIG-GPU-xxxxx/1/0",  # Instance 1
        "MIG-GPU-xxxxx/2/0",  # Instance 2
        "MIG-GPU-xxxxx/3/0",  # Instance 3
        # ... etc
    ]
    models = [
        ("sentiment-model", "models/sentiment.pt"),
        ("ner-model", "models/ner.pt"),
        ("translation-model", "models/translation.pt"),
        ("summarization-model", "models/summarize.pt"),
        ("qa-model", "models/qa.pt"),
        ("embedding-model", "models/embedding.pt"),
        ("classification-model", "models/classify.pt"),
    ]
    for (name, path), device_uuid in zip(models, mig_devices):
        print(f"Deploying {name} on {device_uuid}")
        # In practice, each model runs in a separate process
        # with CUDA_VISIBLE_DEVICES set to the MIG UUID.
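The per-process pattern can be sketched as follows: each worker is launched with CUDA_VISIBLE_DEVICES pinned to one MIG UUID, so the framework inside sees exactly one device. (`run_worker` and the UUID are placeholders; a long-lived server would use `subprocess.Popen` instead of `run`, which is used here so the sketch works without a GPU.)

```python
import os
import subprocess
import sys

def run_worker(mig_uuid, argv):
    """Run a Python worker pinned to one MIG instance via its UUID."""
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = mig_uuid  # child sees only this instance
    result = subprocess.run([sys.executable] + argv,
                            env=env, capture_output=True, text=True)
    return result.stdout.strip()

# Demo without a GPU: the child just echoes the device it was pinned to.
echoed = run_worker(
    'MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee',  # placeholder UUID
    ['-c', "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
)
print(echoed)
```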
Docker and Kubernetes Integration
# Docker: specify MIG device
docker run --gpus '"device=MIG-GPU-xxxx/1/0"' \
my-inference-image python serve.py
# Kubernetes: use NVIDIA device plugin with MIG support
# nvidia-device-plugin-daemonset.yaml (excerpt)
# spec:
# containers:
# - name: nvidia-device-plugin
# image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
# args:
# - "--mig-strategy=mixed" # or "single"
def kubernetes_mig_resource_example():
    """Kubernetes pod spec requesting a MIG instance."""
    pod_spec = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "inference-server"},
        "spec": {
            "containers": [{
                "name": "model-server",
                "image": "my-inference:latest",
                "resources": {
                    "limits": {
                        # Request a specific MIG profile:
                        # nvidia.com/mig-1g.10gb: "1"  # 1/7 GPU
                        # nvidia.com/mig-2g.20gb: "1"  # 2/7 GPU
                        # nvidia.com/mig-3g.40gb: "1"  # 3/7 GPU
                        "nvidia.com/mig-1g.10gb": "1"
                    }
                }
            }]
        }
    }
    return pod_spec
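A recurring chore when writing these specs is mapping a model's memory footprint to the smallest MIG resource that fits. An illustrative helper (the profile list matches the A100 table above; `smallest_profile` is a hypothetical name, and the 20% headroom factor for activations and KV cache is an assumption, not a rule):

```python
# A100-80GB MIG profiles as k8s resource names, smallest to largest.
A100_PROFILES = [
    ("nvidia.com/mig-1g.10gb", 10),
    ("nvidia.com/mig-2g.20gb", 20),
    ("nvidia.com/mig-3g.40gb", 40),
    ("nvidia.com/mig-7g.80gb", 80),
]

def smallest_profile(model_gb, headroom=1.2):
    """Pick the smallest MIG resource that fits weights plus headroom."""
    need = model_gb * headroom  # assumed margin for activations/KV cache
    for resource, mem_gb in A100_PROFILES:
        if mem_gb >= need:
            return resource
    return None  # does not fit any MIG instance; needs full/multi-GPU

print(smallest_profile(4))    # small model -> 1g.10gb
print(smallest_profile(14))   # 7B FP16 -> 2g.20gb
print(smallest_profile(140))  # 70B FP16 -> None
```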
Performance Characteristics
Compute Scaling
MIG instances provide linear compute scaling relative to their SM count:
def mig_compute_scaling():
    """Analyze compute scaling across A100 MIG profiles."""
    # A100 profiles: SMs and their fraction of the 108 total
    a100_total_sms = 108
    profiles = {
        "1g.10gb": {"sms": 14, "mem_gb": 10},
        "2g.20gb": {"sms": 28, "mem_gb": 20},
        "3g.40gb": {"sms": 42, "mem_gb": 40},
        "4g.40gb": {"sms": 56, "mem_gb": 40},
        "7g.80gb": {"sms": 98, "mem_gb": 80},
    }
    a100_fp16_tflops = 312  # full GPU, FP16 Tensor Core (dense)
    print(f"{'Profile':<12} {'SMs':<6} {'Fraction':<10} "
          f"{'FP16 TFLOPS':<15} {'Memory':<10}")
    for name, spec in profiles.items():
        fraction = spec['sms'] / a100_total_sms
        tflops = a100_fp16_tflops * fraction
        print(f"{name:<12} {spec['sms']:<6} {fraction*100:>5.1f}%"
              f" {tflops:>8.1f}"
              f" {spec['mem_gb']:>3d} GB")

mig_compute_scaling()
Memory Bandwidth Scaling
Each MIG instance gets dedicated memory controllers, so bandwidth scales linearly:
A100-80GB MIG Instance Performance (Measured)
| Profile | SMs | HBM BW (GB/s) | FP16 TFLOPS | Decode tok/s (7B FP16) |
|---|---|---|---|---|
| 1g.10gb (1/7) | 14 | 286 | 40.5 | ~20 |
| 2g.20gb (2/7) | 28 | 571 | 81.0 | ~40 |
| 3g.40gb (3/7) | 42 | 857 | 121.5 | ~60 |
| 4g.40gb (4/7) | 56 | 1143 | 162.0 | ~80 |
| 7g.80gb (full) | 98 | 2039 | 312.0 | ~145 |
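The decode column tracks the bandwidth column almost exactly: single-stream decode is memory-bandwidth-bound, so throughput is roughly bandwidth divided by the bytes of weights read per token. A back-of-envelope check against the table (a sketch that ignores KV-cache reads and kernel launch overheads):

```python
def decode_tok_per_s(bandwidth_gb_s, model_params_b, bytes_per_param=2):
    """Bandwidth-bound decode estimate: each token reads all weights once."""
    model_gb = model_params_b * bytes_per_param  # 7B at FP16 -> 14 GB
    return bandwidth_gb_s / model_gb

# Bandwidth figures from the A100 table above.
for name, bw in [("1g.10gb", 286), ("2g.20gb", 571),
                 ("3g.40gb", 857), ("7g.80gb", 2039)]:
    print(f"{name}: ~{decode_tok_per_s(bw, 7):.0f} tok/s")
```

The estimates land within a few percent of the measured column, which is the signature of a bandwidth-bound workload.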
(Chart: MIG Instance Performance Scaling, A100-80GB, Llama-2 7B decode throughput in tokens/sec)
Isolation Quality
def mig_isolation_test():
    """Demonstrate that MIG provides true performance isolation."""
    # Test: run a stress kernel on one MIG instance and measure
    # decode throughput on an adjacent instance.
    #
    # Without MIG (time-sharing): a stress kernel on the GPU causes a
    # 50-80% slowdown for other workloads.
    # With MIG: a stress kernel on instance A has 0% impact on
    # instance B, because they use physically separate SMs, memory,
    # and cache.
    isolation_results = {
        "Time-sharing (no MIG)": {
            "neighbor_idle": "145 tok/s",
            "neighbor_stress": "45 tok/s",
            "degradation": "69%",
        },
        "MPS (Multi-Process Service)": {
            "neighbor_idle": "145 tok/s",
            "neighbor_stress": "92 tok/s",
            "degradation": "37%",
        },
        "MIG (3g.40gb + 3g.40gb)": {
            "neighbor_idle": "60 tok/s",
            "neighbor_stress": "60 tok/s",
            "degradation": "0%",
        },
    }
    print(f"{'Method':<30} {'Idle Neighbor':<18} "
          f"{'Stressed Neighbor':<18} {'Degradation'}")
    for method, results in isolation_results.items():
        print(f"{method:<30} {results['neighbor_idle']:<18} "
              f"{results['neighbor_stress']:<18} "
              f"{results['degradation']}")

mig_isolation_test()
MIG isolation is enforced by the GPU hardware, not software. An instance cannot access another instance’s memory (hardware page table enforcement), cannot use another instance’s SMs (hardware scheduling), and cannot interfere with another instance’s memory bandwidth (dedicated memory controllers). This is the same level of isolation as separate physical GPUs.
Limitations
What MIG Cannot Do
def mig_limitations():
    """Document MIG limitations."""
    limitations = {
        "No cross-instance communication": {
            "detail": "MIG instances cannot share memory or "
                      "communicate directly. No equivalent of NVLink "
                      "between instances. Cannot do tensor parallelism "
                      "across MIG instances on the same GPU.",
        },
        "No dynamic resizing": {
            "detail": "Changing the MIG configuration requires destroying "
                      "all instances and recreating them. This means "
                      "GPU downtime (~30 seconds). Cannot dynamically "
                      "grow an instance when load increases.",
        },
        "GPU feature restrictions per instance": {
            "detail": "Some GPU features are not available per instance: "
                      "GPU performance counters (Nsight Compute), "
                      "the CUDA debugger, some video codec instances. "
                      "Profiling requires disabling MIG or using "
                      "the full 7g instance.",
        },
        "Data center GPUs only": {
            "detail": "MIG is not available on consumer GPUs (RTX), "
                      "T4, V100, or older data center GPUs. It requires "
                      "an Ampere or newer data center SKU (A30, A100, "
                      "H100, and successors).",
        },
        "Minimum instance size": {
            "detail": "The smallest instance (1g.10gb) still has "
                      "14 SMs and 10 GB memory. For very small models "
                      "(under 1 GB), this wastes most of the instance. "
                      "MIG is best suited for models that use 5-70 GB.",
        },
        "No graphics/display support": {
            "detail": "MIG instances cannot run graphics workloads. "
                      "Only compute (CUDA) and video encode/decode.",
        },
    }
    for name, info in limitations.items():
        print(f"\n{name}:")
        print(f"  {info['detail']}")

mig_limitations()
Cost-Benefit Analysis
def mig_cost_analysis():
    """Analyze when MIG is cost-effective vs dedicated GPUs."""
    # Scenario: serving multiple small LLMs at $3/hr per A100
    scenarios = {
        "7 x small models (under 2B params, under 5GB each)": {
            "without_mig": {
                "gpus": 7,
                "cost_hr": 7 * 3.0,
                "utilization": "10-15% each",
            },
            "with_mig": {
                "gpus": 1,
                "cost_hr": 1 * 3.0,
                "config": "7x 1g.10gb",
                "utilization": "70-100% (7 instances)",
            },
            "savings": "7x cost reduction",
        },
        "3 x medium models (7B params, ~14GB each)": {
            "without_mig": {
                "gpus": 3,
                "cost_hr": 3 * 3.0,
            },
            "with_mig": {
                "gpus": "Not feasible (14GB > 10GB per 1g instance)",
                "config": "Need 2g.20gb or larger",
                "alternative": "2x A100 with 3g.40gb + 2g.20gb each",
            },
            "savings": "~1.5x cost reduction (2 GPUs instead of 3)",
        },
        "1 x large model (70B params, 140GB FP16)": {
            "without_mig": {
                "gpus": "2x A100 (tensor parallel)",
                "cost_hr": 2 * 3.0,
            },
            "with_mig": {
                "gpus": "Not applicable (model does not fit in any "
                        "MIG instance; needs full GPU or multi-GPU)",
            },
            "savings": "MIG not useful for large models",
        },
    }
    for scenario_name, details in scenarios.items():
        print(f"\n=== {scenario_name} ===")
        for key, val in details.items():
            if isinstance(val, dict):
                print(f"  {key}:")
                for k2, v2 in val.items():
                    print(f"    {k2}: {v2}")
            else:
                print(f"  {key}: {val}")

mig_cost_analysis()
MIG Cost-Effectiveness by Workload
| Workload | Without MIG | With MIG | Cost Savings |
|---|---|---|---|
| 7 small models (under 2B) | 7 GPUs at $21/hr | 1 GPU at $3/hr | 7x |
| 3 medium models (7B) | 3 GPUs at $9/hr | 1-2 GPUs at $3-6/hr | 1.5-3x |
| Dev/test (multiple developers) | 1 GPU per developer | 1 GPU shared via MIG | 3-7x |
| 1 large model (70B+) | 2+ GPUs (TP) | Not applicable | None (instances too small) |
| Batch processing (GPU fully utilized) | 1 GPU | 1 GPU (MIG adds overhead) | Negative |
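The first row of the table is simple division: GPUs needed with MIG is ceil(models / instances per GPU), provided every model fits in an instance. A sketch of that arithmetic (the $3/hr A100 price is the same assumption used throughout this post; function names are illustrative):

```python
import math

def gpus_needed(n_models, instances_per_gpu=7):
    """GPUs required when packing one model per MIG instance."""
    return math.ceil(n_models / instances_per_gpu)

def hourly_cost(n_models, gpu_cost_hr=3.0, instances_per_gpu=7):
    """Hourly cost (one dedicated GPU per model) vs MIG packing."""
    without_mig = n_models * gpu_cost_hr
    with_mig = gpus_needed(n_models, instances_per_gpu) * gpu_cost_hr
    return without_mig, with_mig

print(hourly_cost(7))   # (21.0, 3.0): the 7x case from the table
print(hourly_cost(10))  # (30.0, 6.0): a second GPU once past 7 models
```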
MIG vs Alternatives
Comparison with Other GPU Sharing Methods
GPU Sharing Methods Comparison
| Method | Isolation | Overhead | Granularity | Use Case |
|---|---|---|---|---|
| MIG | Hardware (full) | 0% (dedicated resources) | 1/7 to full GPU | Production multi-tenant |
| MPS (Multi-Process Service) | None (shared SMs) | 5-10% context switch | Any % | Cooperative workloads |
| Time-sharing (CUDA default) | None (sequential) | Context switch overhead | 100% (one at a time) | Simple cases |
| vGPU (NVIDIA GRID) | Software (driver-level) | 5-15% | Configurable | VDI, virtual machines |
| Kubernetes GPU scheduling | None (1 pod per GPU) | 0% | 1 GPU per pod | Simple scheduling |
Use MIG when you need guaranteed performance isolation between tenants (e.g., SLA-bound production serving, multi-customer platforms, or security-sensitive workloads where one tenant must not be able to observe another’s memory).
Use MPS when you control all workloads on the GPU and need finer-grained sharing (e.g., running 20 small inference requests concurrently that each use 5% of the GPU).
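The decision rule in the two paragraphs above can be written down as a small helper (illustrative only: the function name and the priority order are my framing of the guidance, not an official policy):

```python
def pick_sharing_method(needs_isolation, trusted_workloads, model_fits_in_mig):
    """Illustrative decision rule for choosing a GPU sharing method."""
    if needs_isolation and model_fits_in_mig:
        return "MIG"            # hardware isolation, guaranteed QoS
    if needs_isolation:
        return "dedicated GPU"  # model too large for any MIG instance
    if trusted_workloads:
        return "MPS"            # finer-grained sharing, no isolation
    return "time-sharing"       # default CUDA behavior

print(pick_sharing_method(True, False, True))    # MIG
print(pick_sharing_method(True, False, False))   # dedicated GPU
print(pick_sharing_method(False, True, True))    # MPS
```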
Summary
MIG partitions a single A100 or H100 into up to 7 hardware-isolated instances, each with dedicated compute, memory, and bandwidth. The isolation is enforced by GPU hardware — equivalent to separate physical GPUs. MIG is cost-effective for multi-tenant inference serving (7x cost reduction for small models), dev/test environments (share one GPU across a team), and any scenario where a single model under-utilizes a full GPU.
The limitations are clear: no cross-instance communication (no tensor parallelism within one GPU’s MIG instances), no dynamic resizing (reconfiguration requires instance destruction), and minimum instance size of 10 GB / 14 SMs. MIG is not useful for large models (70B+) that need the full GPU or multiple GPUs. For workloads that fully utilize a GPU, MIG adds complexity without benefit. The sweet spot is serving multiple 1-13B parameter models on a single GPU with guaranteed per-tenant performance isolation.