A single A100-80GB or H100-80GB costs $2-4 per hour on cloud providers. Running a 7B-parameter model for inference uses about 14 GB of memory and roughly 30% of the GPU's compute; the remaining 66 GB and 70% of compute sit idle, burning money. Time-sharing (running multiple models sequentially on the same GPU) introduces context-switch overhead and prevents quality-of-service guarantees. MPS (Multi-Process Service) allows concurrent execution but provides no memory or fault isolation: one process can crash the GPU for all processes.
Multi-Instance GPU (MIG) solves this by physically partitioning a single GPU into up to 7 isolated instances. Each instance has dedicated compute (SMs), dedicated memory (HBM slices), and dedicated memory bandwidth (L2 cache and memory controllers). One instance crashing or running poorly cannot affect another. The hardware guarantees are the same as having separate physical GPUs — but from a single piece of silicon.
This post covers the MIG architecture, the partition configurations, the practical deployment workflow, performance characteristics, and limitations.
MIG Architecture
How MIG Partitions the GPU
MIG divides the GPU’s resources along three axes:
- Compute: SMs are assigned to instances in groups. Each instance gets a fixed number of SMs and cannot use SMs assigned to other instances.
- Memory: HBM is partitioned into equal-sized slices. Each instance gets dedicated memory slices with guaranteed bandwidth.
- L2 Cache + Memory Controllers: Each memory slice has associated L2 cache and memory controllers. An instance's L2 cache only caches that instance's memory.
def mig_architecture_overview():
    """Print the MIG partitioning hierarchy."""
    # A100 80GB: 108 SMs, 80 GB HBM2e, 40 MB L2
    # H100 80GB: 132 SMs, 80 GB HBM3, 50 MB L2
    #
    # The GPU is divided into GPU Instances (GIs). Each GI contains:
    #   - a fixed number of SMs (compute)
    #   - a fixed number of memory slices (memory + bandwidth)
    #   - the associated L2 cache partitions
    #
    # Within a GI, you can create Compute Instances (CIs).
    # A CI is a further subdivision of compute within a GI:
    # CIs share the GI's memory but have dedicated SMs.
    print("=== MIG Hierarchy ===")
    print("Physical GPU")
    print(" -> GPU Instance (GI): dedicated compute + memory")
    print("    -> Compute Instance (CI): dedicated compute within a GI")
    print()
    print("Each GI has:")
    print("  - Dedicated SMs (cannot be shared)")
    print("  - Dedicated memory slices (cannot be shared)")
    print("  - Dedicated L2 cache + memory controllers")
    print("  - Dedicated video decoders/encoders (if applicable)")
    print()
    print("Each CI has:")
    print("  - Dedicated SMs within the parent GI")
    print("  - Shared memory/bandwidth with the parent GI")

mig_architecture_overview()
Partition Configurations
A100-80GB MIG Partition Configurations
| Profile | GPU Instances | SMs per Instance | Memory per Instance | L2 per Instance |
|---|---|---|---|---|
| 1g.10gb | 7 | 14 | 10 GB | 5 MB |
| 2g.20gb | 3 | 28 | 20 GB | 10 MB |
| 3g.40gb | 2 | 42 | 40 GB | 20 MB |
| 4g.40gb | 1 | 56 | 40 GB | 20 MB |
| 7g.80gb | 1 | 98 | 80 GB | 40 MB |
H100-80GB MIG Partition Configurations
| Profile | GPU Instances | SMs per Instance | Memory per Instance | L2 per Instance |
|---|---|---|---|---|
| 1g.10gb | 7 | 16-18 | 10 GB | ~7 MB |
| 1g.20gb | 4 | 16-18 | 20 GB | ~12 MB |
| 2g.20gb | 3 | 32-36 | 20 GB | ~16 MB |
| 3g.40gb | 2 | 50-54 | 40 GB | ~25 MB |
| 4g.40gb | 1 | 68-72 | 40 GB | ~25 MB |
| 7g.80gb | 1 | 114-120 | 80 GB | ~50 MB |
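The profile names encode slice counts: `Ng.Mgb` means N compute slices and enough memory slices for M GB. A quick sanity check of the A100 numbers (a sketch; `a100_profile` is an illustrative helper, and the slice sizes come from the table above — 7 compute slices of 14 SMs, 8 memory slices of 10 GB):

```python
# Sanity-check the A100-80GB profile arithmetic from the tables above.
COMPUTE_SLICES = 7   # 14 SMs each, 98 SMs usable under MIG
MEMORY_SLICES = 8    # 10 GB each, 80 GB total
SMS_PER_SLICE = 14
GB_PER_SLICE = 10

def a100_profile(compute_slices, memory_slices):
    """Return (SMs, memory GB) for a profile with the given slice counts."""
    assert 1 <= compute_slices <= COMPUTE_SLICES
    assert 1 <= memory_slices <= MEMORY_SLICES
    return compute_slices * SMS_PER_SLICE, memory_slices * GB_PER_SLICE

print(a100_profile(1, 1))  # 1g.10gb -> (14, 10)
print(a100_profile(3, 4))  # 3g.40gb -> (42, 40)
print(a100_profile(7, 8))  # 7g.80gb -> (98, 80)
```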
Creating and Managing MIG Instances
Command-Line Workflow
# Step 1: Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1
# Note: This may require rebooting or resetting the GPU driver
# Verify MIG is enabled
nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv
# Expected output: mig.mode.current
# Enabled
# Step 2: List available GPU Instance profiles
nvidia-smi mig -i 0 -lgip
# Output shows available profiles and how many can be created
# Step 3: Create GPU Instances
# Create 7x 1g.10gb instances (maximum subdivision)
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19
# Profile ID 19 = 1g.10gb on A100
# Or create 2x 3g.40gb instances
nvidia-smi mig -i 0 -cgi 9,9
# Profile ID 9 = 3g.40gb on A100
# Or create a mixed configuration: 1x 3g.40gb + 2x 2g.20gb
nvidia-smi mig -i 0 -cgi 9,14,14
# Step 4: Create Compute Instances within each GPU Instance
# List GPU Instances
nvidia-smi mig -i 0 -lgi
# Create a CI in each GI (using the default profile)
nvidia-smi mig -i 0 -gi 0 -cci
nvidia-smi mig -i 0 -gi 1 -cci
nvidia-smi mig -i 0 -gi 2 -cci
# Step 5: Verify the configuration
nvidia-smi
# Shows each MIG instance as a separate GPU device
Programmatic Instance Management
import subprocess

class MIGManager:
    """Manage MIG instances programmatically (wraps nvidia-smi)."""

    def __init__(self, gpu_index=0):
        self.gpu_index = gpu_index

    def enable_mig(self):
        """Enable MIG mode on the GPU."""
        result = subprocess.run(
            ['sudo', 'nvidia-smi', '-i', str(self.gpu_index),
             '-mig', '1'],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"Failed to enable MIG: {result.stderr}")
        print("MIG enabled. GPU reset may be required.")

    def disable_mig(self):
        """Disable MIG mode (destroys all instances first)."""
        self.destroy_all()
        result = subprocess.run(
            ['sudo', 'nvidia-smi', '-i', str(self.gpu_index),
             '-mig', '0'],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"Failed to disable MIG: {result.stderr}")
        print("MIG disabled.")

    def list_profiles(self):
        """List available GPU Instance profiles."""
        result = subprocess.run(
            ['nvidia-smi', 'mig', '-i', str(self.gpu_index), '-lgip'],
            capture_output=True, text=True
        )
        print(result.stdout)
        return result.stdout

    def create_instances(self, profile_ids):
        """Create GPU Instances with the given profile IDs.

        Args:
            profile_ids: List of profile IDs (e.g., [9, 14, 14]
                for 1x 3g.40gb + 2x 2g.20gb on A100).
        """
        profile_str = ','.join(str(p) for p in profile_ids)
        result = subprocess.run(
            ['sudo', 'nvidia-smi', 'mig', '-i', str(self.gpu_index),
             '-cgi', profile_str],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"Failed to create GIs: {result.stderr}")
        gi_result = subprocess.run(
            ['nvidia-smi', 'mig', '-i', str(self.gpu_index), '-lgi'],
            capture_output=True, text=True
        )
        print(f"Created GPU Instances: {profile_str}")
        print(gi_result.stdout)

    def create_compute_instances(self):
        """Create default compute instances in all GPU Instances."""
        result = subprocess.run(
            ['nvidia-smi', 'mig', '-i', str(self.gpu_index), '-lgi'],
            capture_output=True, text=True
        )
        # Parse GI IDs from the listing and create a default CI in each
        for line in result.stdout.strip().split('\n'):
            parts = line.split()
            if len(parts) >= 2 and parts[0].isdigit():
                gi_id = parts[0]
                subprocess.run(
                    ['sudo', 'nvidia-smi', 'mig', '-i',
                     str(self.gpu_index), '-gi', gi_id, '-cci'],
                    capture_output=True, text=True
                )
                print(f"Created CI in GI {gi_id}")

    def destroy_all(self):
        """Destroy all MIG instances (CIs must go before GIs)."""
        subprocess.run(
            ['sudo', 'nvidia-smi', 'mig', '-i', str(self.gpu_index),
             '-dci'],
            capture_output=True, text=True
        )
        subprocess.run(
            ['sudo', 'nvidia-smi', 'mig', '-i', str(self.gpu_index),
             '-dgi'],
            capture_output=True, text=True
        )
        print("All MIG instances destroyed.")

    def list_devices(self):
        """List all devices (including MIG instances) visible to the driver."""
        result = subprocess.run(['nvidia-smi', '-L'],
                                capture_output=True, text=True)
        print(result.stdout)
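On a MIG-enabled node, `nvidia-smi -L` lists each instance with a `MIG-...` UUID that you feed into CUDA_VISIBLE_DEVICES or Docker. A small parser for that output (a sketch; `parse_mig_uuids` is an illustrative helper, and the sample text mirrors the R470+ format, which can vary across driver versions):

```python
import re

def parse_mig_uuids(nvidia_smi_l_output):
    """Extract MIG device UUIDs from `nvidia-smi -L` output."""
    # Instance lines look like:
    #   MIG 1g.10gb Device 0: (UUID: MIG-xxxxxxxx-...)
    return re.findall(r'\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)', nvidia_smi_l_output)

# Fabricated sample output for illustration (UUIDs are placeholders).
sample = """\
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 1g.10gb Device 0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
  MIG 1g.10gb Device 1: (UUID: MIG-ffffffff-0000-1111-2222-333333333333)
"""
print(parse_mig_uuids(sample))  # two MIG UUIDs; the parent GPU UUID is skipped
```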
Deploying Models on MIG Instances
CUDA Device Selection
Each MIG instance appears as a separate CUDA device. Use the CUDA_VISIBLE_DEVICES environment variable to target a specific instance.
import os
import torch

def deploy_model_on_mig(mig_uuid, model_path):
    """Deploy a model on a specific MIG instance.

    MIG instances have UUIDs like:
        MIG-GPU-<uuid>/7/0  (GPU Instance 7, Compute Instance 0)
    or, on newer drivers:
        MIG-<uuid>
    """
    # Restrict CUDA to the target MIG instance. This must happen
    # before the CUDA context is created.
    os.environ['CUDA_VISIBLE_DEVICES'] = mig_uuid
    # torch.cuda.device(0) now refers to this MIG instance
    device = torch.device('cuda:0')
    print(f"Using MIG device: {torch.cuda.get_device_name(0)}")
    props = torch.cuda.get_device_properties(0)
    print(f"Memory: {props.total_memory / 1e9:.1f} GB")
    # Load model
    model = torch.load(model_path, map_location=device)
    model.eval()
    return model
def multi_model_mig_deployment():
    """Deploy multiple models on different MIG instances."""
    # Example: A100-80GB with 7x 1g.10gb instances.
    # Deploy 7 different small models, one per instance.
    mig_devices = [
        "MIG-GPU-xxxxx/1/0",  # Instance 1
        "MIG-GPU-xxxxx/2/0",  # Instance 2
        "MIG-GPU-xxxxx/3/0",  # Instance 3
        # ... etc
    ]
    models = [
        ("sentiment-model", "models/sentiment.pt"),
        ("ner-model", "models/ner.pt"),
        ("translation-model", "models/translation.pt"),
        ("summarization-model", "models/summarize.pt"),
        ("qa-model", "models/qa.pt"),
        ("embedding-model", "models/embedding.pt"),
        ("classification-model", "models/classify.pt"),
    ]
    for (name, path), device_uuid in zip(models, mig_devices):
        print(f"Deploying {name} on {device_uuid}")
        # In practice, each model runs in a separate process
        # with CUDA_VISIBLE_DEVICES set to the MIG UUID.
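The per-process pattern can be sketched as follows: each worker is launched with CUDA_VISIBLE_DEVICES pinned to one MIG UUID, so the framework inside sees exactly one device. (`run_worker` and the UUID are placeholders; a long-lived server would use `subprocess.Popen` instead of `run`, which is used here so the sketch works without a GPU.)

```python
import os
import subprocess
import sys

def run_worker(mig_uuid, argv):
    """Run a Python worker pinned to one MIG instance via its UUID."""
    env = os.environ.copy()
    env['CUDA_VISIBLE_DEVICES'] = mig_uuid  # child sees only this instance
    result = subprocess.run([sys.executable] + argv,
                            env=env, capture_output=True, text=True)
    return result.stdout.strip()

# Demo without a GPU: the child just echoes the device it was pinned to.
echoed = run_worker(
    'MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee',  # placeholder UUID
    ['-c', "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
)
print(echoed)
```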
Docker and Kubernetes Integration
# Docker: specify MIG device
docker run --gpus '"device=MIG-GPU-xxxx/1/0"' \
my-inference-image python serve.py
# Kubernetes: use NVIDIA device plugin with MIG support
# nvidia-device-plugin-daemonset.yaml (excerpt)
# spec:
# containers:
# - name: nvidia-device-plugin
# image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
# args:
# - "--mig-strategy=mixed" # or "single"
def kubernetes_mig_resource_example():
    """Kubernetes pod spec requesting a MIG instance."""
    pod_spec = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "inference-server"},
        "spec": {
            "containers": [{
                "name": "model-server",
                "image": "my-inference:latest",
                "resources": {
                    "limits": {
                        # Request a specific MIG profile:
                        # nvidia.com/mig-1g.10gb: "1"  # 1/7 GPU
                        # nvidia.com/mig-2g.20gb: "1"  # 2/7 GPU
                        # nvidia.com/mig-3g.40gb: "1"  # 3/7 GPU
                        "nvidia.com/mig-1g.10gb": "1"
                    }
                }
            }]
        }
    }
    return pod_spec
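A recurring chore when writing these specs is mapping a model's memory footprint to the smallest MIG resource that fits. An illustrative helper (the profile list matches the A100 table above; `smallest_profile` is a hypothetical name, and the 20% headroom factor for activations and KV cache is an assumption, not a rule):

```python
# A100-80GB MIG profiles as k8s resource names, smallest to largest.
A100_PROFILES = [
    ("nvidia.com/mig-1g.10gb", 10),
    ("nvidia.com/mig-2g.20gb", 20),
    ("nvidia.com/mig-3g.40gb", 40),
    ("nvidia.com/mig-7g.80gb", 80),
]

def smallest_profile(model_gb, headroom=1.2):
    """Pick the smallest MIG resource that fits weights plus headroom."""
    need = model_gb * headroom  # assumed margin for activations/KV cache
    for resource, mem_gb in A100_PROFILES:
        if mem_gb >= need:
            return resource
    return None  # does not fit any MIG instance; needs full/multi-GPU

print(smallest_profile(4))    # small model -> 1g.10gb
print(smallest_profile(14))   # 7B FP16 -> 2g.20gb
print(smallest_profile(140))  # 70B FP16 -> None
```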
Performance Characteristics
Compute Scaling
MIG instances provide linear compute scaling relative to their SM count:
def mig_compute_scaling():
    """Analyze compute scaling across A100 MIG profiles."""
    # A100 profiles: SMs and their fraction of the 108 total
    a100_total_sms = 108
    profiles = {
        "1g.10gb": {"sms": 14, "mem_gb": 10},
        "2g.20gb": {"sms": 28, "mem_gb": 20},
        "3g.40gb": {"sms": 42, "mem_gb": 40},
        "4g.40gb": {"sms": 56, "mem_gb": 40},
        "7g.80gb": {"sms": 98, "mem_gb": 80},
    }
    a100_fp16_tflops = 312  # full GPU, FP16 Tensor Core (dense)
    print(f"{'Profile':<12} {'SMs':<6} {'Fraction':<10} "
          f"{'FP16 TFLOPS':<15} {'Memory':<10}")
    for name, spec in profiles.items():
        fraction = spec['sms'] / a100_total_sms
        tflops = a100_fp16_tflops * fraction
        print(f"{name:<12} {spec['sms']:<6} {fraction*100:>5.1f}%"
              f" {tflops:>8.1f}"
              f" {spec['mem_gb']:>3d} GB")

mig_compute_scaling()
Memory Bandwidth Scaling
Each MIG instance gets dedicated memory controllers, so bandwidth scales linearly:
A100-80GB MIG Instance Performance (Measured)
| Profile | SMs | HBM BW (GB/s) | FP16 TFLOPS | Decode tok/s (7B FP16) |
|---|---|---|---|---|
| 1g.10gb (1/7) | 14 | 286 | 40.5 | ~20 |
| 2g.20gb (2/7) | 28 | 571 | 81.0 | ~40 |
| 3g.40gb (3/7) | 42 | 857 | 121.5 | ~60 |
| 4g.40gb (4/7) | 56 | 1143 | 162.0 | ~80 |
| 7g.80gb (full) | 98 | 2039 | 312.0 | ~145 |
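The decode column tracks the bandwidth column almost exactly: single-stream decode is memory-bandwidth-bound, so throughput is roughly bandwidth divided by the bytes of weights read per token. A back-of-envelope check against the table (a sketch that ignores KV-cache reads and kernel launch overheads):

```python
def decode_tok_per_s(bandwidth_gb_s, model_params_b, bytes_per_param=2):
    """Bandwidth-bound decode estimate: each token reads all weights once."""
    model_gb = model_params_b * bytes_per_param  # 7B at FP16 -> 14 GB
    return bandwidth_gb_s / model_gb

# Bandwidth figures from the A100 table above.
for name, bw in [("1g.10gb", 286), ("2g.20gb", 571),
                 ("3g.40gb", 857), ("7g.80gb", 2039)]:
    print(f"{name}: ~{decode_tok_per_s(bw, 7):.0f} tok/s")
```

The estimates land within a few percent of the measured column, which is the signature of a bandwidth-bound workload.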
(Chart: MIG Instance Performance Scaling, A100-80GB, Llama-2 7B decode throughput in tokens/sec)
Isolation Quality
def mig_isolation_test():
    """Demonstrate that MIG provides true performance isolation."""
    # Test: run a stress kernel on one MIG instance and measure
    # decode throughput on an adjacent instance.
    #
    # Without MIG (time-sharing): a stress kernel on the GPU causes a
    # 50-80% slowdown for other workloads.
    # With MIG: a stress kernel on instance A has 0% impact on
    # instance B, because they use physically separate SMs, memory,
    # and cache.
    isolation_results = {
        "Time-sharing (no MIG)": {
            "neighbor_idle": "145 tok/s",
            "neighbor_stress": "45 tok/s",
            "degradation": "69%",
        },
        "MPS (Multi-Process Service)": {
            "neighbor_idle": "145 tok/s",
            "neighbor_stress": "92 tok/s",
            "degradation": "37%",
        },
        "MIG (3g.40gb + 3g.40gb)": {
            "neighbor_idle": "60 tok/s",
            "neighbor_stress": "60 tok/s",
            "degradation": "0%",
        },
    }
    print(f"{'Method':<30} {'Idle Neighbor':<18} "
          f"{'Stressed Neighbor':<18} {'Degradation'}")
    for method, results in isolation_results.items():
        print(f"{method:<30} {results['neighbor_idle']:<18} "
              f"{results['neighbor_stress']:<18} "
              f"{results['degradation']}")

mig_isolation_test()
MIG isolation is enforced by the GPU hardware, not software. An instance cannot access another instance’s memory (hardware page table enforcement), cannot use another instance’s SMs (hardware scheduling), and cannot interfere with another instance’s memory bandwidth (dedicated memory controllers). This is the same level of isolation as separate physical GPUs.
Limitations
What MIG Cannot Do
def mig_limitations():
    """Document MIG limitations."""
    limitations = {
        "No cross-instance communication": {
            "detail": "MIG instances cannot share memory or "
                      "communicate directly. No equivalent of NVLink "
                      "between instances. Cannot do tensor parallelism "
                      "across MIG instances on the same GPU.",
        },
        "No dynamic resizing": {
            "detail": "Changing the MIG configuration requires destroying "
                      "all instances and recreating them. This means "
                      "GPU downtime (~30 seconds). Cannot dynamically "
                      "grow an instance when load increases.",
        },
        "GPU feature restrictions per instance": {
            "detail": "Some GPU features are not available per instance: "
                      "GPU performance counters (Nsight Compute), "
                      "the CUDA debugger, some video codec instances. "
                      "Profiling requires disabling MIG or using "
                      "the full 7g instance.",
        },
        "Data center GPUs only": {
            "detail": "MIG is not available on consumer GPUs (RTX), "
                      "T4, V100, or older data center GPUs. It requires "
                      "an Ampere or newer data center SKU (A30, A100, "
                      "H100, and successors).",
        },
        "Minimum instance size": {
            "detail": "The smallest instance (1g.10gb) still has "
                      "14 SMs and 10 GB memory. For very small models "
                      "(under 1 GB), this wastes most of the instance. "
                      "MIG is best suited for models that use 5-70 GB.",
        },
        "No graphics/display support": {
            "detail": "MIG instances cannot run graphics workloads. "
                      "Only compute (CUDA) and video encode/decode.",
        },
    }
    for name, info in limitations.items():
        print(f"\n{name}:")
        print(f"  {info['detail']}")

mig_limitations()
Cost-Benefit Analysis
def mig_cost_analysis():
    """Analyze when MIG is cost-effective vs dedicated GPUs."""
    # Scenario: serving multiple small LLMs at $3/hr per A100
    scenarios = {
        "7 x small models (under 2B params, under 5GB each)": {
            "without_mig": {
                "gpus": 7,
                "cost_hr": 7 * 3.0,
                "utilization": "10-15% each",
            },
            "with_mig": {
                "gpus": 1,
                "cost_hr": 1 * 3.0,
                "config": "7x 1g.10gb",
                "utilization": "70-100% (7 instances)",
            },
            "savings": "7x cost reduction",
        },
        "3 x medium models (7B params, ~14GB each)": {
            "without_mig": {
                "gpus": 3,
                "cost_hr": 3 * 3.0,
            },
            "with_mig": {
                "gpus": "Not feasible (14GB > 10GB per 1g instance)",
                "config": "Need 2g.20gb or larger",
                "alternative": "2x A100 with 3g.40gb + 2g.20gb each",
            },
            "savings": "~1.5x cost reduction (2 GPUs instead of 3)",
        },
        "1 x large model (70B params, 140GB FP16)": {
            "without_mig": {
                "gpus": "2x A100 (tensor parallel)",
                "cost_hr": 2 * 3.0,
            },
            "with_mig": {
                "gpus": "Not applicable (model does not fit in any "
                        "MIG instance; needs full GPU or multi-GPU)",
            },
            "savings": "MIG not useful for large models",
        },
    }
    for scenario_name, details in scenarios.items():
        print(f"\n=== {scenario_name} ===")
        for key, val in details.items():
            if isinstance(val, dict):
                print(f"  {key}:")
                for k2, v2 in val.items():
                    print(f"    {k2}: {v2}")
            else:
                print(f"  {key}: {val}")

mig_cost_analysis()
MIG Cost-Effectiveness by Workload
| Workload | Without MIG | With MIG | Cost Savings |
|---|---|---|---|
| 7 small models (under 2B) | 7 GPUs at $21/hr | 1 GPU at $3/hr | 7x |
| 3 medium models (7B) | 3 GPUs at $9/hr | 1-2 GPUs at $3-6/hr | 1.5-3x |
| Dev/test (multiple developers) | 1 GPU per developer | 1 GPU shared via MIG | 3-7x |
| 1 large model (70B+) | 2+ GPUs (TP) | Not applicable | None (instances too small) |
| Batch processing (GPU fully utilized) | 1 GPU | 1 GPU (MIG adds overhead) | Negative |
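The first row of the table is simple division: GPUs needed with MIG is ceil(models / instances per GPU), provided every model fits in an instance. A sketch of that arithmetic (the $3/hr A100 price is the same assumption used throughout this post; function names are illustrative):

```python
import math

def gpus_needed(n_models, instances_per_gpu=7):
    """GPUs required when packing one model per MIG instance."""
    return math.ceil(n_models / instances_per_gpu)

def hourly_cost(n_models, gpu_cost_hr=3.0, instances_per_gpu=7):
    """Hourly cost (one dedicated GPU per model) vs MIG packing."""
    without_mig = n_models * gpu_cost_hr
    with_mig = gpus_needed(n_models, instances_per_gpu) * gpu_cost_hr
    return without_mig, with_mig

print(hourly_cost(7))   # (21.0, 3.0): the 7x case from the table
print(hourly_cost(10))  # (30.0, 6.0): a second GPU once past 7 models
```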
MIG vs Alternatives
Comparison with Other GPU Sharing Methods
GPU Sharing Methods Comparison
| Method | Isolation | Overhead | Granularity | Use Case |
|---|---|---|---|---|
| MIG | Hardware (full) | 0% (dedicated resources) | 1/7 to full GPU | Production multi-tenant |
| MPS (Multi-Process Service) | None (shared SMs) | 5-10% context switch | Any % | Cooperative workloads |
| Time-sharing (CUDA default) | None (sequential) | Context switch overhead | 100% (one at a time) | Simple cases |
| vGPU (NVIDIA GRID) | Software (driver-level) | 5-15% | Configurable | VDI, virtual machines |
| Kubernetes GPU scheduling | None (1 pod per GPU) | 0% | 1 GPU per pod | Simple scheduling |
Use MIG when you need guaranteed performance isolation between tenants (e.g., SLA-bound production serving, multi-customer platforms, or security-sensitive workloads where one tenant must not be able to observe another’s memory).
Use MPS when you control all workloads on the GPU and need finer-grained sharing (e.g., running 20 small inference requests concurrently that each use 5% of the GPU).
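The decision rule in the two paragraphs above can be written down as a small helper (illustrative only: the function name and the priority order are my framing of the guidance, not an official policy):

```python
def pick_sharing_method(needs_isolation, trusted_workloads, model_fits_in_mig):
    """Illustrative decision rule for choosing a GPU sharing method."""
    if needs_isolation and model_fits_in_mig:
        return "MIG"            # hardware isolation, guaranteed QoS
    if needs_isolation:
        return "dedicated GPU"  # model too large for any MIG instance
    if trusted_workloads:
        return "MPS"            # finer-grained sharing, no isolation
    return "time-sharing"       # default CUDA behavior

print(pick_sharing_method(True, False, True))    # MIG
print(pick_sharing_method(True, False, False))   # dedicated GPU
print(pick_sharing_method(False, True, True))    # MPS
```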
Summary
MIG partitions a single A100 or H100 into up to 7 hardware-isolated instances, each with dedicated compute, memory, and bandwidth. The isolation is enforced by GPU hardware — equivalent to separate physical GPUs. MIG is cost-effective for multi-tenant inference serving (7x cost reduction for small models), dev/test environments (share one GPU across a team), and any scenario where a single model under-utilizes a full GPU.
The limitations are clear: no cross-instance communication (no tensor parallelism within one GPU’s MIG instances), no dynamic resizing (reconfiguration requires instance destruction), and minimum instance size of 10 GB / 14 SMs. MIG is not useful for large models (70B+) that need the full GPU or multiple GPUs. For workloads that fully utilize a GPU, MIG adds complexity without benefit. The sweet spot is serving multiple 1-13B parameter models on a single GPU with guaranteed per-tenant performance isolation.