Sutskever 30 Deep Dives • Paper #9

GPipe

Training Giant Neural Networks

January 31, 2026 • 8 min read

Your model has 500 million parameters. Your GPU has 16GB of memory. Training the model needs 40GB. Before GPipe, you were stuck. This paper introduces pipeline parallelism: split the model across devices and keep them all busy.

GPipe partitions a model across K devices. Data flows through the pipeline as micro-batches, keeping all devices active.

Why Sutskever Included This

Model scale drives capability. GPT-3 has 175 billion parameters. Training it requires distributing computation across many devices. GPipe introduced the pipeline parallelism approach that made such models trainable. The techniques here underpin modern large language model infrastructure.

The Memory Problem

Training consumes far more memory than inference. Beyond the weights, you hold gradients and optimizer state for every parameter, plus the activations from the forward pass that the backward pass needs. For large models, the activations alone can exceed GPU memory.
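A rough back-of-the-envelope sketch makes the gap concrete. The numbers below are illustrative assumptions, not figures from the paper: fp32 weights and an Adam-style optimizer, with activations left as a hand-wave because they depend on batch size and architecture.

# Illustrative training-memory arithmetic (assumptions: fp32 everywhere,
# Adam-style optimizer with two moment buffers per parameter).
params = 500e6                    # the 500M-parameter model from the intro
weights   = params * 4            # 4 bytes per fp32 weight
gradients = params * 4            # one fp32 gradient per weight
optimizer = params * 8            # Adam: two fp32 moments per weight
static_gb = (weights + gradients + optimizer) / 1e9
print(f"weights + gradients + optimizer state: {static_gb:.0f} GB")  # ~8 GB
# Activations come on top of this and scale with batch size and depth;
# together they push a 500M-parameter model well past a 16 GB card.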

Data parallelism (replicating the full model on each device) doesn't help. The model still needs to fit on each device. You need a way to split the model itself.

Pipeline Parallelism

Divide the model into K sequential segments. Assign each segment to a different device. Data flows through: Device 1 processes layers 1-10, sends activations to Device 2, which processes layers 11-20, and so on.

The problem: with one batch, most devices sit idle. Device 2 waits for Device 1 to finish. Device 3 waits for Device 2. During the backward pass, the same happens in reverse. Most time is wasted.

Single batch across 4 devices:
Device 1: [WORK][-idle-][-idle-][-idle-][-idle-][-idle-][-idle-][BACK]
Device 2: [-idle-][WORK][-idle-][-idle-][-idle-][-idle-][BACK][-idle-]
Device 3: [-idle-][-idle-][WORK][-idle-][-idle-][BACK][-idle-][-idle-]
Device 4: [-idle-][-idle-][-idle-][WORK][BACK][-idle-][-idle-][-idle-]
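Here is a minimal PyTorch-style sketch of the naive setup above. The helper names (partition, naive_pipeline_forward) are made up for illustration, not GPipe's actual API: the model is split into K consecutive stages, each stage lives on its own GPU, and a single batch walks through them one stage at a time.

import torch.nn as nn

def partition(layers, num_stages):
    # Split a list of layers into num_stages consecutive segments,
    # placing segment k on device cuda:k.
    per_stage = len(layers) // num_stages
    stages = []
    for k in range(num_stages):
        start = k * per_stage
        end = (k + 1) * per_stage if k < num_stages - 1 else len(layers)
        stages.append(nn.Sequential(*layers[start:end]).to(f"cuda:{k}"))
    return stages

def naive_pipeline_forward(stages, x):
    # One batch flows stage by stage: while stage k computes,
    # every other device sits idle (the diagram above).
    for k, stage in enumerate(stages):
        x = stage(x.to(f"cuda:{k}"))
    return x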

Micro-Batching

Split each mini-batch into M smaller micro-batches. Feed them through the pipeline in sequence. While Device 2 processes micro-batch 1, Device 1 processes micro-batch 2.

The pipeline fills up. Devices work in parallel on different micro-batches. The bubble (idle time at start and end) shrinks relative to useful work.

Bubble fraction = (K - 1) / (K - 1 + M)

With 4 devices and 16 micro-batches: bubble = 3/19 ≈ 16%. The pipeline runs at 84% efficiency.
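The formula is simple enough to play with directly. The helper below just evaluates it for a few (K, M) pairs, including the single-batch case and the example above.

def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # Idle fraction of an ideal pipeline: (K - 1) / (K - 1 + M).
    return (num_stages - 1) / (num_stages - 1 + num_microbatches)

print(bubble_fraction(4, 1))    # 0.75  -> single batch: 75% idle
print(bubble_fraction(4, 16))   # ~0.16 -> the example above, ~84% busy
print(bubble_fraction(8, 32))   # ~0.18 -> more stages need more micro-batches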

F-then-B Schedule

GPipe uses a specific execution order. All M micro-batches complete their forward passes through the entire pipeline. Then all M micro-batches complete their backward passes in reverse order.

This simplifies implementation. Each device finishes all forward work before switching to backward work. The alternative (interleaving forward and backward for each micro-batch) requires more complex scheduling.
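A single-device caricature of the ordering, in plain PyTorch with a made-up helper name: chunk the mini-batch into micro-batches, run every forward pass first while keeping each loss (and its graph) alive, then run the backward passes in reverse. Real GPipe does this across K devices with activation transfers between stages; the gradient accumulation works the same way.

def f_then_b_step(model, loss_fn, optimizer, inputs, targets, num_microbatches):
    micro_x = inputs.chunk(num_microbatches)
    micro_y = targets.chunk(num_microbatches)
    optimizer.zero_grad()
    # Forward phase: run every micro-batch, keep its loss (and graph) around.
    # Dividing by M keeps the accumulated gradient equal to the full-batch mean,
    # assuming loss_fn averages over its micro-batch.
    losses = [loss_fn(model(x), y) / num_microbatches
              for x, y in zip(micro_x, micro_y)]
    # Backward phase: reverse order, gradients accumulate into .grad buffers.
    for loss in reversed(losses):
        loss.backward()
    optimizer.step()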

Re-materialization

Storing activations for M micro-batches across K partitions eats memory. GPipe trades computation for memory: store only the activations at partition boundaries. During the backward pass, recompute the intermediate activations within each partition.

This technique (also called gradient checkpointing) existed before GPipe, but it is essential to making pipeline parallelism practical. Activation memory then scales with the partition boundaries plus one micro-batch's activations inside a single partition, rather than with every layer's activations for the whole batch.
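In PyTorch the same idea is exposed as torch.utils.checkpoint. The sketch below (assuming a reasonably recent PyTorch) wraps each pipeline stage, such as the stages list from the earlier sketch, so that only each stage's input is saved and the activations inside the stage are recomputed when the backward pass reaches it.

from torch.utils.checkpoint import checkpoint

def rematerialized_forward(stages, x):
    # Only the input to each stage (the partition-boundary activation) is
    # stored; everything inside the stage is recomputed during backward.
    for stage in stages:
        x = checkpoint(stage, x, use_reentrant=False)
    return x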

Practical Tradeoffs

More micro-batches shrink the bubble, but each micro-batch gets smaller, which under-fills the hardware and gives normalization layers less data per micro-batch. The paper suggests M ≈ 4K as a reasonable balance.

More partitions allow larger models but increase bubble fraction. You want enough partitions to fit the model, not more.

Re-materialization adds ~25% computation overhead. Worth it when memory is the constraint.

What GPipe Enabled

The paper trained AmoebaNet-B with 557 million parameters on ImageNet, achieving state-of-the-art accuracy in 2018. More significantly, it demonstrated that pipeline parallelism scales to models that can't fit on single devices.

Modern systems combine pipeline parallelism with data parallelism and tensor parallelism. GPipe established one leg of this three-legged approach to training massive models.

Limitations

Pipeline parallelism works for models with sequential structure. Highly branched architectures don't partition cleanly. The technique also introduces latency: results take time to flow through the pipeline.

Synchronous training (which GPipe uses) blocks until all devices complete. Asynchronous variants exist but introduce optimization challenges.

Connection to Modern Training

The systems training GPT-4 and Claude use pipeline parallelism alongside other techniques. Megatron-LM, DeepSpeed, and other frameworks build on GPipe's core ideas.

Paper #22 in this series (Scaling Laws) explores why we want such large models. GPipe explains how to train them.


Part of a series on Ilya Sutskever's recommended 30 papers, connecting each to practical AI development.
