Sutskever 30 Deep Dives • Paper #11

Dilated Convolutions

Multi-Scale Context Without Pooling

January 31, 2026 • 6 min read

Standard convolutions see small neighborhoods. Pooling expands the view but loses resolution. For tasks like semantic segmentation, where you need both wide context and pixel-precise output, neither works well. Dilated convolutions give you both.

Figure: dilated convolutions expand the receptive field through sparse sampling; increasing dilation rates produce exponentially growing receptive fields while maintaining output resolution.

Why Sutskever Included This

Dense prediction tasks (segmentation, depth estimation, optical flow) require understanding global context while producing outputs at input resolution. Dilated convolutions solve this elegantly. The technique appears in WaveNet for audio, DeepLab for segmentation, and many modern architectures.

The Pooling Problem

Classification networks downsample aggressively. After several pooling layers, a 224×224 image becomes 7×7 feature maps. Global context is captured, but spatial detail is gone.

For segmentation, you need to label every pixel. Upsampling from 7×7 back to 224×224 can't recover the lost detail. Skip connections help but don't fully solve the problem.

Convolutions with Gaps

A dilated convolution spaces out its kernel taps. A 3×3 kernel with dilation 2 samples positions 0, 2, and 4 in each dimension instead of 0, 1, and 2. The kernel covers a 5×5 area but uses only 9 parameters.
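The sampling pattern is easiest to see in one dimension. This is a minimal sketch (the function name and the toy inputs are illustrative, not from any library): a 3-tap kernel that reads inputs spaced `dilation` apart.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """1-D dilated convolution with valid padding: the kernel taps x at
    positions spaced `dilation` apart instead of at adjacent positions."""
    k = len(w)
    span = (k - 1) * dilation + 1            # area one application covers
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10.0)                          # [0, 1, ..., 9]
w = np.array([1.0, 1.0, 1.0])                # 3-tap summing kernel
print(dilated_conv1d(x, w, dilation=2))      # taps x[i], x[i+2], x[i+4]
```

With dilation 2, each output sums three inputs spanning five positions, so the first output is 0 + 2 + 4 = 6 even though the kernel has only three weights.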

Receptive field = (kernel_size - 1) × dilation + 1

Dilation 1: 3×3 kernel → 3×3 receptive field
Dilation 2: 3×3 kernel → 5×5 receptive field
Dilation 4: 3×3 kernel → 9×9 receptive field

Stack layers with dilations 1, 2, 4, 8. The receptive field grows exponentially while parameters stay constant. No downsampling needed.
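The per-layer formula above extends to a stack: each stride-1 layer adds (kernel_size − 1) × dilation positions to the field seen by the layer before it. A short sketch (the function name is illustrative):

```python
def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions:
    each layer adds (kernel_size - 1) * dilation to the running field."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(stacked_receptive_field(3, [1]))            # 3
print(stacked_receptive_field(3, [1, 2, 4, 8]))   # 31
print(stacked_receptive_field(3, [1, 2, 4, 8, 16]))  # 63
```

Four 3×3 layers with doubling dilations see 31 positions; each extra layer roughly doubles the field, while every layer still has only 9 weights.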

Multi-Scale Processing

Each dilation rate captures context at a different scale. Dilation 1 sees local texture. Dilation 4 sees object parts. Dilation 16 sees scene layout. The network aggregates all scales simultaneously.

Contrast with pooling pyramids, which process scales sequentially and require upsampling to combine them. Dilated convolutions maintain full resolution throughout.
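Parallel branches at different dilation rates can be sketched in PyTorch, in the spirit of DeepLab's atrous spatial pyramid pooling. The class name, rates, and channel sizes here are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel 3x3 convs at several dilation rates, concatenated.
    Setting padding equal to dilation keeps the spatial size unchanged,
    so the branch outputs can be concatenated along the channel axis."""
    def __init__(self, channels, rates=(1, 4, 16)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, dilation=r, padding=r)
            for r in rates
        )

    def forward(self, x):
        # Every branch preserves full resolution: no upsampling needed.
        return torch.cat([b(x) for b in self.branches], dim=1)

block = MultiScaleBlock(channels=8)
y = block(torch.randn(1, 8, 32, 32))
print(y.shape)  # (1, 24, 32, 32): three scales, same spatial size
```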

Applications

Semantic segmentation: DeepLab uses dilated convolutions (called atrous convolutions) to achieve state-of-the-art results. The network classifies every pixel using wide context without losing spatial precision.

Audio generation: WaveNet stacks dilated causal convolutions to model long audio sequences. Dilation grows exponentially across layers, giving the model context over thousands of timesteps.
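The context length is easy to compute. As a sketch assuming WaveNet's published setup of kernel-size-2 causal convolutions with dilations doubling from 1 to 512 (the function name and block count are illustrative):

```python
def causal_context(kernel_size, dilations):
    """Number of timesteps visible to one output of a stack of stride-1
    causal dilated convolutions: 1 + sum((kernel_size - 1) * d)."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

block = [2 ** i for i in range(10)]      # dilations 1, 2, 4, ..., 512
print(causal_context(2, block))          # 1024 timesteps from one block
print(causal_context(2, block * 3))      # 3070 from three stacked blocks
```

Ten layers already cover over a thousand timesteps; stacking a few such blocks reaches thousands, with parameter count growing only linearly in depth.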

Machine translation: ByteNet uses dilated convolutions for sequence-to-sequence modeling. The approach offers an alternative to recurrence with parallelizable operations.

Comparison to Other Approaches

vs. Large kernels: A 9×9 kernel has 81 parameters. A 3×3 kernel with dilation 4 has 9 parameters and the same receptive field. Dilated convolutions are more efficient.
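The trade-off reduces to the receptive-field formula from earlier. A tiny sketch (the helper name is illustrative) that returns parameter count and receptive-field side length for a 2-D kernel:

```python
def params_and_field(kernel_size, dilation):
    """2-D single-channel case: (parameter count, receptive field side)."""
    field = (kernel_size - 1) * dilation + 1
    return kernel_size ** 2, field

print(params_and_field(9, 1))  # (81, 9): dense 9x9 kernel
print(params_and_field(3, 4))  # (9, 9): same field, 9x fewer weights
```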

vs. Pooling + Upsampling: Pooling discards information permanently. Even with skip connections, fine details blur. Dilated convolutions preserve all spatial information.

vs. Attention: Attention is more flexible but O(n²) in sequence length. Dilated convolutions scale linearly and impose useful local bias.

Implementation Details

Most deep learning frameworks support dilation as a parameter in convolution layers. No special implementation required. Memory usage matches standard convolutions of the same kernel size.
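In PyTorch, for example, dilation is a single argument to `nn.Conv2d`; setting `padding` equal to the dilation rate for a 3×3 kernel keeps the output resolution equal to the input (the channel counts below are arbitrary):

```python
import torch
import torch.nn as nn

# 3x3 conv with dilation 4: 9x9 receptive field, only 9 weights per
# channel pair, and padding=dilation preserves spatial resolution.
conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                 dilation=4, padding=4)

x = torch.randn(1, 64, 56, 56)
y = conv(x)
print(y.shape)  # torch.Size([1, 64, 56, 56]): resolution preserved
```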

Watch for gridding artifacts when stacked dilation rates share a common factor: some input positions are never sampled at all. Choosing rates with no common factor, such as 1, 2, 5, 9, leaves no blind spots in the receptive field.
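The blind spots can be checked directly. For a stack of 3-tap layers, the reachable input offsets (in 1-D) are every sum of one element from {−d, 0, d} per layer; this sketch (function name illustrative) enumerates them:

```python
from itertools import product

def reachable_offsets(dilations):
    """1-D input offsets visible to a stack of 3-tap dilated convs:
    all sums of one element from {-d, 0, d} per layer."""
    return {sum(c) for c in product(*({-d, 0, d} for d in dilations))}

even_stack = reachable_offsets([2, 2, 2])    # rates share the factor 2
mixed_stack = reachable_offsets([1, 2, 5])   # no common factor

print(sorted(even_stack))   # only even offsets: odd positions are blind
print(sorted(mixed_stack))  # every offset from -8 to 8 is covered
```

Three dilation-2 layers can only ever see even offsets, the gridding artifact; the 1, 2, 5 stack covers its entire 17-position span with no gaps.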

Connection to Modern Architectures

ConvNeXt and other recent vision models incorporate dilated convolutions selectively. The technique complements attention layers, handling local processing efficiently while attention captures global dependencies.


Part of a series on Ilya Sutskever's recommended 30 papers.
