Sutskever 30 Deep Dives • Paper #15

Identity Mappings in ResNet

Pre-Activation for Deeper Networks

January 31, 2026 • 6 min read

ResNet's skip connections should let signals flow unimpeded. But the original design placed a ReLU after the addition, so the identity path was rectified at every block. Moving batch norm and ReLU before the convolutions cleared the highway. The result: trainable 1000-layer networks.

Figure: Pre-activation ResNet block. Pre-activation moves BN and ReLU before the convolution. The skip connection remains unobstructed.

Why Sutskever Included This

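Paper #10 introduced ResNet. This follow-up paper refines the architecture with a subtle but significant change: the order of normalization, activation, and convolution inside each block. The pre-activation design became a common choice for very deep networks, and the same idea reappears in Pre-LN transformers, which apply layer norm before each attention and feed-forward sublayer instead of after the residual addition.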

The Original Design

In the first ResNet paper, each block looked like:

x → Conv → BN → ReLU → Conv → BN → (+x) → ReLU → output

The final ReLU applies after adding the skip connection. This seems harmless but has consequences for gradient flow.
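As a rough sketch of that ordering, the block can be written in a few lines of NumPy. Dense matrix multiplies stand in for the convolutions and the batch norm has no learned scale or shift; both are simplifications for illustration, and the function names are hypothetical, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def batch_norm(z, eps=1e-5):
    # Normalize each feature over the batch axis (no learned scale/shift).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def post_act_block(x, w1, w2):
    # x -> Conv -> BN -> ReLU -> Conv -> BN -> (+x) -> ReLU
    # Assumption: dense matmuls replace the convolutions.
    h = relu(batch_norm(x @ w1))
    h = batch_norm(h @ w2)
    return relu(h + x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))   # batch of 8, 4 features
zero = np.zeros((4, 4))           # switch the residual branch off

y_post = post_act_block(x, zero, zero)
# Even with a zero residual branch the block returns relu(x), not x.
print(np.allclose(y_post, relu(x)))  # True
```

With the residual branch switched off, the block still clips negative inputs, which is the sense in which the shortcut is not a true identity.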

The Problem with Post-Activation

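ReLU zeros out negative values, so its derivative is zero wherever the summed input is negative. When gradients flow backward through the post-addition ReLU, those components are suppressed on the skip path as well as on the residual branch. The skip connection no longer provides a clean gradient highway.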

For moderate depths (50-150 layers), this works. For extreme depths (1000+ layers), the accumulated suppression degrades training.
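A toy check of the shortcut path alone makes the difference concrete. The sketch below sets the residual branch to zero (an assumption that isolates the skip connection and ignores BN and the convolutions entirely) and estimates gradients by finite differences; the helper names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def post_act_stack(x, depth):
    # Residual branch F is set to zero, so each block is relu(x + 0).
    for _ in range(depth):
        x = relu(x + 0.0)
    return x

def pre_act_stack(x, depth):
    # Same zero residual branch; nothing sits on the shortcut.
    for _ in range(depth):
        x = x + 0.0
    return x

def grad_of_sum(f, x, eps=1e-6):
    # Central finite differences of sum(f(x)) w.r.t. each input component.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d).sum() - f(x - d).sum()) / (2 * eps)
    return g

x = np.array([-1.0, 2.0, -3.0, 4.0])
g_post = grad_of_sum(lambda v: post_act_stack(v, 100), x)
g_pre = grad_of_sum(lambda v: pre_act_stack(v, 100), x)
# g_post is ~[0, 1, 0, 1]: negative inputs get no gradient via the shortcut.
# g_pre  is ~[1, 1, 1, 1]: the shortcut passes every component through.
```

This is only the shortcut term. In a real network the residual branches also contribute gradient, so the effect is a degradation of the direct path, not a total block.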

Pre-Activation Design

Move batch normalization and ReLU before the convolutions:

x → BN → ReLU → Conv → BN → ReLU → Conv → (+x) → output

Now the skip connection adds directly to the output with no intervening activation. Gradients flow through the identity branch without modification.
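The reordered block can be sketched the same way. As before, dense matrix multiplies stand in for the convolutions and the batch norm has no learned parameters; these are simplifications for illustration, and the function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def batch_norm(z, eps=1e-5):
    # Normalize each feature over the batch axis (no learned scale/shift).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def pre_act_block(x, w1, w2):
    # x -> BN -> ReLU -> Conv -> BN -> ReLU -> Conv -> (+x)
    # Assumption: dense matmuls replace the convolutions.
    h = relu(batch_norm(x)) @ w1
    h = relu(batch_norm(h)) @ w2
    return x + h  # nothing is applied after the addition

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))   # batch of 8, 4 features
zero = np.zeros((4, 4))           # switch the residual branch off

y_pre = pre_act_block(x, zero, zero)
# With a zero residual branch the block is an exact identity.
print(np.allclose(y_pre, x))  # True
```

Negative inputs survive the block untouched, because BN and ReLU now act only inside the residual branch.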

Mathematical Effect

With pre-activation, the block computes y = x + F(x), where F is the residual branch. The gradient therefore includes an unmodified identity term:

∂L/∂x = ∂L/∂y × (∂F/∂x + I)

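The identity matrix I passes gradients unchanged. No ReLU derivative on the skip path zeros them out. Because the same holds for every block, the gradient arriving at the output reaches each earlier layer directly, with the residual branches adding a correction on top.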
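Stacking blocks extends the single-block formula. The derivation below is a sketch under the assumption that every shortcut is an exact identity and nothing follows the addition; the block indices l and K and the script \mathcal{L} for the loss are notation introduced here for illustration.

```latex
\begin{align}
x_{l+1} &= x_l + F(x_l) \\
x_K &= x_l + \sum_{i=l}^{K-1} F(x_i) \\
\frac{\partial \mathcal{L}}{\partial x_l}
  &= \frac{\partial \mathcal{L}}{\partial x_K}
     \left( I + \frac{\partial}{\partial x_l} \sum_{i=l}^{K-1} F(x_i) \right)
\end{align}
```

Setting K = l + 1 recovers the one-block expression. For any deeper block K, the identity term carries the gradient at x_K back to x_l without passing through a weight layer or activation.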

1000-Layer Networks

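The paper trained a 1001-layer ResNet on CIFAR-10. With post-activation, the network was harder to optimize and finished with a higher test error (7.61% in the paper's results) than much shallower ResNets. With pre-activation, the same depth trained cleanly and reached 4.92%.

That result beat the shallower pre-activation networks in the same comparison, demonstrating that extreme depth doesn't hurt when gradients flow properly. In these experiments, depth stopped being the optimization bottleneck.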

Regularization Effect

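Pre-activation also improves the regularizing effect of batch norm. In the original block, BN normalizes the residual branch, but its output is then added to the shortcut, so the summed signal entering the next block is not normalized. The BN-ReLU-Conv sequence means each convolution receives normalized, rectified inputs. This stabilizes training and can reduce the need for other regularization.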

Adoption

Pre-activation ResNets became standard for applications requiring very deep networks. The pattern is mirrored in transformer design: applying layer normalization before each attention and feed-forward sublayer (Pre-LN) tends to train more stably than normalizing after the residual addition (Post-LN).
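The two transformer orderings can be sketched in the same spirit. This is a minimal illustration, assuming a layer norm without learned parameters and an arbitrary `sublayer` callable standing in for attention or the feed-forward network; the function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Normalization sits after the addition, on the residual stream itself.
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # Normalization sits inside the branch; the residual stream stays untouched.
    return x + sublayer(layer_norm(x))

x = np.array([[1.0, 2.0, 3.0, 6.0]])
zero_sublayer = lambda h: np.zeros_like(h)  # switch the sublayer off

print(np.allclose(pre_ln(x, zero_sublayer), x))   # True
print(np.allclose(post_ln(x, zero_sublayer), x))  # False
```

As with the pre-activation ResNet block, the Pre-LN form leaves an exact identity path when the sublayer contributes nothing.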

Small architectural details compound across many layers. Getting the ordering right matters.


Part of a series on Ilya Sutskever's recommended 30 papers.
