Neural networks generalize well when the weights contain less information than the training outputs. Hinton and van Camp's 1993 paper turned this observation into a training method: add noise to the weights and let the network learn how much precision each weight actually needs.
Why Sutskever Included This
This paper introduced ideas that run through modern deep learning: information-theoretic regularization and the connection between compression and generalization. Variational autoencoders and Bayesian neural networks trace back to concepts in this 1993 work.
The Minimum Description Length Principle
MDL comes from information theory. The best model minimizes the total cost of describing both the model and its errors. A model that memorizes training data needs many bits to describe its weights but few bits to describe its (zero) errors. A model that captures only the underlying pattern needs fewer bits for weights but more bits for residual errors.
The optimal model sits somewhere between these extremes. MDL gives a principled way to find that balance without manual tuning of regularization strength.
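A toy calculation makes the tradeoff concrete. The numbers below are invented for illustration, not taken from the paper:

# Two-part description length: bits(model) + bits(errors | model).
# All figures are illustrative, not from the paper.
memorizer = {"model_bits": 10_000, "error_bits": 0}    # stores every example exactly
pattern = {"model_bits": 500, "error_bits": 2_000}     # stores the rule plus residuals

for name, m in [("memorizer", memorizer), ("pattern", pattern)]:
    print(name, m["model_bits"] + m["error_bits"], "bits total")
# MDL prefers whichever sum is smaller: here, the pattern model.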
Weights as Communication
Hinton and van Camp reframe training as a communication problem. Imagine sending a neural network's weights to someone who will use them for prediction. How many bits does that transmission require?
A weight of exactly 0.847293 requires many bits to specify. A weight described as "around 0.8, give or take 0.1" requires fewer bits. If the network still performs well with the imprecise version, the extra precision was wasted.
The training objective becomes: minimize prediction error plus the bits needed to communicate the weights.
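As a minimal sketch of that objective (the names are mine, not the paper's):

# MDL training objective: total message length in bits.
def description_length(error_bits, weight_bits):
    # bits to encode the prediction errors + bits to encode the weights
    return error_bits + weight_bits
# Training minimizes this sum, trading accuracy against weight precision.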
Gaussian Noise as Compression
The method works by treating each weight as a Gaussian distribution rather than a point value. During training, the network samples from these distributions. Weights with high variance (wide distributions) need fewer bits to describe. Weights with low variance (tight distributions) need more bits.
The network learns both the mean and variance of each weight's distribution. Weights that matter for prediction get tight distributions, while unimportant weights get wide distributions and effectively disappear.
import numpy as np

# Traditional weight: one precise value
w = 0.847293

# MDL weight: a Gaussian distribution (mean + variance)
mu, sigma = 0.85, 0.1                  # sigma**2 = 0.01
w_noisy = np.random.normal(mu, sigma)  # sampled fresh on each forward pass

# Bits to communicate scale roughly as -log(sigma):
# higher variance = fewer bits = simpler model
bits = -np.log2(sigma)
Why This Works
Networks that memorize training data need precise weights to encode specific examples. Networks that learn general patterns can tolerate imprecision because the pattern, not the exact numbers, drives prediction.
By penalizing weight precision, the method forces the network toward general patterns. Overfitting becomes expensive in the objective function, not just something to avoid through early stopping or held-out validation.
Connection to Weight Decay
Traditional L2 regularization (weight decay) penalizes large weights. MDL regularization penalizes precise weights. These sound similar but differ in important ways.
Weight decay pushes all weights toward zero with equal pressure. MDL lets each weight find its own level of necessary precision. A weight that needs to be exactly 5.0 can be exactly 5.0. A weight that just needs to be "positive" can have high variance around any positive mean.
MDL adapts to the structure of the problem rather than applying uniform shrinkage.
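A sketch of the difference, assuming a single zero-mean Gaussian prior (the paper also considers more flexible priors) and measuring the communication cost as the standard Gaussian KL divergence:

import numpy as np

def l2_penalty(mu, lam=0.01):
    # Weight decay: cost depends only on the weight's magnitude.
    return lam * mu**2

def mdl_penalty(mu, sigma, prior_sigma=1.0):
    # KL from posterior N(mu, sigma^2) to prior N(0, prior_sigma^2):
    # the "bits" a weight costs under the bits-back argument.
    return (np.log(prior_sigma / sigma)
            + (sigma**2 + mu**2) / (2 * prior_sigma**2) - 0.5)

# The same mean costs less when the network tolerates more noise:
print(mdl_penalty(5.0, sigma=0.01))  # precise weight: ~16.6 nats
print(mdl_penalty(5.0, sigma=0.5))   # vague weight:   ~12.8 nats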
Computing the Derivatives
The paper's technical contribution is showing how to compute exact gradients for both the expected error and the information content of noisy weights. With linear output units, this requires no Monte Carlo sampling. Training runs as fast as standard backpropagation.
This made the method practical in 1993, when compute was expensive. Modern implementations often use sampling-based approximations, but the exact method remains useful for smaller networks.
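A sketch of why linear outputs make this exact: with independent Gaussian weights, the expected squared error of a linear unit splits into a mean term plus a variance term, both in closed form (notation mine):

import numpy as np

# E[(y - w.x)^2] = (y - mu.x)^2 + sum_i sigma_i^2 * x_i^2
def expected_sq_error(x, y, mu, sigma):
    mean_pred = x @ mu                        # prediction at the mean weights
    noise_var = np.sum((sigma**2) * (x**2))   # variance added by noisy weights
    return (y - mean_pred)**2 + noise_var

x = np.array([1.0, 2.0])
mu = np.array([0.5, -0.3])
sigma = np.array([0.1, 0.2])
print(expected_sq_error(x, y=1.0, mu=mu, sigma=sigma))

# Both terms are differentiable in mu and sigma, so exact gradients
# flow through ordinary backpropagation, with no sampling.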
Legacy in Modern Deep Learning
Variational autoencoders apply the same recipe to latent variables: treat them as distributions, sample during training, and penalize the KL divergence from a prior. The VAE objective is essentially MDL applied to latent representations.
Bayesian neural networks extend this to all weights, maintaining full posterior distributions. Dropout can be interpreted as a crude approximation to Bayesian inference, randomly zeroing weights instead of sampling from learned distributions.
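A minimal sketch of that shared recipe for a single VAE latent (illustrative NumPy, standard-normal prior assumed):

import numpy as np

mu, log_var = 0.3, -1.0               # encoder outputs for one latent variable
eps = np.random.randn()
z = mu + np.exp(0.5 * log_var) * eps  # reparameterized sample, differentiable in mu and log_var

# KL(N(mu, sigma^2) || N(0, 1)): the "bits" penalty in the VAE objective
kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var)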
Modern pruning methods that remove weights below some threshold are doing manual compression. MDL does this automatically, learning which weights can be imprecise (and thus pruned) and which need to stay sharp.
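One common criterion for reading off what to prune, sketched below, ranks weights by signal-to-noise ratio |mu| / sigma: weights the MDL objective left vague score low and can be dropped (the threshold is illustrative):

import numpy as np

mu = np.array([5.0, 0.02, -1.3, 0.1])    # learned means
sigma = np.array([0.01, 0.5, 0.1, 0.4])  # learned standard deviations
snr = np.abs(mu) / sigma                 # signal-to-noise ratio per weight
keep = snr > 1.0                         # prune the rest
print(keep)                              # [ True False  True False]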
The Compression-Generalization Link
A theme runs through Sutskever's list: models that compress their inputs generalize better than models that memorize. Paper #1 (complexodynamics) discussed how complexity peaks then falls. This paper makes the connection to learning explicit.
A network that memorizes has high complexity in its weights. A network that generalizes has compressed its weights to just the information needed for the underlying pattern. MDL provides the objective function that rewards this compression.
More in This Series
Part of a series on Ilya Sutskever's recommended 30 papers, connecting each to practical AI development.