Sutskever 30 Deep Dives • Paper #23

The MDL Principle

Minimum Description Length

January 31, 2026 • 7 min read

A complex model fits data perfectly but memorizes noise. A simple model underfits. MDL resolves this tension: the best model minimizes the total bits needed to describe itself plus its errors. Compression equals understanding.

Data compression through model simplicity

MDL measures model quality in bits: model description length plus data-given-model description length. Simpler models need fewer bits unless complexity buys substantially better fit.

Why Sutskever Included This

MDL provides theoretical grounding for regularization, pruning, and architecture selection. It explains why simpler models generalize: they compress the true signal without memorizing noise. The principle connects information theory to practical machine learning decisions.

The Formula

MDL selects the model minimizing total description length:

MDL(Model) = L(Model) + L(Data | Model)

L(Model) penalizes complexity. L(Data | Model) penalizes poor fit. The sum captures total information.

More parameters increase L(Model). Better predictions decrease L(Data | Model). The optimum balances these competing pressures.
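
A minimal sketch of the formula in code, with assumptions that are mine rather than the paper's: polynomial regression as the model class, a flat 32 bits per parameter standing in for L(Model), and a Gaussian residual code for L(Data | Model).

```python
import numpy as np

# Toy two-part MDL score for polynomial regression -- a minimal sketch.
# Assumptions (not from the article): a flat 32 bits per parameter for
# L(Model), and Gaussian residuals for L(Data | Model). For continuous data
# the data term is a code length only up to a fixed quantization offset, so
# it can go negative; comparisons between models remain meaningful.
BITS_PER_PARAM = 32

def mdl_score(x, y, degree):
    coeffs = np.polyfit(x, y, degree)              # fit the candidate model
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(residuals.var(), 1e-12)           # plug-in noise variance
    nll_nats = 0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    l_data = nll_nats / np.log(2)                  # L(Data | Model), in bits
    l_model = (degree + 1) * BITS_PER_PARAM        # L(Model): crude parameter cost
    return l_model + l_data

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 2 * x**2 + rng.normal(scale=0.1, size=x.shape)   # quadratic signal plus noise

scores = {d: mdl_score(x, y, d) for d in range(1, 10)}
print(min(scores, key=scores.get))   # typically degree 2: extra terms don't pay for their bits
```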

Information-Theoretic Foundation

Shannon's coding theory quantifies information. Predictable data compresses well; random data doesn't. A model that predicts data accurately enables short encodings of residuals.

An outcome your model assigns probability p can be encoded in about -log2(p) bits, so high probability on the actual outcomes means short codes and poor predictions mean long ones. MDL rewards models that assign high probability to what actually happens.
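
A toy illustration with made-up numbers: ten observed outcomes, one model assigning each of them probability 0.9 and another spreading probability thinly at 0.25, summed into total code lengths.

```python
import math

# Total code length under an optimal code for each model's distribution:
# an outcome assigned probability p costs about -log2(p) bits.
def total_code_length_bits(probs_of_actual_outcomes):
    return sum(-math.log2(p) for p in probs_of_actual_outcomes)

sharp_model = [0.9] * 10    # assigns high probability to what actually happened
vague_model = [0.25] * 10   # spreads probability thinly over alternatives

print(total_code_length_bits(sharp_model))  # ~1.5 bits for all ten outcomes
print(total_code_length_bits(vague_model))  # 20.0 bits for the same outcomes
```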

Occam's Razor Made Precise

"Prefer simpler explanations" is ancient wisdom. MDL makes it quantitative. Extra model complexity must earn its keep through compression gains. If adding a parameter doesn't shorten the total description, remove it.

Information-theoretic optimality, not aesthetic preference.

Practical Applications

Architecture selection: Compare network sizes by their MDL scores rather than just validation loss. The description length of the architecture itself matters.

Pruning: Remove weights when the parameter savings exceed the accuracy loss. MDL provides the exchange rate between model bits and prediction bits.

Feature selection: Include features only when they improve overall compression. Irrelevant features cost description bits without payoff (see the sketch below).

Early stopping: Training beyond the MDL optimum means memorizing noise. The model description grows (implicitly, through precise weight values) without proportional compression gains.
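
Here is a rough greedy sketch of the feature-selection case, using the same kind of assumed costs as the earlier formula sketch (not a recipe from the paper): a feature is kept only when the drop in L(Data | Model) exceeds the bits it adds to L(Model).

```python
import numpy as np

# Greedy MDL-style feature selection -- a rough sketch with toy assumptions:
# a flat 32 bits per parameter for L(Model), Gaussian residuals for the data term.
BITS_PER_PARAM = 32

def data_bits(X, y):
    # L(Data | Model) for least squares with Gaussian residuals, in bits.
    if X.shape[1] == 0:
        residuals = y - y.mean()
    else:
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coeffs
    sigma2 = max(residuals.var(), 1e-12)
    return 0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1) / np.log(2)

def select_features(X, y):
    selected = []
    current_bits = data_bits(X[:, selected], y)
    for j in range(X.shape[1]):
        candidate = selected + [j]
        new_bits = data_bits(X[:, candidate], y)
        if current_bits - new_bits > BITS_PER_PARAM:  # compression gain pays the description cost
            selected, current_bits = candidate, new_bits
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=300)
print(select_features(X, y))  # typically keeps only the informative columns 0 and 2
```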

Connection to Regularization

L1 and L2 penalties approximate MDL's complexity term. They penalize large weights, which require more bits to specify precisely. The regularization coefficient sets the exchange rate between model complexity and fit.

MDL explains why regularization works: it isn't arbitrary; it's an approximation to optimal compression.
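
A loose sketch of that correspondence, assuming a zero-mean Gaussian "prior code" with standard deviation TAU for the weights and ignoring the quantization constant a true code length would need: the bits to describe each weight are a constant plus w^2 / (2 * TAU^2 * ln 2), an L2 penalty with a fixed exchange rate.

```python
import numpy as np

# Bits to describe a weight vector under an assumed Gaussian "prior code"
# N(0, TAU^2), ignoring quantization constants: -log2 of the density is a
# constant plus w^2 / (2 * TAU^2 * ln 2), i.e. an L2 penalty in disguise.
TAU = 1.0

def weight_bits(weights):
    weights = np.asarray(weights, dtype=float)
    const = 0.5 * np.log2(2 * np.pi * TAU**2) * weights.size
    return const + np.sum(weights**2) / (2 * TAU**2 * np.log(2))

print(weight_bits([0.1, -0.2, 0.05]))  # ~4 bits: small weights are cheap to describe
print(weight_bits([3.0, -4.0, 5.0]))   # ~40 bits: large weights cost far more
```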

Limitations

Computing exact description lengths requires specifying encoding schemes. Different choices yield different MDL scores. The principle is sound; practical application requires careful implementation.


Part of a series on Ilya Sutskever's recommended 30 papers.
