A complex model fits the data perfectly but memorizes noise. A simple model underfits. The Minimum Description Length (MDL) principle resolves this tension: the best model minimizes the total bits needed to describe itself plus its errors. Compression equals understanding.
Why Sutskever Included This
MDL provides theoretical grounding for regularization, pruning, and architecture selection. It explains why simpler models generalize: they compress the true signal without memorizing noise. The principle connects information theory to practical machine learning decisions.
The Formula
MDL selects the model minimizing total description length:
MDL(Model) = L(Model) + L(Data | Model)
L(Model) penalizes complexity. L(Data | Model) penalizes poor fit. The sum captures total information.
More parameters increase L(Model). Better predictions decrease L(Data | Model). The optimum balances these competing pressures.
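A minimal sketch of that trade-off, under illustrative assumptions (synthetic quadratic data, a fixed 16-bit cost per coefficient, and a Gaussian code for residuals quantized at a fixed precision): compare polynomial degrees by total description length and keep the cheapest one.

```python
# Toy two-part MDL model selection. The constants (BITS_PER_PARAM, the
# 8-bit residual precision, the synthetic data) are assumptions for
# illustration, not canonical choices.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.5 * x - 0.5 * x**2 + rng.normal(0, 0.1, size=x.shape)  # true signal + noise

BITS_PER_PARAM = 16  # assumed cost, in bits, to encode one coefficient

def mdl_score(degree):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(residuals.var(), 1e-12)
    n = len(y)
    # L(Model): bits to describe the coefficients.
    model_bits = BITS_PER_PARAM * (degree + 1)
    # L(Data | Model): Gaussian code length of residuals, plus 8 bits of
    # quantization precision per value (a crude proxy for a real code).
    data_bits = n * (0.5 * np.log2(2 * np.pi * np.e * sigma2) + 8)
    return model_bits + data_bits

scores = {d: round(mdl_score(d), 1) for d in range(9)}
best = min(scores, key=scores.get)
print(scores)
print("MDL-preferred degree:", best)  # typically 2: matches the true signal
                                      # without paying for extra coefficients
```

Note how the selection rule is exactly the one stated above: a higher degree is kept only if the bits it saves on the residuals exceed the bits it costs to describe.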
Information-Theoretic Foundation
Shannon's coding theory quantifies information. Predictable data compresses well; random data doesn't. A model that predicts data accurately enables short encodings of residuals.
If your model assigns high probability to the actual outcomes, the code lengths are short. Poor predictions mean long codes. MDL rewards models that assign high probability to what actually happens.
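The identity behind this is Shannon's code length: an outcome the model assigns probability p costs -log2(p) bits. A toy sketch with two illustrative categorical models over the same observations:

```python
# Code length under a model: total bits = sum of -log2(p) over the data.
# Both "models" below are made-up distributions for illustration.
import math

outcomes = ["a", "a", "b", "a", "c", "a"]       # observed data
good_model = {"a": 0.7, "b": 0.2, "c": 0.1}      # assigns high p to what happens
poor_model = {"a": 1/3, "b": 1/3, "c": 1/3}      # uninformative

def code_length_bits(model, data):
    # Bits needed to encode the data using the model's probabilities.
    return sum(-math.log2(model[x]) for x in data)

print(code_length_bits(good_model, outcomes))  # ~7.7 bits: shorter code
print(code_length_bits(poor_model, outcomes))  # ~9.5 bits: longer code
```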
Occam's Razor Made Precise
"Prefer simpler explanations" is ancient wisdom. MDL makes it quantitative. Extra model complexity must earn its keep through compression gains. If adding a parameter doesn't shorten the total description, remove it.
Information-theoretic optimality, not aesthetic preference.
Practical Applications
Architecture selection: Compare network sizes by their MDL scores rather than just validation loss. The description length of the architecture itself matters.
Pruning: Remove weights when the parameter savings exceed the accuracy loss. MDL provides the exchange rate between model bits and prediction bits.
Feature selection: Include features only when they improve overall compression. Irrelevant features cost description bits without payoff (sketched in code below).
Early stopping: Training beyond the MDL optimum means memorizing noise. The model description grows (implicitly, through precise weight values) without proportional compression gains.
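A hypothetical sketch of the feature-selection case, assuming a linear-Gaussian model, synthetic data, and an illustrative 32-bit cost per included coefficient: features are added greedily only while they shorten the total description.

```python
# MDL-guided forward feature selection on synthetic data. The per-feature
# bit cost and residual precision are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 6
X = rng.normal(size=(n, d))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(0, 0.5, size=n)  # only 0 and 2 matter

BITS_PER_FEATURE = 32  # assumed cost to describe one included coefficient

def total_bits(feature_idx):
    if feature_idx:
        Xs = X[:, feature_idx]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ coef
    else:
        resid = y
    sigma2 = max(resid.var(), 1e-12)
    model_bits = BITS_PER_FEATURE * len(feature_idx)
    data_bits = n * (0.5 * np.log2(2 * np.pi * np.e * sigma2) + 8)
    return model_bits + data_bits

selected, remaining = [], list(range(d))
while True:
    current = total_bits(selected)
    # Try each remaining feature; keep the one that shrinks the total most.
    best_bits, best_j = min((total_bits(selected + [j]), j) for j in remaining)
    if best_bits >= current:   # no feature pays for its description cost
        break
    selected.append(best_j)
    remaining.remove(best_j)

print("Selected features:", sorted(selected))  # expect [0, 2]
```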
Connection to Regularization
L1 and L2 penalties approximate MDL's complexity term. They penalize large weights, which require more bits to specify precisely. The regularization coefficient sets the exchange rate between model complexity and fit.
MDL explains why regularization works: it is not an arbitrary trick but an approximation to optimal compression.
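A rough illustration of that correspondence, assuming weights are coded under a zero-mean Gaussian prior: the bits needed to describe the weights grow with their squared magnitude, which is exactly what an L2 penalty charges (up to constants and a fixed discretization offset).

```python
# L2 penalty as a description-length term. The prior width and the weight
# vectors are illustrative assumptions.
import numpy as np

PRIOR_SIGMA = 1.0  # assumed width of the Gaussian "codebook" for weights

def weight_bits(w, sigma=PRIOR_SIGMA):
    # -log2 N(w; 0, sigma^2) summed over weights: a constant per weight
    # plus a term proportional to sum(w**2), i.e. the L2 penalty.
    return np.sum(0.5 * np.log2(2 * np.pi * sigma**2)
                  + (w**2) / (2 * sigma**2 * np.log(2)))

small_w = np.array([0.1, -0.2, 0.05])
large_w = np.array([3.0, -4.0, 2.5])

print(weight_bits(small_w))  # ~4 bits: small weights are cheap to describe
print(weight_bits(large_w))  # ~27 bits: large weights cost precision
```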
Limitations
Computing exact description lengths requires specifying encoding schemes. Different choices yield different MDL scores. The principle is sound; practical application requires careful implementation.
More in This Series
Part of a series on Ilya Sutskever's recommended 30 papers.