How much information does a string contain? Kolmogorov complexity answers: the length of the shortest program that outputs it. A billion zeros compresses to a few bytes. A billion random bits don't compress at all. This distinction underlies learning itself.
Why Sutskever Included This
Kolmogorov complexity grounds machine learning theory. It formalizes Occam's razor, explains why regularization works, and reveals fundamental limits on learning. Understanding it clarifies what neural networks can and cannot do.
The Definition
K(x) is the length of the shortest program that outputs string x and halts:
K(x) = min { |p| : U(p) = x }
U is a universal Turing machine. |p| is program length. The minimum is over all programs that output x.
K(x) measures intrinsic information content, capturing how much structure x has, independent of representation.
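To make the definition concrete, here is a minimal brute-force sketch in Python. It uses a made-up three-instruction language in place of a real universal Turing machine; the instruction set, the run/k_toy names, and the resource cap are illustrative assumptions, not a standard construction.

```python
from itertools import product

# Toy stand-in for a universal machine (illustrative, not a standard construction):
#   '0' -> append "0"    '1' -> append "1"    'D' -> double the output so far
def run(program, cap):
    out = ""
    for op in program:
        if op == "0":
            out += "0"
        elif op == "1":
            out += "1"
        elif op == "D":
            out += out
        if len(out) > cap:      # resource bound so the search always terminates
            return None
    return out

def k_toy(x, max_prog_len=12):
    """Length of the shortest toy program that outputs x, found by brute force."""
    for length in range(1, max_prog_len + 1):
        for prog in product("01D", repeat=length):
            if run(prog, cap=len(x)) == x:
                return length
    return None  # nothing found within the length budget

print(k_toy("0" * 16))    # highly structured: 5 ("0", then double four times)
print(k_toy("01101001"))  # no doubling helps here: 8 (spelled out literally)
```

The doubling instruction is what lets structured strings get short programs; for a string with no exploitable pattern, the search bottoms out at spelling the string out literally.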
Randomness = Incompressibility
A string x is algorithmically random if K(x) ≈ |x|. No program shorter than the string itself can generate it. There's no pattern to exploit, no compression possible.
Structured data compresses. "Print 0 one billion times" is shorter than a billion zeros. Random data doesn't compress, and the shortest description is the data itself.
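A quick way to see this on a laptop, using Python's built-in zlib module. Any general-purpose compressor would behave similarly; the exact byte counts vary with the compressor and level.

```python
import os
import zlib

structured = b"\x00" * 1_000_000        # a million zero bytes
random_data = os.urandom(1_000_000)     # a million bytes from the OS entropy source

print(len(zlib.compress(structured, 9)))   # on the order of 1 KB: the run collapses
print(len(zlib.compress(random_data, 9)))  # slightly more than 1 MB: no savings
```

The compressed sizes also serve as upper bounds on K for these strings, up to the fixed cost of the decompressor, which is exactly the kind of approximation the next section turns to.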
Uncomputability
K(x) cannot be computed: no algorithm returns K(x) for every input. This connects to the halting problem: certifying that no shorter program outputs x means deciding whether arbitrary shorter programs halt, and that is undecidable.
We can approximate K(x) from above: any compression algorithm provides an upper bound. We cannot compute it exactly.
Invariance Theorem
Different programming languages give different K(x) values. But the invariance theorem shows these differ by at most a constant independent of x. For long strings, the choice of language doesn't matter much.
K(x) is an intrinsic property of x, not an artifact of representation.
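The standard statement: for any two universal machines U and V, and for every string x,
K_U(x) ≤ K_V(x) + c_{U,V}
The constant c_{U,V} is essentially the length of an interpreter for V written for U; it depends on the two machines but not on x.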
Connection to Learning
Occam's razor formalized: Simpler hypotheses (lower K) deserve higher prior probability. Solomonoff induction makes this precise by weighting each hypothesis h by 2^(-K(h)); a numeric sketch follows below.
Generalization: Models with low description length generalize better. High K means memorization; low K means the model found genuine structure.
No free lunch: No universal learner exists, because a high-K (random) problem has no low-K solution to find. Learning requires matching inductive bias to problem structure.
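To see what the 2^(-K(h)) weighting does numerically, here is a small Python sketch with made-up description lengths for three hypothetical hypotheses; real K values are uncomputable, so in practice these would be code lengths under some concrete description language.

```python
# Made-up description lengths (in bits) for three hypothetical hypotheses.
# True K values are uncomputable; in practice these would be code lengths
# under some concrete description language.
description_bits = {"constant_rule": 8, "linear_rule": 20, "lookup_table": 120}

weights = {h: 2.0 ** -k for h, k in description_bits.items()}
total = sum(weights.values())
prior = {h: w / total for h, w in weights.items()}

for h, p in sorted(prior.items(), key=lambda kv: -kv[1]):
    print(f"{h:>14}: {p:.6g}")
# Every extra bit of description length halves a hypothesis's prior mass,
# so the 120-bit lookup table starts out ~2^112 times less plausible
# than the 8-bit rule, before any data is seen.
```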
Practical Approximations
Since K(x) is uncomputable, practitioners use:
Compression algorithms: gzip and LZMA give computable upper bounds on K(x).
MDL: Minimum Description Length approximates K(x) with computable code lengths (a rough sketch follows this list).
Regularization: L1/L2 penalties are crude heuristics for limiting model complexity.
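As a rough illustration of the MDL item above, the sketch below scores candidate polynomial degrees by a two-part code: bits for the fitted parameters plus bits for the compressed, quantized residuals. The 32-bits-per-parameter cost, the 0.01 quantization step, and the use of zlib as the residual coder are arbitrary choices for illustration, not a standard MDL recipe.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 1.5 * x**2 - 2.0 * x + 0.5 + rng.normal(scale=0.5, size=x.size)  # quadratic + noise

def description_bits(degree, quantum=0.01):
    """Crude two-part code: bits for the parameters + bits for compressed residuals."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    param_bits = 32 * (degree + 1)                # arbitrary fixed precision per parameter
    quantized = np.round(residuals / quantum).astype(np.int32)
    residual_bits = 8 * len(zlib.compress(quantized.tobytes(), 9))
    return param_bits + residual_bits

for d in range(7):
    print(d, description_bits(d))
# Degrees that are too low pay heavily in residual bits; degrees that are too
# high pay in parameter bits without shrinking the residuals much.
```

Real MDL work derives the parameter cost more carefully (for example via normalized maximum likelihood or prequential coding), but the trade-off it formalizes is the same.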
More in This Series
Part of a series on Ilya Sutskever's recommended 30 papers.