Sutskever 30 Deep Dives • Paper #25

Kolmogorov Complexity

Algorithmic Information Theory

January 31, 2026 • 7 min read

How much information does a string contain? Kolmogorov complexity answers: the length of the shortest program that outputs it. A billion zeros compresses to a few bytes. A billion random bits don't compress at all. This distinction underlies learning itself.

String compression and algorithmic complexity
Structured data compresses because short programs generate it. Random data resists compression: no pattern means no shortcut.

Why Sutskever Included This

Kolmogorov complexity grounds machine learning theory. It formalizes Occam's razor, explains why regularization works, and reveals fundamental limits on learning. Understanding it clarifies what neural networks can and cannot do.

The Definition

K(x) is the length of the shortest program that outputs string x and halts:

K(x) = min { |p| : U(p) = x }

U is a fixed universal Turing machine, |p| is the length of program p, and the minimum is taken over all programs that make U output x and halt.

K(x) measures intrinsic information content, capturing how much structure x has, independent of representation.
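
The definition gives an immediate upper bound: a program that simply prints x verbatim always exists, so K(x) ≤ |x| + c for a small constant c. A minimal sketch, with Python standing in for the universal machine U (an assumption of convenience, not the article's construction):

```python
def literal_program(x: str) -> str:
    """Return a program that prints x verbatim.

    Its length is |x| plus a few characters of overhead, so
    K(x) <= |x| + c (the constant depends on the reference
    machine; Python stands in for U here).
    """
    return f"print({x!r})"

s = "0" * 1_000_000
print(len(s), len(literal_program(s)))  # 1000000 vs 1000009
```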

Randomness = Incompressibility

A string is algorithmically random if K(x) ≈ |x|. No program shorter than the string itself can generate it. There's no pattern to exploit, no compression possible.

Structured data compresses: "print 0 one billion times" is far shorter than a billion zeros. Random data doesn't compress; its shortest description is essentially the data itself.
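
Any off-the-shelf compressor makes the asymmetry concrete. A small sketch using Python's zlib (exact byte counts vary by compressor and level, but the gap does not):

```python
import os
import zlib

structured = b"0" * 1_000_000        # one million identical bytes
random_data = os.urandom(1_000_000)  # one million random bytes

print(len(zlib.compress(structured, 9)))   # ~1 KB: a short description exists
print(len(zlib.compress(random_data, 9)))  # ~1 MB: no pattern, no shortcut
```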

Uncomputability

K(x) cannot be computed. No algorithm exists that returns K(x) for all inputs. This connects to the halting problem: determining whether programs halt is undecidable, and computing K(x) would require solving it.
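
One standard way to see this is a Berry-paradox-style argument, sketched below under the assumption that a computable oracle for K existed (the `kolmogorov` parameter is hypothetical). The program is only about c + log(n) symbols long, yet it would output a string whose complexity exceeds n, which is impossible once n is large enough:

```python
import itertools
from typing import Callable

def first_string_with_K_above(n: int, kolmogorov: Callable[[str], int]) -> str:
    """Hypothetical: 'kolmogorov' is an assumed computable oracle for K.

    Enumerates binary strings in length-then-lexicographic order and
    returns the first one whose complexity exceeds n. Together with n,
    this program describes its output in roughly c + log(n) symbols,
    yet that output has K > n -- a contradiction for large n, so no
    such oracle can exist.
    """
    length = 1
    while True:
        for bits in itertools.product("01", repeat=length):
            s = "".join(bits)
            if kolmogorov(s) > n:  # call to the hypothetical oracle
                return s
        length += 1
```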

We can approximate K(x) from above: any compression algorithm provides an upper bound. We cannot compute it exactly.

Invariance Theorem

Different universal machines (or programming languages) assign different K(x) values, but the invariance theorem shows they differ by at most an additive constant that depends only on the two machines, not on x. For long strings, the choice of language doesn't matter much.
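
In symbols: for any two universal machines U and V there is a constant c(U, V), independent of x, such that for all strings x:

|K_U(x) − K_V(x)| ≤ c(U, V)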

Up to that additive constant, K(x) is an intrinsic property of x, not an artifact of representation.

Connection to Learning

Occam's razor formalized: Simpler hypotheses (lower K) deserve higher prior probability. Solomonoff induction makes this precise: weight each hypothesis h by 2^(-K(h)), as sketched below.

Generalization: Models with low description length generalize better. High K means memorization; low K means the model found genuine structure.

No free lunch: No universal learning algorithm exists because finding low-K solutions for high-K (random) problems is impossible. Learning requires matching inductive bias to problem structure.
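
The 2^(-K) weighting from the first point can be illustrated with a toy posterior in which description length stands in for K. This is a sketch only: true Solomonoff induction is uncomputable, and the hypothesis names, bit counts, and likelihoods below are made-up illustrative values.

```python
# Toy hypotheses: (name, description length in bits, likelihood of the data).
# All numbers are illustrative, not measured.
hypotheses = [
    ("all zeros",         20, 1.0),   # short program that fits the data
    ("repeating pattern", 60, 1.0),   # longer program, also fits
    ("lookup table",    8000, 1.0),   # memorizes the data verbatim
]

def posterior(hs):
    """Weight each hypothesis by 2^(-K) times its likelihood, then normalize."""
    scores = [2.0 ** (-k) * lik for _, k, lik in hs]
    total = sum(scores)
    return [(name, s / total) for (name, _, _), s in zip(hs, scores)]

for name, p in posterior(hypotheses):
    print(f"{name:18} {p:.2e}")
```

The shortest hypothesis that still explains the data dominates the posterior, while the memorizing hypothesis gets essentially zero weight, which is the Occam's-razor behavior described above.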

Practical Approximations

Since K(x) is uncomputable, practitioners use:

Compression algorithms: gzip and LZMA provide upper bounds on K(x).

MDL: Minimum Description Length replaces K(x) with a two-part code length (model plus data given the model) over a restricted, computable model class.

Regularization: L1/L2 penalties are crude heuristics for limiting model complexity.
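
For the first item, a minimal sketch: each compressed size is an upper bound on K(x), up to the constant-size decompressor for that format, and the smallest bound available is the tightest one we can report.

```python
import bz2
import lzma
import zlib

def k_upper_bounds(data: bytes) -> dict:
    """Compressed sizes in bytes; each upper-bounds K(data) up to a constant."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2":  len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }

text = b"the quick brown fox jumps over the lazy dog " * 200
bounds = k_upper_bounds(text)
print(bounds, "best:", min(bounds.values()), "bytes; original:", len(text))
```

Different compressors capture different regularities, which is why the minimum over the available bounds is the one worth reporting.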


Part of a series on Ilya Sutskever's recommended 30 papers.
