
Sutskever 30

Ilya Sutskever's recommended reading list. The 30 research papers he considers essential for understanding AI.

#30

Lost in the Middle

Language models struggle with information in the middle of long contexts, performing best on content near the beginning and end: a U-shaped performance curve.

#29

Retrieval-Augmented Generation

Combining retrieval with generation. External knowledge grounds language models in facts.

#28

Dense Passage Retrieval

Learning embeddings for questions and passages. Dual encoders and contrastive learning beat BM25.

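A minimal numpy sketch of the dual-encoder idea: each question in a batch is scored against every passage, and the other passages serve as in-batch negatives (toy random embeddings, not DPR's trained encoders):

```python
import numpy as np

# Toy in-batch contrastive loss for a dual encoder (a sketch, not DPR's exact setup).
# Each question embedding should score highest against its own passage embedding;
# the other passages in the batch act as negatives.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))             # question embeddings
p = q + 0.1 * rng.normal(size=(4, 8))   # matching passage embeddings (correlated)

scores = q @ p.T                        # similarity of every question to every passage
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))     # NLL of the correct (diagonal) passage

print(round(float(loss), 4))
```

Minimizing this loss pulls each question toward its own passage and away from the rest of the batch.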
#27

Multi-Token Prediction

Predicting multiple tokens at once improves sample efficiency and enables speculative decoding.

#26

CNN Fundamentals (CS231n)

Convolutional layers, pooling, ReLU, and backpropagation. Stanford's CS231n course distilled.

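The core layer can be sketched in a few lines: a "valid" 2D cross-correlation (what deep-learning libraries call convolution) followed by ReLU. The input and kernel here are toy values, not anything from the course:

```python
import numpy as np

# Minimal "valid" 2D cross-correlation plus ReLU: one CNN layer, by hand.
def conv2d(x, k):
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[-1.0, 0.0], [0.0, 1.0]])   # responds to diagonal differences
y = np.maximum(conv2d(x, k), 0.0)          # ReLU
print(y.shape)  # (3, 3)
```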
#25

Kolmogorov Complexity

The shortest program that outputs a string. Incompressibility equals randomness.

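Kolmogorov complexity itself is uncomputable, but a general-purpose compressor gives a crude upper bound, which makes the incompressibility idea tangible (this is an illustration, not part of the formal theory):

```python
import random
import zlib

# Compressed size as a crude upper bound on Kolmogorov complexity:
# a highly regular string compresses far better than a "random" one.
regular = b"ab" * 500
rng = random.Random(0)
irregular = bytes(rng.getrandbits(8) for _ in range(1000))

c_reg = len(zlib.compress(regular, 9))
c_irr = len(zlib.compress(irregular, 9))
print(c_reg, c_irr)
```

The regular string shrinks to a few bytes because a short program ("repeat 'ab' 500 times") generates it; the random bytes have no such shortcut.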
#24

Machine Superintelligence

Formal definitions of machine intelligence. Recursive self-improvement and the path to superintelligence.

#23

The MDL Principle

Minimum Description Length balances model complexity against fit. Occam's razor made mathematical.

#22

Scaling Laws

Power laws for neural language models. How performance scales with compute, data, and parameters.

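The fitting procedure behind such plots is just linear regression in log-log space. A sketch on synthetic data (the loss values and exponent here are made up, not the paper's measurements):

```python
import numpy as np

# Recover a power-law exponent alpha from loss-vs-compute data, L = a * C^(-alpha),
# by fitting a line in log-log space.
compute = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
alpha_true = 0.05
loss = 3.0 * compute ** (-alpha_true)      # synthetic, noiseless data

slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha_hat = -slope
print(round(alpha_hat, 3))  # 0.05
```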
#21

CTC Loss

Training speech models without frame-level alignment. The blank token and alignment-free learning.

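The collapse rule at the heart of CTC fits in a few lines: merge repeated labels, then drop blanks. The loss sums over all per-frame alignments that collapse to the target (a sketch of the decoding rule only, not the full forward-backward loss):

```python
# CTC's collapse rule: merge repeated labels, then remove blanks.
# A blank between two identical labels is what allows doubled letters.
BLANK = "-"

def ctc_collapse(frames):
    out = []
    prev = None
    for f in frames:
        if f != prev and f != BLANK:
            out.append(f)
        prev = f
    return "".join(out)

print(ctc_collapse(list("--hh-e-ll-ll--oo-")))  # hello
```

Note the blank between the two "ll" runs: without it, the repeats would merge into a single "l".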
#20

Neural Turing Machines

Differentiable computers with external memory banks. Content and location-based addressing.

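Content-based addressing can be sketched directly: cosine similarity between a key and each memory row, sharpened by a temperature and softmaxed into read weights (the memory contents and key here are toy values):

```python
import numpy as np

# NTM-style content-based addressing: cosine similarity between a key and
# each memory row, sharpened by beta and normalized into read weights.
def content_address(memory, key, beta=5.0):
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key))
    w = np.exp(beta * sims)
    return w / w.sum()

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
key = np.array([1.0, 0.05, 0.0])
w = content_address(memory, key)
read = w @ memory        # differentiable "read": a weighted sum of rows
print(w.argmax())        # 0
```

Because the read is a soft weighted sum rather than a hard lookup, gradients flow through the addressing step.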
#19

The Coffee Automaton

Why does coffee mix but never unmix? Entropy, coarse-graining, and the arrow of time.

#18

Relational Recurrent Neural Networks

Memory slots that attend to each other enable multi-step reasoning.

#17

Variational Autoencoders

Learning to generate data by encoding it into structured latent spaces. ELBO and the reparameterization trick.

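The reparameterization trick in one line: sample z = mu + sigma * eps with eps drawn from a standard normal, so gradients can flow through mu and sigma. A sketch with toy parameter values, plus the closed-form KL term of the ELBO:

```python
import numpy as np

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
# making the sample differentiable in mu and sigma.
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, 0.2])
sigma = np.exp(0.5 * log_var)

eps = rng.normal(size=(10000, 2))
z = mu + sigma * eps

# KL(q(z|x) || N(0, I)) term of the ELBO, in closed form for a diagonal Gaussian:
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(z.mean(axis=0), float(kl))
```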
#16

Relational Reasoning

Relation Networks compare all object pairs to answer questions about relationships.

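The structural idea is a sum of a shared pairwise function over all object pairs, which makes the output independent of object order. A sketch where the pairwise function is a fixed stand-in rather than a trained MLP:

```python
import numpy as np

# Relation Network sketch: apply a shared function g to every ordered pair
# of objects and sum the results (permutation-invariant by construction).
def g(oi, oj):
    return np.tanh(oi + 2.0 * oj)      # stand-in for a learned pairwise MLP

def relation_net(objects):
    return sum(g(oi, oj) for oi in objects for oj in objects)

objs = [np.array([0.1, 0.2]), np.array([0.3, -0.1]), np.array([-0.2, 0.4])]
out1 = relation_net(objs)
out2 = relation_net(objs[::-1])        # same objects, different order
print(np.allclose(out1, out2))  # True
```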
#15

Identity Mappings in Deep Residual Networks

Moving activation before convolution enables training 1000-layer networks.

#14

Bahdanau Attention

Neural machine translation by jointly learning to align and translate. The original attention mechanism.

#13

Attention Is All You Need

The Transformer architecture replaced recurrence with self-attention. Foundation of GPT, BERT, and modern AI.

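The core operation, scaled dot-product attention, is softmax(QK^T / sqrt(d)) V. A numpy sketch where Q, K, V are random stand-ins for the learned projections:

```python
import numpy as np

# Scaled dot-product attention from the Transformer:
# softmax(Q K^T / sqrt(d)) V, with a max-subtraction for numerical stability.
def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Every output position is a weighted mix of all value vectors, so information can flow between any two positions in a single layer, with no recurrence.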
#12

Graph Neural Networks

Message passing on graphs. Nodes aggregate information from neighbors to learn representations.

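One round of message passing can be sketched as mean aggregation over each node's neighborhood (a generic GNN illustration on a toy graph, not any specific published architecture):

```python
import numpy as np

# One round of mean-aggregation message passing: each node's new feature
# averages its neighbors' features and its own (via a self-loop).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.array([[1.0], [2.0], [3.0], [4.0]])

A_hat = adj + np.eye(4)                    # add self-loops
deg = A_hat.sum(axis=1, keepdims=True)
x_new = (A_hat @ x) / deg                  # mean over the neighborhood
print(x_new.ravel())
```

Stacking k such rounds lets information propagate k hops across the graph.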
#11

Dilated Convolutions

Exponentially expanding receptive fields without losing resolution. WaveNet's secret for audio generation.

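A 1D sketch of the mechanism: kernel taps are spaced `dilation` steps apart, so stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth (toy input and kernel, not WaveNet's gated units):

```python
import numpy as np

# 1D dilated convolution: kernel taps are `dilation` steps apart,
# so the same 2-tap kernel spans a wider window at higher dilation.
def dilated_conv1d(x, k, dilation):
    span = (len(k) - 1) * dilation
    return np.array([
        sum(k[j] * x[i + j * dilation] for j in range(len(k)))
        for i in range(len(x) - span)
    ])

x = np.arange(16, dtype=float)
k = np.array([-1.0, 1.0])
y1 = dilated_conv1d(x, k, dilation=1)   # differences 1 step apart
y4 = dilated_conv1d(x, k, dilation=4)   # differences 4 steps apart
print(y1[:3], y4[:3])
```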
#10

Deep Residual Learning (ResNet)

Skip connections let gradients flow through 152 layers. Learning residuals instead of direct mappings.

#9

GPipe

Pipeline parallelism for training giant neural networks. Micro-batches keep accelerators busy.

#8

Order Matters: Seq2Seq for Sets

Sets have no order, but neural networks need sequences. How to handle permutation invariance.

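The tension can be shown directly: a sum over element embeddings is permutation-invariant, while a toy recurrence over the same elements is not (an illustration of the problem the paper studies, not its read-process-write solution):

```python
import numpy as np

# Sum pooling ignores order; a sequential recurrence does not.
def set_encode(elements):
    return np.sum(elements, axis=0)        # order-independent pooling

def naive_rnn(elements, w=0.9):
    h = np.zeros_like(elements[0])         # toy recurrence: order-dependent
    for e in elements:
        h = np.tanh(w * h + e)
    return h

elems = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
print(np.allclose(set_encode(elems), set_encode(elems[::-1])))  # True
print(np.allclose(naive_rnn(elems), naive_rnn(elems[::-1])))    # False
```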
#7

AlexNet

Krizhevsky, Sutskever, and Hinton's 2012 ImageNet victory proved deep learning could outperform hand-engineered computer vision.

#6

Pointer Networks

Vinyals, Fortunato, and Jaitly repurposed attention to point at input positions instead of blending hidden states.

#5

Keeping Neural Networks Simple

Hinton and van Camp showed that penalizing weight complexity leads to better generalization.

#4

Recurrent Neural Network Regularization

Zaremba, Sutskever, and Vinyals figured out how to apply dropout to LSTMs without breaking them.

#3

Understanding LSTM Networks

Christopher Olah's 2015 post explained LSTM gates with clarity that textbooks lacked.

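One step of the standard LSTM gate equations, as a numpy sketch (the weights are random stand-ins, not trained values):

```python
import numpy as np

# One LSTM cell step: input, forget, and output gates modulate a cell state
# that runs through time largely unchanged (Olah's "conveyor belt").
def lstm_step(x, h, c, W, b):
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))  # sigmoid gates
    c_new = f * c + i * np.tanh(g)     # cell state: forget old, write new
    h_new = o * np.tanh(c_new)         # hidden state: gated read of the cell
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = rng.normal(size=(4 * n_h, n_in + n_h))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The additive update to `c_new` is what lets gradients flow across many timesteps without vanishing.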
#2

The Unreasonable Effectiveness of RNNs

Karpathy's famous 2015 post showed RNNs could generate Shakespeare and Linux code by predicting one character at a time.

#1

Why Coffee Mixes But Never Unmixes

Scott Aaronson's First Law of Complexodynamics explains why complexity rises, peaks, then falls.
