
Sutskever 30

Ilya Sutskever's recommended reading list. The 30 research papers he considers essential for understanding AI.

#30

Lost in the Middle

Language models struggle with information in the middle of long contexts, performing best on content near the beginning and end: a U-shaped performance curve.

#29

Retrieval-Augmented Generation

Combining retrieval with generation. External knowledge grounds language models in facts.

#28

Dense Passage Retrieval

Learning embeddings for questions and passages. Dual encoders and contrastive learning beat BM25.

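A minimal numpy sketch of the dual-encoder idea: each question in a batch is scored against every passage, and the other passages serve as in-batch negatives (toy random embeddings, not DPR's trained encoders):

```python
import numpy as np

# Toy in-batch contrastive loss for a dual encoder (a sketch, not DPR's exact setup).
# Each question embedding should score highest against its own passage embedding;
# the other passages in the batch act as negatives.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))             # question embeddings
p = q + 0.1 * rng.normal(size=(4, 8))   # matching passage embeddings (correlated)

scores = q @ p.T                        # similarity of every question to every passage
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))     # NLL of the correct (diagonal) passage

print(round(float(loss), 4))
```

Minimizing this loss pulls each question toward its own passage and away from the rest of the batch.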
#27

Multi-Token Prediction

Predicting multiple tokens at once improves sample efficiency and enables speculative decoding.

#26

CNN Fundamentals (CS231n)

Convolutional layers, pooling, ReLU, and backpropagation. Stanford's CS231n course distilled.

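The core layer can be sketched in a few lines: a "valid" 2D cross-correlation (what deep-learning libraries call convolution) followed by ReLU. The input and kernel here are toy values, not anything from the course:

```python
import numpy as np

# Minimal "valid" 2D cross-correlation plus ReLU: one CNN layer, by hand.
def conv2d(x, k):
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[-1.0, 0.0], [0.0, 1.0]])   # responds to diagonal differences
y = np.maximum(conv2d(x, k), 0.0)          # ReLU
print(y.shape)  # (3, 3)
```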
#25

Kolmogorov Complexity

The shortest program that outputs a string. Incompressibility equals randomness.

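Kolmogorov complexity itself is uncomputable, but a general-purpose compressor gives a crude upper bound, which makes the incompressibility idea tangible (this is an illustration, not part of the formal theory):

```python
import random
import zlib

# Compressed size as a crude upper bound on Kolmogorov complexity:
# a highly regular string compresses far better than a "random" one.
regular = b"ab" * 500
rng = random.Random(0)
irregular = bytes(rng.getrandbits(8) for _ in range(1000))

c_reg = len(zlib.compress(regular, 9))
c_irr = len(zlib.compress(irregular, 9))
print(c_reg, c_irr)
```

The regular string shrinks to a few bytes because a short program ("repeat 'ab' 500 times") generates it; the random bytes have no such shortcut.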
#24

Machine Superintelligence

Formal definitions of machine intelligence. Recursive self-improvement and the path to superintelligence.

#23

The MDL Principle

Minimum Description Length balances model complexity against fit. Occam's razor made mathematical.

#22

Scaling Laws

Power laws for neural language models. How performance scales with compute, data, and parameters.

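The fitting procedure behind such plots is just linear regression in log-log space. A sketch on synthetic data (the loss values and exponent here are made up, not the paper's measurements):

```python
import numpy as np

# Recover a power-law exponent alpha from loss-vs-compute data, L = a * C^(-alpha),
# by fitting a line in log-log space.
compute = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
alpha_true = 0.05
loss = 3.0 * compute ** (-alpha_true)      # synthetic, noiseless data

slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha_hat = -slope
print(round(alpha_hat, 3))  # 0.05
```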
#21

CTC Loss

Training speech models without frame-level alignment. The blank token and alignment-free learning.

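The collapse rule at the heart of CTC fits in a few lines: merge repeated labels, then drop blanks. The loss sums over all per-frame alignments that collapse to the target (a sketch of the decoding rule only, not the full forward-backward loss):

```python
# CTC's collapse rule: merge repeated labels, then remove blanks.
# A blank between two identical labels is what allows doubled letters.
BLANK = "-"

def ctc_collapse(frames):
    out = []
    prev = None
    for f in frames:
        if f != prev and f != BLANK:
            out.append(f)
        prev = f
    return "".join(out)

print(ctc_collapse(list("--hh-e-ll-ll--oo-")))  # hello
```

Note the blank between the two "ll" runs: without it, the repeats would merge into a single "l".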
#20

Neural Turing Machines

Differentiable computers with external memory banks. Content and location-based addressing.

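Content-based addressing can be sketched directly: cosine similarity between a key and each memory row, sharpened by a temperature and softmaxed into read weights (the memory contents and key here are toy values):

```python
import numpy as np

# NTM-style content-based addressing: cosine similarity between a key and
# each memory row, sharpened by beta and normalized into read weights.
def content_address(memory, key, beta=5.0):
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key))
    w = np.exp(beta * sims)
    return w / w.sum()

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
key = np.array([1.0, 0.05, 0.0])
w = content_address(memory, key)
read = w @ memory        # differentiable "read": a weighted sum of rows
print(w.argmax())        # 0
```

Because the read is a soft weighted sum rather than a hard lookup, gradients flow through the addressing step.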
#19

The Coffee Automaton

Why does coffee mix but never unmix? Entropy, coarse-graining, and the arrow of time.

#18

Relational Recurrent Neural Networks

Memory slots that attend to each other enable multi-step reasoning.

#17

Variational Autoencoders

Learning to generate data by encoding it into structured latent spaces. ELBO and the reparameterization trick.

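The reparameterization trick in one line: sample z = mu + sigma * eps with eps drawn from a standard normal, so gradients can flow through mu and sigma. A sketch with toy parameter values, plus the closed-form KL term of the ELBO:

```python
import numpy as np

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
# making the sample differentiable in mu and sigma.
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, 0.2])
sigma = np.exp(0.5 * log_var)

eps = rng.normal(size=(10000, 2))
z = mu + sigma * eps

# KL(q(z|x) || N(0, I)) term of the ELBO, in closed form for a diagonal Gaussian:
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(z.mean(axis=0), float(kl))
```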
#16

Relational Reasoning

Relation Networks compare all object pairs to answer questions about relationships.

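The structural idea is a sum of a shared pairwise function over all object pairs, which makes the output independent of object order. A sketch where the pairwise function is a fixed stand-in rather than a trained MLP:

```python
import numpy as np

# Relation Network sketch: apply a shared function g to every ordered pair
# of objects and sum the results (permutation-invariant by construction).
def g(oi, oj):
    return np.tanh(oi + 2.0 * oj)      # stand-in for a learned pairwise MLP

def relation_net(objects):
    return sum(g(oi, oj) for oi in objects for oj in objects)

objs = [np.array([0.1, 0.2]), np.array([0.3, -0.1]), np.array([-0.2, 0.4])]
out1 = relation_net(objs)
out2 = relation_net(objs[::-1])        # same objects, different order
print(np.allclose(out1, out2))  # True
```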
#15

Identity Mappings in Deep Residual Networks

Moving activation before convolution enables training 1000-layer networks.

#14

Bahdanau Attention

Neural machine translation by jointly learning to align and translate. The original attention mechanism.

#13

Attention Is All You Need

The Transformer architecture replaced recurrence with self-attention. Foundation of GPT, BERT, and modern AI.

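The core operation, scaled dot-product attention, is softmax(QK^T / sqrt(d)) V. A numpy sketch where Q, K, V are random stand-ins for the learned projections:

```python
import numpy as np

# Scaled dot-product attention from the Transformer:
# softmax(Q K^T / sqrt(d)) V, with a max-subtraction for numerical stability.
def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Every output position is a weighted mix of all value vectors, so information can flow between any two positions in a single layer, with no recurrence.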
#12

Graph Neural Networks

Message passing on graphs. Nodes aggregate information from neighbors to learn representations.

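One round of message passing can be sketched as mean aggregation over each node's neighborhood (a generic GNN illustration on a toy graph, not any specific published architecture):

```python
import numpy as np

# One round of mean-aggregation message passing: each node's new feature
# averages its neighbors' features and its own (via a self-loop).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.array([[1.0], [2.0], [3.0], [4.0]])

A_hat = adj + np.eye(4)                    # add self-loops
deg = A_hat.sum(axis=1, keepdims=True)
x_new = (A_hat @ x) / deg                  # mean over the neighborhood
print(x_new.ravel())
```

Stacking k such rounds lets information propagate k hops across the graph.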
#11

Dilated Convolutions

Exponentially expanding receptive fields without losing resolution. WaveNet's secret for audio generation.

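A 1D sketch of the mechanism: kernel taps are spaced `dilation` steps apart, so stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth (toy input and kernel, not WaveNet's gated units):

```python
import numpy as np

# 1D dilated convolution: kernel taps are `dilation` steps apart,
# so the same 2-tap kernel spans a wider window at higher dilation.
def dilated_conv1d(x, k, dilation):
    span = (len(k) - 1) * dilation
    return np.array([
        sum(k[j] * x[i + j * dilation] for j in range(len(k)))
        for i in range(len(x) - span)
    ])

x = np.arange(16, dtype=float)
k = np.array([-1.0, 1.0])
y1 = dilated_conv1d(x, k, dilation=1)   # differences 1 step apart
y4 = dilated_conv1d(x, k, dilation=4)   # differences 4 steps apart
print(y1[:3], y4[:3])
```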
#10

Deep Residual Learning (ResNet)

Skip connections let gradients flow through 152 layers. Learning residuals instead of direct mappings.

#9

GPipe

Pipeline parallelism for training giant neural networks. Micro-batches keep accelerators busy.

#8

Order Matters: Seq2Seq for Sets

Sets have no order, but neural networks need sequences. How to handle permutation invariance.

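The tension can be shown directly: a sum over element embeddings is permutation-invariant, while a toy recurrence over the same elements is not (an illustration of the problem the paper studies, not its read-process-write solution):

```python
import numpy as np

# Sum pooling ignores order; a sequential recurrence does not.
def set_encode(elements):
    return np.sum(elements, axis=0)        # order-independent pooling

def naive_rnn(elements, w=0.9):
    h = np.zeros_like(elements[0])         # toy recurrence: order-dependent
    for e in elements:
        h = np.tanh(w * h + e)
    return h

elems = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
print(np.allclose(set_encode(elems), set_encode(elems[::-1])))  # True
print(np.allclose(naive_rnn(elems), naive_rnn(elems[::-1])))    # False
```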
#7

AlexNet

Krizhevsky, Sutskever, and Hinton's 2012 ImageNet victory proved deep learning could outperform hand-engineered computer vision.

#6

Pointer Networks

Vinyals, Fortunato, and Jaitly repurposed attention to point at input positions instead of blending hidden states.

#5

Keeping Neural Networks Simple

Hinton and van Camp showed that penalizing weight complexity leads to better generalization.

#4

Recurrent Neural Network Regularization

Zaremba, Sutskever, and Vinyals figured out how to apply dropout to LSTMs without breaking them.

#3

Understanding LSTM Networks

Christopher Olah's 2015 post explained LSTM gates with clarity that textbooks lacked.

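One step of the standard LSTM gate equations, as a numpy sketch (the weights are random stand-ins, not trained values):

```python
import numpy as np

# One LSTM cell step: input, forget, and output gates modulate a cell state
# that runs through time largely unchanged (Olah's "conveyor belt").
def lstm_step(x, h, c, W, b):
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))  # sigmoid gates
    c_new = f * c + i * np.tanh(g)     # cell state: forget old, write new
    h_new = o * np.tanh(c_new)         # hidden state: gated read of the cell
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = rng.normal(size=(4 * n_h, n_in + n_h))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The additive update to `c_new` is what lets gradients flow across many timesteps without vanishing.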
#2

The Unreasonable Effectiveness of RNNs

Karpathy's famous 2015 post showed RNNs could generate Shakespeare and Linux code by predicting one character at a time.

#1

Why Coffee Mixes But Never Unmixes

Scott Aaronson's First Law of Complexodynamics explains why complexity rises, peaks, then falls.
