30 Papers

Sutskever 30

Ilya Sutskever's recommended reading list. The 30 research papers he considers essential for understanding AI.

#30

Lost in the Middle

Language models struggle with information in the middle of long contexts. Performance follows a U-shaped curve, highest when relevant facts sit at the start or end.

#29

Retrieval-Augmented Generation

Combining retrieval with generation. External knowledge grounds language models in facts.

#28

Dense Passage Retrieval

Learning embeddings for questions and passages. Dual encoders and contrastive learning beat BM25.

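The dual-encoder objective behind DPR-style retrievers fits in a few lines: every other passage in the batch serves as a negative for each question. This is a minimal sketch; the 2-D "embeddings" are toy stand-ins for learned encoder outputs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(q_emb, p_emb):
    """In-batch negatives: passage j is a negative for question i when i != j."""
    losses = []
    for i, q in enumerate(q_emb):
        scores = [dot(q, p) for p in p_emb]           # similarity to every passage in the batch
        log_z = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_z - scores[i])              # -log softmax at the aligned passage
    return sum(losses) / len(losses)

# Toy embeddings: each question points at its own passage.
questions = [[1.0, 0.0], [0.0, 1.0]]
passages  = [[0.9, 0.1], [0.1, 0.9]]
print(in_batch_contrastive_loss(questions, passages))
```

Shuffling the passages raises the loss, which is exactly the signal the contrastive objective trains on.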
#27

Multi-Token Prediction

Predicting multiple tokens at once improves sample efficiency and enables speculative decoding.

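The training objective reduces to summing cross-entropy over one prediction head per future offset. A minimal sketch, with hand-written probability rows standing in for model outputs:

```python
import math

def multi_token_loss(probs_per_head, targets):
    """Average cross-entropy over k heads, head i predicting the token at offset i+1."""
    total = 0.0
    for head_probs, target in zip(probs_per_head, targets):
        total += -math.log(head_probs[target])
    return total / len(targets)

# Two heads predicting the next two tokens of a 3-token vocabulary.
heads = [[0.7, 0.2, 0.1],   # head 1: distribution over token t+1
         [0.1, 0.8, 0.1]]   # head 2: distribution over token t+2
print(multi_token_loss(heads, [0, 1]))  # targets: token 0, then token 1
```

Because the extra heads draft several tokens per forward pass, the same machinery doubles as a drafter for speculative decoding.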
#26

CNN Fundamentals (CS231n)

Convolutional layers, pooling, ReLU, and backpropagation. Stanford's CS231n course distilled.

#25

Kolmogorov Complexity

The shortest program that outputs a string. Incompressibility equals randomness.

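True Kolmogorov complexity is uncomputable, but an off-the-shelf compressor gives a crude upper bound that makes the incompressibility-equals-randomness intuition concrete: a repetitive string shrinks to almost nothing, while random bytes barely shrink at all.

```python
import random
import zlib

def compressed_size(s: bytes) -> int:
    """zlib output length: a computable upper-bound proxy for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

regular = b"ab" * 5000                                    # a short program could print this
random.seed(0)
incompressible = bytes(random.randrange(256) for _ in range(10000))  # no shorter description

print(compressed_size(regular), compressed_size(incompressible))
```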
#24

Machine Superintelligence

Formal definitions of machine intelligence. Recursive self-improvement and the path to superintelligence.

#23

The MDL Principle

Minimum Description Length balances model complexity against fit. Occam's razor made mathematical.

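The two-part code at the heart of MDL can be sketched numerically: total description length is the bits to state the model plus the bits to encode what the model gets wrong. The byte-sized model cost and the crude residual code below are illustrative assumptions, not a real coding scheme.

```python
def description_length(data, model_value):
    """Two-part code: bits for the model plus bits for the residuals."""
    model_bits = 8                                         # assume one byte names the model
    residual_bits = sum(2 * abs(x - model_value) + 1 for x in data)  # crude signed code
    return model_bits + residual_bits

data = [100, 101, 99, 100, 102, 98, 100]
raw_bits = 8 * len(data)                                   # baseline: send each value in 8 bits
best = min(range(256), key=lambda m: description_length(data, m))
print(best, description_length(data, best), raw_bits)
```

The single-parameter model plus small residuals beats sending the raw data, which is Occam's razor stated as a word count.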
#22

Scaling Laws

Power laws for neural language models. How performance scales with compute, data, and parameters.

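A power law is a straight line in log-log space, so its exponent falls out of ordinary least squares on the logs. A sketch on synthetic points placed exactly on L = 10·N^(-0.5):

```python
import math

def fit_power_law(ns, losses):
    """Fit loss ~ a * n^(-b) by least squares in log-log space."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = -slope                      # loss falls as n grows, so the slope is negative
    a = math.exp(my + b * mx)
    return a, b

ns = [1e6, 1e7, 1e8, 1e9]
losses = [10 * n ** -0.5 for n in ns]   # synthetic, exactly on the power law
a, b = fit_power_law(ns, losses)
print(a, b)
```

The same fit on real training runs is what yields the compute, data, and parameter exponents the paper reports.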
#21

CTC Loss

Training speech models without frame-level alignment. The blank token and alignment-free learning.

#20

Neural Turing Machines

Differentiable computers with external memory banks. Content and location-based addressing.

#19

The Coffee Automaton

Why does coffee mix but never unmix? Entropy, coarse-graining, and the arrow of time.

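Coarse-graining is the key move: track only the cream fraction per block, not the exact microstate, and entropy rises monotonically under mixing. A toy 1-D lattice version, with random adjacent swaps standing in for diffusion:

```python
import math
import random

def coarse_entropy(cells, block=10):
    """Sum of binary entropies of the cream fraction in each coarse block."""
    h = 0.0
    for i in range(0, len(cells), block):
        p = sum(cells[i:i + block]) / block
        if 0 < p < 1:
            h += -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return h

random.seed(0)
cells = [1] * 50 + [0] * 50          # cream sitting on top of coffee
start = coarse_entropy(cells)        # every block is pure: entropy zero
for _ in range(20000):               # random adjacent swaps = diffusion
    i = random.randrange(len(cells) - 1)
    cells[i], cells[i + 1] = cells[i + 1], cells[i]
print(start, coarse_entropy(cells))
```

Running the swaps backward is just as legal microscopically, yet the coarse-grained entropy you would measure still climbs; that asymmetry is the arrow of time.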
#18

Relational Recurrent Neural Networks

Memory slots that attend to each other enable multi-step reasoning.

#17

Variational Autoencoders

Learning to generate data by encoding it into structured latent spaces. ELBO and the reparameterization trick.

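The reparameterization trick is one line: sample noise first, then shift and scale it, so gradients flow through mu and sigma. A 1-D sketch with the analytic KL term of the ELBO:

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps: sampling stays differentiable w.r.t. mu and log_var."""
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0,1)) for a 1-D Gaussian posterior."""
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = random.Random(0)
mu, log_var = 0.5, math.log(0.25)
z = reparameterize(mu, log_var, rng)
print(z, kl_to_standard_normal(mu, log_var))
```

The ELBO is reconstruction likelihood minus this KL; the KL is what pressures the latent space into the structured shape that makes generation work.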
#16

Relational Reasoning

Relation Networks compare all object pairs to answer questions about relationships.

#15

Identity Mappings in Deep Residual Networks

Moving activation before convolution enables training 1000-layer networks.

#14

Bahdanau Attention

Neural machine translation by jointly learning to align and translate. The original attention mechanism.

#13

Attention Is All You Need

The Transformer architecture replaced recurrence with self-attention. Foundation of GPT, BERT, and modern AI.

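The core operation, scaled dot-product attention, is short enough to write out in full: softmax(QK^T / sqrt(d)) V. A dependency-free sketch on tiny hand-picked matrices:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: each query takes a softmax-weighted mix of values."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                     # one query, aligned with the first key
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

The query matches the first key, so the output leans toward the first value row: a weighted lookup, with no recurrence anywhere.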
#12

Graph Neural Networks

Message passing on graphs. Nodes aggregate information from neighbors to learn representations.

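One round of the simplest aggregation scheme, mean pooling over neighbors, shows the mechanism: information spreads one hop per round. The path graph and features below are illustrative.

```python
def message_passing_round(features, adjacency):
    """One round: each node averages its own features with its neighbors'."""
    new = []
    for node, feat in enumerate(features):
        stacked = [feat] + [features[j] for j in adjacency[node]]
        new.append([sum(col) / len(stacked) for col in zip(*stacked)])
    return new

# Path graph 0 - 1 - 2 with 1-D features; only node 2 starts with signal.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = [[0.0], [0.0], [3.0]]
print(message_passing_round(features, adjacency))   # node 1 now sees node 2's signal
```

After one round node 1 has picked up signal from node 2 while node 0 has not; stacking rounds (with learned transforms in a real GNN) widens each node's receptive field over the graph.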
#11

Dilated Convolutions

Exponentially expanding receptive fields without losing resolution. WaveNet's secret for audio generation.

#10

Deep Residual Learning (ResNet)

Skip connections let gradients flow through 152 layers. Learning residuals instead of direct mappings.

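The residual formulation is one line, y = x + f(x), and its payoff is visible even in a toy: when f is near zero, the block is the identity, so stacking many blocks cannot make the signal worse.

```python
def residual_block(x, f):
    """y = x + f(x): the block learns a residual; identity is the default behavior."""
    return [xi + fi for xi, fi in zip(x, f(x))]

# With f at zero (a common effective initialization), the input passes through
# unchanged, no matter how many blocks are stacked.
zero_f = lambda x: [0.0] * len(x)
x = [1.0, -2.0, 3.0]
for _ in range(152):                 # a ResNet-152-deep stack of identity blocks
    x = residual_block(x, zero_f)
print(x)
```

In the plain (non-residual) formulation each layer must relearn the identity before it can refine anything, which is what made very deep plain nets fail to train.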
#9

GPipe

Pipeline parallelism for training giant neural networks. Micro-batches keep accelerators busy.

#8

Order Matters: Seq2Seq for Sets

Sets have no order, but neural networks need sequences. How to handle permutation invariance.

#7

AlexNet

Krizhevsky, Sutskever, and Hinton's 2012 ImageNet victory proved deep learning could outperform hand-engineered computer vision.

#6

Pointer Networks

Vinyals, Fortunato, and Jaitly repurposed attention to point at input positions instead of blending hidden states.

#5

Keeping Neural Networks Simple

Hinton and van Camp showed that penalizing weight complexity leads to better generalization.

#4

Recurrent Neural Network Regularization

Zaremba, Sutskever, and Vinyals figured out how to apply dropout to LSTMs without breaking them.

#3

Understanding LSTM Networks

Christopher Olah's 2015 post explained LSTM gates with clarity that textbooks lacked.

#2

The Unreasonable Effectiveness of RNNs

Karpathy's famous 2015 post showed RNNs could generate Shakespeare and Linux code by predicting one character at a time.

#1

Why Coffee Mixes But Never Unmixes

Scott Aaronson's First Law of Complexodynamics explains why complexity rises, peaks, then falls.

