RNNs process sequences one step at a time. For a 1000-token document, that means 1000 sequential operations. Transformers process all tokens simultaneously. This parallelism enabled the scale that produced GPT-4 and Claude.
Why Sutskever Included This
Sutskever co-founded OpenAI, which built GPT on the transformer architecture. This 2017 paper from Google displaced recurrent networks for most sequence tasks. Every major language model today uses transformers. The paper has over 100,000 citations.
The RNN Bottleneck
RNNs (including LSTMs from Paper #3) pass information through a hidden state that updates at each timestep. This creates two problems:
Sequential processing: You can't compute step 100 before step 99. Training is slow because you can't parallelize across time.
Long-range dependencies: Information must pass through many steps to connect distant tokens. Despite gating, gradients still struggle over hundreds of steps.
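The contrast can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; shapes and weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.standard_normal((seq_len, d))   # one token embedding per row

# RNN: each hidden state depends on the previous one,
# so the loop cannot be parallelized across timesteps.
W_x = rng.standard_normal((d, d)) * 0.1
W_h = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(seq_len):                # 6 tokens -> 6 sequential steps
    h = np.tanh(x[t] @ W_x + h @ W_h)

# Self-attention: every pairwise interaction is one matrix product,
# computed for all positions at once.
scores = x @ x.T                        # (seq_len, seq_len) in a single op
```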
Self-Attention
Attention computes relationships between all pairs of positions directly. For each token, compute how much to attend to every other token, then take a weighted average.
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Q (queries), K (keys), and V (values) are linear projections of the input. The dot product QKᵀ measures similarity. Softmax normalizes to weights. The output is a weighted sum of values.
Every position connects directly to every other position. No information bottleneck. And the whole computation is a matrix multiplication, easily parallelized on GPUs.
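The formula above translates almost directly into code. A minimal NumPy sketch of scaled dot-product attention (the function name and test shapes are illustrative, not from the paper):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8
Q, K, V = (rng.standard_normal((n, dim)) for dim in (d_k, d_k, d_v))
out = attention(Q, K, V)   # one output vector per position: shape (5, 8)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.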
Multi-Head Attention
One attention pattern might focus on syntactic relationships (subject-verb). Another might focus on semantic similarity. Multi-head attention runs several attention operations in parallel, each with different learned projections.
With 8 heads, the model learns 8 different ways to relate tokens. Concatenate the results and project back to the original dimension.
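The split-attend-concatenate pattern can be sketched as follows. This is a simplified NumPy illustration; the projection matrices here are random stand-ins for learned weights:

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """x: (n, d_model). Each head attends in its own learned subspace."""
    n, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v                  # (n, d_model) each
    # Split the model dimension into n_heads independent subspaces.
    def split(m):
        return m.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)            # (heads, n, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                        # softmax per head
    heads = w @ Vh                                       # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                  # project back

rng = np.random.default_rng(0)
d_model, n_heads = 16, 8
x = rng.standard_normal((4, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(x, *Ws, n_heads=n_heads)      # shape (4, 16)
```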
Positional Encoding
Self-attention treats positions symmetrically. Without position information, "The cat sat on the mat" and "mat the on sat cat the" produce the same set of attention outputs, just reordered; the model cannot distinguish word order.
The paper adds sinusoidal position encodings to input embeddings. Different frequencies encode different positions, and the model can learn relative positions from these signals. Later work introduced rotary embeddings and other schemes.
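The sinusoidal scheme from the paper can be sketched in a few lines of NumPy (the function name is illustrative):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dims: (1, d/2)
    angles = pos / np.power(10000.0, i / d_model)  # one frequency per pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(50, 16)   # added elementwise to token embeddings
```

Each dimension pair oscillates at a different frequency, so every position gets a unique fingerprint, and fixed offsets between positions correspond to linear transformations of the encoding.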
The Full Architecture
Encoder: Stack of self-attention layers plus feed-forward networks. Each layer has residual connections (Paper #10) and layer normalization. The encoder sees all positions bidirectionally.
Decoder: Similar stack, but with masked self-attention. When generating token t, the decoder can only attend to positions before t. This prevents cheating by looking at future tokens.
Cross-attention: Decoder layers also attend to encoder outputs, connecting the input sequence to the output sequence.
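The decoder's causal mask can be sketched like this. A minimal NumPy illustration of masked self-attention, not a full decoder layer:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position t attends only to positions <= t."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)          # block future tokens
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                     # masked entries get 0
    return w @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((4, 8))
out = causal_attention(Q, K, V)
# Position 0 can attend only to itself, so its output is exactly V[0].
```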
Architectural Variants
Encoder-only (BERT): Bidirectional understanding. Good for classification, extraction, and embedding.
Decoder-only (GPT): Autoregressive generation. Good for text generation. Simpler to scale.
Encoder-decoder: The original design. Best for sequence-to-sequence tasks like translation.
Modern LLMs use decoder-only architectures. The simplicity scales better, and autoregressive generation handles most tasks.
Scaling Properties
Transformers scaled where RNNs couldn't. Parallelism enables training on massive datasets. Paper #22 (Scaling Laws) quantifies how performance improves predictably with scale.
The attention mechanism has O(n²) complexity in sequence length. Long contexts require optimizations like sparse attention, flash attention, or sliding windows.
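The quadratic cost is easy to see with back-of-the-envelope arithmetic (the memory figure assumes float16 scores and is purely illustrative):

```python
def attention_scores(n):
    """Number of pairwise attention scores for a length-n sequence."""
    return n * n

# Doubling the context length quadruples the score matrix:
assert attention_scores(2_000) == 4 * attention_scores(1_000)

# At 2 bytes per float16 score, a 32k context needs roughly 2 GB
# for a single head's score matrix, before any optimization:
bytes_needed = attention_scores(32_000) * 2   # 2,048,000,000 bytes
```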
Why It Works
Inductive bias: Attention assumes any token might be relevant to any other. This flexibility suits language, where long-range dependencies are common.
Depth and width: Feed-forward layers between attention layers add capacity. Each layer refines representations.
Training efficiency: All positions process in parallel. Training is hours instead of weeks.
Connection to Earlier Papers
Paper #14 (Bahdanau Attention) introduced attention for RNNs. This paper eliminates the RNN entirely. Paper #6 (Pointer Networks) used attention as a selection mechanism; transformers generalize this throughout.
Further Reading
More in This Series
Part of a series on Ilya Sutskever's recommended 30 papers.