Sutskever 30 Deep Dives • Paper #13

Attention Is All You Need

The Transformer Architecture

January 31, 2026 • 9 min read

RNNs process sequences one step at a time. For a 1000-token document, that means 1000 sequential operations. Transformers process all tokens simultaneously. This parallelism enabled the scale that produced GPT-4 and Claude.

[Figure: Transformer attention patterns with query, key, and value matrices]
Self-attention lets each position attend to all others. Multi-head attention captures different types of relationships in parallel.

Why Sutskever Included This

Sutskever co-founded OpenAI, which built GPT on the transformer architecture. This 2017 paper from Google displaced recurrent networks for most sequence tasks. Every major language model today uses transformers. The paper has over 100,000 citations.

The RNN Bottleneck

RNNs (including LSTMs from Paper #3) pass information through a hidden state that updates at each timestep. This creates two problems:

Sequential processing: You can't compute step 100 before step 99. Training is slow because you can't parallelize across time.

Long-range dependencies: Information must pass through many steps to connect distant tokens. Despite gating, gradients still struggle over hundreds of steps.

Self-Attention

Attention computes relationships between all pairs of positions directly. For each token, compute how much to attend to every other token, then take a weighted average.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q (queries), K (keys), and V (values) are linear projections of the input. The dot product QKᵀ measures the similarity between each query and each key. Dividing by √dₖ keeps the dot products from growing with dimension and saturating the softmax. Softmax normalizes the scores into weights, and the output is a weighted sum of the values.

Every position connects directly to every other position. No information bottleneck. And the whole computation is a matrix multiplication, easily parallelized on GPUs.
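The formula above can be sketched directly in NumPy. This is an illustrative toy, not the paper's implementation: the projection matrices Wq, Wk, and Wv are random and untrained, and batching and masking are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): each query scored against each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                              # 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(n, d))
# Q, K, V as linear projections of the input (random here, learned in practice)
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (4, 8)
```

Note that the whole computation is two matrix multiplications and a softmax, which is why it parallelizes so well on GPUs.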

Multi-Head Attention

One attention pattern might focus on syntactic relationships (subject-verb). Another might focus on semantic similarity. Multi-head attention runs several attention operations in parallel, each with different learned projections.

With 8 heads, the model learns 8 different ways to relate tokens. Concatenate the results and project back to the original dimension.
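A minimal NumPy sketch of the idea, with hypothetical per-head projection matrices and an output projection Wo (names and shapes are my assumptions; real implementations compute all heads in one batched matmul rather than a Python loop):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """Run num_heads attention ops in parallel, concatenate, project back."""
    n, d = X.shape
    d_head = d // num_heads
    heads = []
    for h in range(num_heads):
        # each head has its own learned projections (random here for illustration)
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]   # each (n, d_head)
        scores = Q @ K.T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V)
    return np.concatenate(heads, axis=-1) @ Wo       # back to (n, d)

rng = np.random.default_rng(0)
n, d, H = 4, 16, 8                                   # 8 heads of dimension 2
X = rng.normal(size=(n, d))
Wq = rng.normal(size=(H, d, d // H))
Wk = rng.normal(size=(H, d, d // H))
Wv = rng.normal(size=(H, d, d // H))
Wo = rng.normal(size=(d, d))
out = multi_head_attention(X, H, Wq, Wk, Wv, Wo)
print(out.shape)  # (4, 16)
```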

Positional Encoding

Self-attention treats positions symmetrically. Without position information, "The cat sat on the mat" and "mat the on sat cat the" produce the same outputs, just reordered.

The paper adds sinusoidal position encodings to input embeddings. Different frequencies encode different positions, and the model can learn relative positions from these signals. Later work introduced rotary embeddings and other schemes.
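The paper's sinusoidal scheme is easy to reproduce. A minimal sketch: sines fill the even columns and cosines the odd ones, with wavelengths increasing geometrically across the dimension.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encodings from the paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(n_positions)[:, None]       # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(100, 64)
# added to the token embeddings before the first layer: X = embeddings + pe
```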

The Full Architecture

Encoder: Stack of self-attention layers plus feed-forward networks. Each layer has residual connections (Paper #10) and layer normalization. The encoder sees all positions bidirectionally.

Decoder: Similar stack, but with masked self-attention. When generating token t, the decoder can only attend to positions before t. This prevents cheating by looking at future tokens.
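The causal mask can be sketched as an upper-triangular boolean matrix whose blocked entries are set to negative infinity before the softmax, so they receive zero weight. A toy example with uniform scores:

```python
import numpy as np

def causal_mask(n):
    """Upper-triangular mask: position t may attend only to positions <= t."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)  # True = blocked

def masked_softmax(scores, mask):
    scores = np.where(mask, -np.inf, scores)          # blocked -> zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # uniform scores, for illustration
w = masked_softmax(scores, causal_mask(4))
print(np.round(w, 2))
# row t averages uniformly over positions 0..t; future positions get weight 0
```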

Cross-attention: Decoder layers also attend to encoder outputs, connecting the input sequence to the output sequence.

Architectural Variants

Encoder-only (BERT): Bidirectional understanding. Good for classification, extraction, and embedding.

Decoder-only (GPT): Autoregressive generation. Good for text generation. Simpler to scale.

Encoder-decoder: The original design. Best for sequence-to-sequence tasks like translation.

Most modern LLMs use decoder-only architectures. The simpler design scales well, and autoregressive generation handles most tasks.

Scaling Properties

Transformers scaled where RNNs couldn't. Parallelism enables training on massive datasets. Paper #22 (Scaling Laws) quantifies how performance improves predictably with scale.

The attention mechanism has O(n²) time and memory complexity in sequence length. Long contexts require optimizations such as sparse attention, FlashAttention, or sliding windows.
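A back-of-the-envelope sketch of why the quadratic term bites at long contexts. The assumptions here are mine, not the paper's: fp32 scores, a single head, batch size 1.

```python
# the n x n attention score matrix grows quadratically with context length
for n in (1_000, 10_000, 100_000):
    mb = n * n * 4 / 1e6   # fp32 = 4 bytes per score
    print(f"{n:>7} tokens -> {mb:,.0f} MB of attention scores")
```

Going from 10k to 100k tokens multiplies the score matrix by 100x, which is why long-context models avoid materializing it in full.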

Why It Works

Inductive bias: Attention assumes any token might be relevant to any other. This flexibility suits language, where long-range dependencies are common.

Depth and width: Feed-forward layers between attention layers add capacity. Each layer refines representations.

Training efficiency: All positions are processed in parallel, so training takes hours instead of weeks.

Connection to Earlier Papers

Paper #14 (Bahdanau Attention) introduced attention for RNNs. This paper eliminates the RNN entirely. Paper #6 (Pointer Networks) used attention as a selection mechanism; transformers generalize this throughout.


Part of a series on Ilya Sutskever's recommended 30 papers.
