In May 2015, Andrej Karpathy trained a recurrent neural network on Shakespeare and watched it generate superficially convincing Elizabethan verse. The network predicted one character at a time, with no grammar rules or vocabulary lists—just pattern recognition on 4.4MB of raw text.
Why Sutskever Included This
This isn't a research paper—it's a blog post. Yet Ilya Sutskever placed it second on his list of 30 essential readings. It demonstrates how sequence modeling works, the same principle that underpins GPT and modern speech recognition.
The key reframe from Karpathy: "If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs." A feedforward network maps inputs to outputs. An RNN learns algorithms—stateful procedures that maintain context across a sequence.
The Architecture
An RNN processes sequences by maintaining a hidden state that gets updated at each step. Feed it the letter 'h', it updates its state. Feed it 'e', it updates again, now encoding information about "he". Feed it 'l', 'l', 'o'—by the end, the hidden state represents the entire history of inputs.
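That recurrence is just a few matrix multiplies. A minimal NumPy sketch of the state update, with a toy vocabulary and weight sizes chosen for illustration (not Karpathy's actual setup):

```python
import numpy as np

vocab = sorted(set("hello world"))        # toy character vocabulary (illustrative)
char_to_ix = {c: i for i, c in enumerate(vocab)}

hidden_size = 8                           # tiny hidden state, just for the sketch
W_xh = np.random.randn(hidden_size, len(vocab)) * 0.01   # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (the recurrence)
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                 # hidden state starts empty
for ch in "hello":
    x = np.zeros(len(vocab))
    x[char_to_ix[ch]] = 1.0               # one-hot encode the character
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # new state depends on the input AND the old state

# h now summarizes everything the network has seen: "hello"
```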
The training objective is simple: predict the next character. Given "hell", predict "o". Given "hello ", predict "w" (for "world"). The network learns by seeing millions of these examples and adjusting weights to minimize prediction error.
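The training data is just the text paired with itself shifted by one character. A sketch of those (context, target) pairs on a toy string:

```python
text = "hello world"   # stand-in for the 4.4MB Shakespeare file
pairs = [(text[:i + 1], text[i + 1]) for i in range(len(text) - 1)]

for context, target in pairs[:5]:
    print(repr(context), "->", repr(target))
# 'h' -> 'e'
# 'he' -> 'l'
# 'hel' -> 'l'
# 'hell' -> 'o'
# 'hello' -> ' '
```

In practice the model sees fixed-length chunks and a cross-entropy loss over the character vocabulary, but the supervision signal is exactly this shift-by-one structure.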
In practice, vanilla RNNs struggle with long sequences due to vanishing gradients. LSTMs (Long Short-Term Memory networks) solve this with gating mechanisms that control information flow. Karpathy used a 2-layer LSTM with 512 hidden units, about 3.5 million parameters. Tiny by modern standards.
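A sketch of a comparably sized model in PyTorch, assuming one-hot character inputs and a 65-character vocabulary (roughly Shakespeare's character set); the layer sizes are the ones quoted above, and the parameter count lands near the same figure:

```python
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size=65, hidden_size=512, num_layers=2):
        super().__init__()
        # one-hot characters feed straight into a stacked LSTM
        self.lstm = nn.LSTM(vocab_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)   # logits over the next character

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)   # out: (batch, seq_len, hidden_size)
        return self.head(out), state       # logits: (batch, seq_len, vocab_size)

model = CharLSTM()
print(sum(p.numel() for p in model.parameters()))   # ~3.3M parameters with these sizes
```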
Shakespeare from Scratch
Karpathy concatenated all of Shakespeare into a single text file and trained his network. The network learned to:
- Open and close quotes correctly
- Format dialogue with character names and stage directions
- Structure verse with appropriate line lengths
Actual output from the trained model:
PANDARUS:
Alas, I think he shall be come approached and the day
When little wisdom have been sure, to be more of the sight,
For the other composure of thy sight.
KING RICHARD III:
If I be in the strength, and thy state;
And that our ships and shall and what we shall be here.
Nonsense, but grammatically structured nonsense. The network learned the shape of Shakespeare without understanding meaning. This is pattern recognition operating on pure syntax.
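The samples above come from running the trained network forward one character at a time: take its output logits, turn them into a probability distribution, draw a character, feed it back in. A minimal sketch of that sampling step with a temperature knob (the logits below are invented for illustration):

```python
import numpy as np

def sample_char(logits, temperature=1.0):
    """Draw one character index from the softmax of the model's output logits."""
    scaled = logits / temperature          # lower temperature -> more conservative, repetitive text
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# hypothetical logits over a 5-character vocabulary
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])
print(sample_char(logits, temperature=0.8))
```

Generation just repeats this loop: append the sampled character, feed it back to the network, sample again.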
Linux Kernel Code
Karpathy didn't stop at literature. He trained on the Linux kernel source code—474MB of C. The network learned to generate syntactically plausible code:
- Proper bracket matching and indentation
- Correct comment syntax and preprocessor directives
- Plausible function signatures
The code doesn't compile. Variable names are invented. Function calls reference nothing. But the structure is correct. The network learned C's grammar from raw characters.
Wikipedia and LaTeX
Wikipedia training produced markdown formatting, XML tags, and plausible URLs. LaTeX training generated valid mathematical notation with proper environment syntax (\begin{theorem}...\end{theorem}).
Each domain has its own patterns. The RNN discovered them all without being told what a URL is, what LaTeX math mode means, or how XML tags nest.
What the Neurons Learned
Karpathy visualized individual neuron activations and found interpretable features:
- One neuron activated inside quotes and deactivated outside
- Another tracked position within a line for line-length patterns
Nobody programmed these detectors. They emerged from the pressure to predict the next character accurately. If knowing you're inside a quote helps predict the next character, the network will learn to track quotes.
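A hedged sketch of that kind of probe, using an untrained stand-in LSTM (Karpathy inspected his trained Torch model; the unit index and text here are arbitrary assumptions):

```python
import torch
import torch.nn as nn

text = 'He said "to be or not to be" and left.'
vocab = sorted(set(text))
ix = {c: i for i, c in enumerate(vocab)}

lstm = nn.LSTM(len(vocab), 128, batch_first=True)   # stand-in; the real probe used the trained char-rnn
x = torch.zeros(1, len(text), len(vocab))
for t, c in enumerate(text):
    x[0, t, ix[c]] = 1.0                            # one-hot encode each character

with torch.no_grad():
    out, _ = lstm(x)                                # hidden state at every character: (1, seq_len, 128)

unit = 7                                            # pick one hidden unit to watch
for c, h in zip(text, out[0, :, unit]):
    print(f"{c!r}: {h.item():+.2f}")                # a quote-tracking unit would flip sign inside quotes
```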
The Deeper Lesson
Simple objectives on raw data can produce complex, interpretable behavior. The network wasn't told about English grammar, programming syntax, or document structure. It learned these from next-character prediction alone.
GPT models are the same idea scaled up: predict the next token, train on internet-scale data, watch complex capabilities emerge. The char-rnn was proof of concept.
Why This Matters for AI Development
Simple objectives can yield complex behavior. Next-character prediction seems trivial. Yet it forced the network to learn grammar, style, formatting, and document structure. Modern LLMs use the same principle: next-token prediction produces reasoning, coding, and conversation.
Interpretability emerges from training. The quote-tracking neuron wasn't designed. It appeared because tracking quotes improved predictions. This suggests neural networks can develop human-interpretable internal representations when those representations are useful.
Scale matters but isn't everything. Karpathy's 3.5M parameter model learned impressive patterns. Modern models have billions of parameters. But the core insight—that sequence prediction learns structure—holds across scales.
From char-rnn to GPT
The trajectory from Karpathy's 2015 blog post to ChatGPT is direct. Replace characters with tokens (subword units). Replace LSTMs with Transformers (better at long-range dependencies). Scale from 4.4MB of Shakespeare to terabytes of internet text. The principle remains: predict the next token, learn the patterns of language.
Karpathy went on to become a founding member of OpenAI, where next-token prediction was eventually scaled into GPT-2 and beyond.
More in This Series
Part of a series on Ilya Sutskever's recommended 30 papers, connecting each to practical AI development.