In May 2015, Andrej Karpathy trained a recurrent neural network on Shakespeare and watched it generate superficially convincing Elizabethan verse. The network predicted one character at a time, with no grammar rules or vocabulary lists—just pattern recognition on 4.4MB of raw text.
Why Sutskever Included This
This isn't a research paper—it's a blog post. Yet Ilya Sutskever placed it second on his list of 30 essential readings. It demonstrates how sequence modeling works, the same principle that underpins GPT and modern speech recognition.
The key reframe from Karpathy: "If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs." A feedforward network maps inputs to outputs. An RNN learns algorithms—stateful procedures that maintain context across a sequence.
The Architecture
An RNN processes sequences by maintaining a hidden state that gets updated at each step. Feed it the letter 'h', it updates its state. Feed it 'e', it updates again, now encoding information about "he". Feed it 'l', 'l', 'o'—by the end, the hidden state represents the entire history of inputs.
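That recurrence is just a few matrix multiplies. A minimal NumPy sketch of the state update, with a toy vocabulary and weight sizes chosen for illustration (not Karpathy's actual setup):

```python
import numpy as np

vocab = sorted(set("hello world"))        # toy character vocabulary (illustrative)
char_to_ix = {c: i for i, c in enumerate(vocab)}

hidden_size = 8                           # tiny hidden state, just for the sketch
W_xh = np.random.randn(hidden_size, len(vocab)) * 0.01   # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (the recurrence)
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                 # hidden state starts empty
for ch in "hello":
    x = np.zeros(len(vocab))
    x[char_to_ix[ch]] = 1.0               # one-hot encode the character
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # new state depends on the input AND the old state

# h now summarizes everything the network has seen: "hello"
```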
The training objective is simple: predict the next character. Given "hell", predict "o". Given "hello ", predict "w" (for "world"). The network learns by seeing millions of these examples and adjusting weights to minimize prediction error.
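The training data is just the text paired with itself shifted by one character. A sketch of those (context, target) pairs on a toy string:

```python
text = "hello world"   # stand-in for the 4.4MB Shakespeare file
pairs = [(text[:i + 1], text[i + 1]) for i in range(len(text) - 1)]

for context, target in pairs[:5]:
    print(repr(context), "->", repr(target))
# 'h' -> 'e'
# 'he' -> 'l'
# 'hel' -> 'l'
# 'hell' -> 'o'
# 'hello' -> ' '
```

In practice the model sees fixed-length chunks and a cross-entropy loss over the character vocabulary, but the supervision signal is exactly this shift-by-one structure.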
In practice, vanilla RNNs struggle with long sequences due to vanishing gradients. LSTMs (Long Short-Term Memory networks) solve this with gating mechanisms that control information flow. Karpathy used a 2-layer LSTM with 512 hidden units, about 3.5 million parameters. Tiny by modern standards.
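A sketch of a comparably sized model in PyTorch, assuming one-hot character inputs and a 65-character vocabulary (roughly Shakespeare's character set); the layer sizes are the ones quoted above, and the parameter count lands near the same figure:

```python
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size=65, hidden_size=512, num_layers=2):
        super().__init__()
        # one-hot characters feed straight into a stacked LSTM
        self.lstm = nn.LSTM(vocab_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)   # logits over the next character

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)   # out: (batch, seq_len, hidden_size)
        return self.head(out), state       # logits: (batch, seq_len, vocab_size)

model = CharLSTM()
print(sum(p.numel() for p in model.parameters()))   # ~3.3M parameters with these sizes
```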
Shakespeare from Scratch
Karpathy concatenated all of Shakespeare into a single text file and trained his network. The network learned to:
- Open and close quotes correctly
- Format dialogue with character names and stage directions
- Structure verse with appropriate line lengths
Actual output from the trained model:
PANDARUS:
Alas, I think he shall be come approached and the day
When little wisdom have been sure, to be more of the sight,
For the other composure of thy sight.
KING RICHARD III:
If I be in the strength, and thy state;
And that our ships and shall and what we shall be here.
Nonsense, but grammatically structured nonsense. The network learned the shape of Shakespeare without understanding meaning. This is pattern recognition operating on pure syntax.
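The samples above come from running the trained network forward one character at a time: take its output logits, turn them into a probability distribution, draw a character, feed it back in. A minimal sketch of that sampling step with a temperature knob (the logits below are invented for illustration):

```python
import numpy as np

def sample_char(logits, temperature=1.0):
    """Draw one character index from the softmax of the model's output logits."""
    scaled = logits / temperature          # lower temperature -> more conservative, repetitive text
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# hypothetical logits over a 5-character vocabulary
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])
print(sample_char(logits, temperature=0.8))
```

Generation just repeats this loop: append the sampled character, feed it back to the network, sample again.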
Linux Kernel Code
Karpathy didn't stop at literature. He trained on the Linux kernel source code—474MB of C. The network learned to generate syntactically plausible code:
- Proper bracket matching and indentation
- Correct comment syntax and preprocessor directives
- Plausible function signatures
The code doesn't compile. Variable names are invented. Function calls reference nothing. But the structure is correct. The network learned C's grammar from raw characters.
Wikipedia and LaTeX
Wikipedia training produced markdown formatting, XML tags, and plausible URLs. LaTeX training generated valid mathematical notation with proper environment syntax (\begin{theorem}...\end{theorem}).
Each domain has its own patterns. The RNN discovered them all without being told what a URL is, what LaTeX math mode means, or how XML tags nest.
What the Neurons Learned
Karpathy visualized individual neuron activations and found interpretable features:
- One neuron activated inside quotes and deactivated outside
- Another tracked position within a line for line-length patterns
Nobody programmed these detectors. They emerged from the pressure to predict the next character accurately. If knowing you're inside a quote helps predict the next character, the network will learn to track quotes.
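A hedged sketch of that kind of probe, using an untrained stand-in LSTM (Karpathy inspected his trained Torch model; the unit index and text here are arbitrary assumptions):

```python
import torch
import torch.nn as nn

text = 'He said "to be or not to be" and left.'
vocab = sorted(set(text))
ix = {c: i for i, c in enumerate(vocab)}

lstm = nn.LSTM(len(vocab), 128, batch_first=True)   # stand-in; the real probe used the trained char-rnn
x = torch.zeros(1, len(text), len(vocab))
for t, c in enumerate(text):
    x[0, t, ix[c]] = 1.0                            # one-hot encode each character

with torch.no_grad():
    out, _ = lstm(x)                                # hidden state at every character: (1, seq_len, 128)

unit = 7                                            # pick one hidden unit to watch
for c, h in zip(text, out[0, :, unit]):
    print(f"{c!r}: {h.item():+.2f}")                # a quote-tracking unit would flip sign inside quotes
```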
The Deeper Lesson
Simple objectives on raw data can produce complex, interpretable behavior. The network wasn't told about English grammar, programming syntax, or document structure. It learned these from next-character prediction alone.
GPT models are the same idea scaled up: predict the next token, train on internet-scale data, watch complex capabilities emerge. The char-rnn was proof of concept.
Why This Matters for AI Development
Simple objectives can yield complex behavior. Next-character prediction seems trivial. Yet it forced the network to learn grammar, style, formatting, and document structure. Modern LLMs use the same principle: next-token prediction produces reasoning, coding, and conversation.
Interpretability emerges from training. The quote-tracking neuron wasn't designed. It appeared because tracking quotes improved predictions. This suggests neural networks can develop human-interpretable internal representations when those representations are useful.
Scale matters but isn't everything. Karpathy's 3.5M parameter model learned impressive patterns. Modern models have billions of parameters. But the core insight—that sequence prediction learns structure—holds across scales.
From char-rnn to GPT
The trajectory from Karpathy's 2015 blog post to ChatGPT is direct. Replace characters with tokens (subword units). Replace LSTMs with Transformers (better at long-range dependencies). Scale from 4.4MB of Shakespeare to terabytes of internet text. The principle remains: predict the next token, learn the patterns of language.
Karpathy went on to become a founding member of OpenAI, where next-token prediction was eventually scaled into GPT-2 and beyond.
More in This Series
Part of a series on Ilya Sutskever's recommended 30 papers, connecting each to practical AI development.