Sutskever 30 Deep Dives • Paper #3

Understanding LSTM Networks

The Post That Made Deep Learning Accessible

January 20, 2026 • 8 min read

In August 2015, Christopher Olah published "Understanding LSTM Networks," explaining a complex architecture using diagrams and intuition instead of dense equations. The post has been cited thousands of times and translated into dozens of languages.

Figure: memory cells flowing through gates like a conveyor belt. The cell state runs through the entire chain, with gates controlling what gets added or removed.

Why Sutskever Included This

This is the third blog post in Sutskever's list (after Aaronson and Karpathy). Like the others, it's not a research paper. It's pedagogy. Sutskever values clear explanations of foundational concepts, not just novel results.

LSTMs were invented in 1997 by Hochreiter and Schmidhuber, but remained obscure for years. Papers described them with intimidating notation. Olah's post changed that by making the architecture visual and intuitive.

The Problem: Vanishing Gradients

Standard RNNs have a memory problem. They process sequences by passing hidden state from step to step, but gradients shrink exponentially during backpropagation. After 10-20 steps, the gradient signal effectively disappears. The network can't learn long-range dependencies.

Consider predicting the next word in: "I grew up in France... I speak fluent ___." The answer ("French") depends on information from many steps ago. A vanilla RNN forgets "France" by the time it reaches the blank.
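To make the shrinkage concrete, here is a minimal NumPy sketch (toy dimensions and a made-up weight scale, not code from Olah's post) that multiplies the per-step Jacobians of a vanilla RNN and watches the gradient norm collapse:

```python
import numpy as np

# Toy vanilla RNN: h_t = tanh(W @ h_{t-1} + x_t). The gradient of h_t with
# respect to h_0 is a product of per-step Jacobians diag(1 - h_t**2) @ W;
# when their norms sit below 1, the product shrinks exponentially.
rng = np.random.default_rng(0)
hidden = 32
W = rng.normal(scale=0.5 / np.sqrt(hidden), size=(hidden, hidden))

h = np.zeros(hidden)
grad = np.eye(hidden)                      # empty product: d h_0 / d h_0
for t in range(1, 101):
    h = np.tanh(W @ h + rng.normal(size=hidden))   # random input each step
    jacobian = np.diag(1.0 - h**2) @ W             # d h_t / d h_{t-1}
    grad = jacobian @ grad                         # now d h_t / d h_0
    if t % 20 == 0:
        print(f"step {t:3d}  ||d h_t / d h_0|| = {np.linalg.norm(grad):.2e}")
```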

The Solution: Cell State

LSTMs add a separate memory channel called the cell state. The cell state is like a conveyor belt running through the network. Information can flow along it unchanged, or be modified by gates at each step.

The cell state addresses the vanishing gradient problem because its update is element-wise: the old state is scaled by the forget gate and new content is added, instead of being pushed through a weight matrix and a squashing nonlinearity at every step. Gradients flowing backward along this path are modulated only by the gate values, so information placed on the conveyor belt at step 1 can still influence computations at step 100.
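A rough numerical sketch of that claim (my illustration, not from the post): along the cell-state path, the derivative of C_t with respect to C_{t-1} is just the forget gate, so the backward signal over many steps is a product of gate values the network controls rather than a product of squashed weight matrices.

```python
import numpy as np

# Along the cell state, C_t = f_t * C_{t-1} + i_t * C̃_t, so the per-step
# backward factor dC_t/dC_{t-1} is simply f_t (element-wise). Compare the
# product of 100 such factors with 100 typical vanilla-RNN factors.
rng = np.random.default_rng(0)
T = 100

forget_gates = rng.uniform(0.9, 1.0, size=T)   # gates set near "remember"
rnn_factors = rng.uniform(0.1, 0.5, size=T)    # tanh-squashed Jacobian scales

print("LSTM cell-state path:", np.prod(forget_gates))  # stays appreciable
print("vanilla RNN path:    ", np.prod(rnn_factors))   # effectively zero
```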

Three Gates

Gates are sigmoid layers that output values between 0 and 1. Zero means "block everything," one means "let everything through." The LSTM has three:

Forget gate. Looks at the current input and the previous hidden state, and outputs a number between 0 and 1 for each entry in the cell state, deciding what to throw away. Processing a new subject in a sentence? Forget the old subject's gender.

Input gate. Two parts: a sigmoid decides which values to update, and a tanh creates candidate values to add. Together they determine what new information enters the cell state.

Output gate. Filters the cell state to produce the hidden state (the actual output). Not everything stored in memory is relevant to the current prediction.

The Equations

The LSTM update at each timestep:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

C_t = f_t * C_{t-1} + i_t * C̃_t

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

h_t = o_t * tanh(C_t)

The equations look dense, but each line corresponds to one of the gates or the cell state update. Olah's contribution was showing how these pieces fit together visually.
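Translating the equations directly into code can make the bookkeeping easier to follow. Here is a minimal single-timestep sketch in NumPy (the dimensions and random initialization are placeholders, not from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM timestep, following the equations above.

    Each weight matrix acts on the concatenation [h_{t-1}, x_t].
    """
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)               # forget gate
    i_t = sigmoid(W_i @ z + b_i)               # input gate
    C_tilde = np.tanh(W_C @ z + b_C)           # candidate values
    C_t = f_t * C_prev + i_t * C_tilde         # new cell state
    o_t = sigmoid(W_o @ z + b_o)               # output gate
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    return h_t, C_t

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = []
for _ in range(4):                              # f, i, C, o
    params += [rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)),
               np.zeros(n_hid)]
h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):            # a 5-step sequence
    h, C = lstm_step(x, h, C, params)
print(h.shape, C.shape)                         # (8,) (8,)
```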

Variants

Olah covers LSTM variants in the original post. The most common is the GRU (Gated Recurrent Unit), which combines the forget and input gates into a single "update gate" and merges cell state with hidden state. Fewer parameters, similar performance on many tasks.
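For comparison, a GRU step under the same conventions (a sketch with placeholder shapes, not code from Olah's post): the update gate plays the combined forget-and-input role, and the hidden state doubles as the memory.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU timestep: the update gate z_t merges forget and input duties."""
    W_z, b_z, W_r, b_r, W_h, b_h = params
    zx = np.concatenate([h_prev, x_t])

    z_t = sigmoid(W_z @ zx + b_z)     # update gate: how much to rewrite
    r_t = sigmoid(W_r @ zx + b_r)     # reset gate: how much old state to use
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return (1.0 - z_t) * h_prev + z_t * h_tilde   # interpolate old and new
```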

Other variants add "peephole connections" (letting gates look at the cell state directly) or couple the forget and input gates (whatever we forget, we replace with new input). Research hasn't found consistent winners. Architecture choice depends on the task.
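The coupled variant, for instance, amounts to a one-line change to the cell-state update from the sketch above (my paraphrase in code): whatever fraction is forgotten is replaced one-for-one by the candidate values.

```python
def coupled_cell_update(f_t, C_prev, C_tilde):
    """Coupled forget/input gates: C_t = f_t * C_{t-1} + (1 - f_t) * C̃_t."""
    return f_t * C_prev + (1.0 - f_t) * C_tilde
```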

Why This Matters for Modern AI

Transformers replaced LSTMs for most tasks. Attention mechanisms handle long-range dependencies more effectively and parallelize better on GPUs. But LSTMs remain useful for streaming applications (real-time speech, continuous sensor data) where you can't wait for the full sequence.

The gating concept from LSTMs appears throughout modern architectures. Many transformers use gated linear units in their feedforward layers. Mixture-of-experts models gate between different sub-networks. Learned, soft routing of information persists even as specific architectures change.
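As one illustration of that continuity (a generic GLU-style sketch, not any particular model's code): a learned gate in (0, 1) multiplies a learned value path, the same soft routing the LSTM gates perform.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_feedforward(x, W_gate, W_value, W_out):
    """GLU-style feedforward block: a learned gate scales a learned value path."""
    gate = sigmoid(x @ W_gate)      # in (0, 1), like an LSTM gate
    value = x @ W_value             # candidate information
    return (gate * value) @ W_out   # only the gated portion flows onward
```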

Olah demonstrated that visual explanations accelerate understanding of complex architectures. His later work on neural network interpretability follows the same approach.

The Bigger Picture

Papers #2 and #3 form a natural pair. Karpathy showed what RNNs can do (generate Shakespeare, code, LaTeX). Olah explained how LSTMs work internally (gates, cell state, gradient flow). Together they cover the "what" and "how" of sequence modeling before transformers.

Both authors later joined major AI labs (Karpathy at OpenAI/Tesla, Olah at Google Brain/Anthropic). Their blog posts demonstrated technical knowledge and the ability to communicate it.


Part of a series on Ilya Sutskever's recommended 30 papers, connecting each to practical AI development.
