Sutskever 30 Deep Dives • Paper #20

Neural Turing Machines

Differentiable External Memory

January 31, 2026 • 7 min read

A Turing machine has a tape it can read and write. Neural networks have weights and activations. Neural Turing Machines add an external memory matrix that a neural controller can address, read, and write, all differentiable, all trainable with gradient descent.

Figure: Neural Turing Machine with memory bank and read/write heads. A controller network interacts with memory through attention-based heads. Content addressing finds relevant data; location addressing enables sequential access.

Why Sutskever Included This

This paper from DeepMind asked: can neural networks learn algorithms? By adding addressable memory, NTMs can learn to copy sequences, recall associated items, and sort by priority. The architecture separates memory from computation, a design principle that influenced later work such as the Differentiable Neural Computer.

External Memory

The memory is an N × M matrix: N locations, each storing an M-dimensional vector. Unlike LSTM hidden states, this memory is large and explicitly addressable. The network can store data at location 17 and retrieve it later by addressing location 17.

All operations must be differentiable for backpropagation. Soft attention solves this: instead of addressing one location, produce a probability distribution over all locations. Read a weighted average; write proportionally to all locations.

Read and Write Heads

Read heads produce attention weights over memory locations, then output the weighted sum of memory contents. Multiple read heads allow parallel access to different information.
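A soft read can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the function name `soft_read` and the toy memory are assumptions made for the example, and plain Python lists are used to keep it dependency-free.

```python
def soft_read(memory, weights):
    """Return the attention-weighted sum of memory rows.

    memory  : list of N rows, each a list of M floats
    weights : list of N non-negative floats that sum to 1
    """
    width = len(memory[0])
    read_vector = [0.0] * width
    for row, w in zip(memory, weights):
        for j in range(width):
            read_vector[j] += w * row[j]
    return read_vector


# Toy memory with N = 3 locations and M = 2 values per location (illustrative).
memory = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]

# A one-hot weighting behaves like an ordinary lookup of location 2.
sharp = soft_read(memory, [0.0, 0.0, 1.0])

# A spread-out weighting blends locations 0 and 1.
blurry = soft_read(memory, [0.5, 0.5, 0.0])
```

Because the output is a smooth function of the weights, gradients can flow back into whatever produced them, which is the reason for reading a blend instead of a single row.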

Write heads modify memory through erase and add operations:

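Erase: M̃_t ← M_{t-1} ⊙ (1 - w ⊗ e)
Add: M_t ← M̃_t + (w ⊗ a)

w is the attention distribution over the N locations, e is the erase vector with entries between 0 and 1, and a is the add vector; ⊗ is the outer product and ⊙ is elementwise multiplication. Locations with high attention weight change more, and a location with zero weight is left untouched.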
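The erase-then-add update above can be written out row by row. The sketch below is illustrative only: `soft_write` is a hypothetical name, and the memory contents are made up for the example.

```python
def soft_write(memory, weights, erase, add):
    """Apply the erase-then-add update and return a new memory.

    memory  : N rows of M floats (the previous memory)
    weights : N attention weights (w)
    erase   : M values between 0 and 1 (e)
    add     : M values (a)
    """
    new_memory = []
    for row, w in zip(memory, weights):
        # Erase scales each cell by (1 - w * e); add then contributes w * a.
        new_row = [
            m * (1.0 - w * e) + w * a
            for m, e, a in zip(row, erase, add)
        ]
        new_memory.append(new_row)
    return new_memory


memory = [[1.0, 1.0], [1.0, 1.0]]

# Full attention on location 0 with a full erase: that row is overwritten.
overwritten = soft_write(memory, [1.0, 0.0], erase=[1.0, 1.0], add=[5.0, 7.0])

# Attention split evenly: both rows move halfway toward the new content.
blended = soft_write(memory, [0.5, 0.5], erase=[1.0, 1.0], add=[5.0, 7.0])
```

With a one-hot weighting and an all-ones erase vector the update reduces to a conventional overwrite; softer weightings smear the same write across several locations.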

Content-Based Addressing

Find memory locations by what they contain. The controller produces a query vector. Compute cosine similarity between the query and each memory row. Apply softmax with a sharpness parameter to get attention weights.

Content addressing finds relevant data regardless of where it's stored. Useful when you need to retrieve by association rather than position.
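As a rough sketch of content-based addressing, the code below scores each memory row by cosine similarity to a query and turns the scores into weights with a softmax. The names `content_weights`, `key`, and `beta` (the sharpness parameter) and the small `eps` guard against zero-length vectors are assumptions for this example.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between u and v; eps avoids division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def content_weights(memory, key, beta):
    """Softmax over beta-scaled cosine similarities between key and each row."""
    scores = [beta * cosine_similarity(key, row) for row in memory]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.1]  # closest in direction to row 0

soft = content_weights(memory, query, beta=10.0)
sharper = content_weights(memory, query, beta=50.0)
```

Raising `beta` pushes more of the weight onto the best-matching row, so the controller can choose between a precise lookup and a blurrier one.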

Location-Based Addressing

Sometimes you need sequential access: read location 5, then 6, then 7. Location-based mechanisms enable this:

Interpolation: Blend content-based weights with the previous timestep's weights. Controls whether to search or continue from where you were.

Shift: Rotate the attention distribution by small amounts. Move attention forward or backward by one position.

Sharpening: Concentrate attention on fewer locations. Prevents weights from spreading too diffusely.
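The three location-based steps can be sketched as small functions applied in sequence. This is an illustration under assumptions, not the paper's code: the function names, the `gate` and `gamma` parameters, and the representation of the shift as a dictionary from offset to probability are choices made for this example.

```python
def interpolate(content_w, prev_w, gate):
    """Blend content-based weights with the previous weights (gate in [0, 1])."""
    return [gate * c + (1.0 - gate) * p for c, p in zip(content_w, prev_w)]


def circular_shift(weights, shift_dist):
    """Rotate weights by a distribution over integer offsets, wrapping around."""
    n = len(weights)
    shifted = [0.0] * n
    for i in range(n):
        for offset, prob in shift_dist.items():
            shifted[i] += weights[(i - offset) % n] * prob
    return shifted


def sharpen(weights, gamma):
    """Raise weights to the power gamma (>= 1) and renormalise."""
    powered = [w ** gamma for w in weights]
    total = sum(powered)
    return [p / total for p in powered]


prev_w = [0.0, 1.0, 0.0, 0.0]         # attention was on location 1
content_w = [0.25, 0.25, 0.25, 0.25]  # an uninformative content lookup

# gate = 0 ignores the content lookup and continues from the previous position.
gated = interpolate(content_w, prev_w, gate=0.0)

# A confident +1 shift moves attention from location 1 to location 2.
stepped = circular_shift(gated, {-1: 0.0, 0: 0.0, 1: 1.0})

# An uncertain shift blurs the weights; sharpening concentrates them again.
blurred = circular_shift(gated, {0: 0.2, 1: 0.8})
refocused = sharpen(blurred, gamma=2.0)
```

Repeating the gate-zero, shift-by-one step at every timestep walks the head through memory one location at a time, which is the sequential access pattern described above.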

Learned Algorithms

The paper demonstrates NTMs learning to copy sequences, sort, and perform associative recall. The network isn't programmed with these algorithms; it discovers them through training.

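Watching attention patterns reveals the learned algorithm: for copying, the write head lays the sequence out in consecutive locations and the read head then scans through them in order. For priority sort, the paper's authors hypothesize that each item's priority determines where it is written, so that reading the locations in order returns the sorted sequence.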

Limitations and Legacy

NTMs are hard to train. Gradients through soft attention can be noisy. Differentiable Neural Computers (DNCs) later extended NTMs with dynamic memory allocation and temporal links that track the order of writes. Transformers achieved similar capabilities through self-attention without explicit memory structures.

The core insight persists: separating memory from computation expands what neural networks can learn.


Part of a series on Ilya Sutskever's recommended 30 papers.
