Sutskever 30 Deep Dives • Paper #18

Relational Recurrent Neural Networks

Memory Slots with Self-Attention

January 31, 2026 • 6 min read

LSTMs compress everything into a single hidden state. For tasks requiring comparison between stored items, this bottleneck limits reasoning. Relational RNNs maintain multiple memory slots that interact through attention, enabling explicit relational reasoning within a recurrent framework.

[Figure: relational memory with multiple slots connected by attention. Each slot attends to every other slot, comparing its contents to theirs, which enables relational reasoning.]

Why Sutskever Included This

This paper bridges recurrent networks and attention mechanisms. Rather than choosing between LSTM memory and transformer-style attention, it combines both. The architecture influenced subsequent work on memory-augmented networks and working memory in neural systems.

The Single-State Bottleneck

Standard LSTMs maintain one hidden state vector. Everything the network remembers must fit there. When a task requires comparing items seen at different timesteps, the network must encode all relevant information and the comparison logic into that single vector.

This works for simple patterns but struggles with relational reasoning: "Is item A larger than item B?" requires maintaining both items distinctly and comparing them.

Multiple Memory Slots

Relational RNNs maintain N separate memory slots instead of one hidden state. Each slot is a vector that persists across timesteps. The slots can store different pieces of information independently.

When new input arrives, it's appended to the memory. Then all slots interact through multi-head self-attention. Each slot can attend to all others, comparing and updating based on relationships.
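The core interaction can be sketched in a few lines of numpy: the new input is stacked onto the memory as an extra row, and each slot forms attention weights over all rows. The random projection matrices below are illustrative stand-ins for learned weights, and the head layout is a simplification, not the paper's exact parameterization.

```python
import numpy as np

def attend_over_memory(memory, x, num_heads=2):
    """Let every memory slot attend over all slots plus the new input.
    Projections are random stand-ins for learned weights (assumption)."""
    rng = np.random.default_rng(0)
    d = memory.shape[1]
    dh = d // num_heads
    # Augment: the new input becomes a temporary extra row.
    mem_aug = np.vstack([memory, x])                    # (N+1, d)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q = memory @ Wq                                     # queries: slots only
    K = mem_aug @ Wk
    V = mem_aug @ Wv
    out = np.zeros_like(memory)
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(dh)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)               # softmax per slot
        out[:, sl] = w @ V[:, sl]
    return out                                          # (N, d) updated slots

memory = np.zeros((4, 8))      # N=4 slots, dimension d=8
x = np.ones((1, 8))            # new input vector
updated = attend_over_memory(memory, x)
print(updated.shape)           # (4, 8)
```

Note that the input row serves only as an extra key/value: the memory stays a fixed N slots from step to step.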

The Update Process

1. Augment: Append input to memory slots
2. Attend: Multi-head self-attention across slots
3. Process: MLP with residual connection
4. Gate: LSTM-style gating for updates

The attention step is key. Unlike standard RNNs, where stored information mixes only implicitly through the recurrent update, slots here interact directly. Different attention heads can learn different types of relationships.
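The four stages above can be wired together into one recurrent step. This is a minimal single-head sketch with a simplified forget-style gate; the weight matrices are illustrative stand-ins for learned parameters, not the paper's full parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relational_step(memory, x, params):
    """One recurrent step, following the four stages above (sketch)."""
    Wq, Wk, Wv, W1, W2, Wg = params
    d = memory.shape[1]
    # 1. Augment: new input joins the memory as an extra row.
    mem_aug = np.vstack([memory, x])
    # 2. Attend: every slot queries all rows (single head here).
    scores = (memory @ Wq) @ (mem_aug @ Wk).T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    attended = w @ (mem_aug @ Wv)
    # 3. Process: two-layer MLP with a residual connection.
    candidate = memory + np.tanh(attended @ W1) @ W2
    # 4. Gate: interpolate old and new contents (a simplification of
    #    the paper's full LSTM-style gating).
    g = sigmoid(memory @ Wg)
    return g * memory + (1 - g) * np.tanh(candidate)

rng = np.random.default_rng(0)
d = 8
params = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(6)]
memory = np.zeros((4, d))
new_memory = relational_step(memory, np.ones((1, d)), params)
print(new_memory.shape)   # (4, 8)
```

Running this step in a loop over a sequence gives the full recurrence: the memory matrix plays the role the hidden state plays in an LSTM.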

Gated Updates

The architecture uses LSTM-style gates (input, forget, output) to control memory updates. This prevents catastrophic overwriting and allows selective retention of important information.

The combination provides both: attention for relational reasoning and gating for stable long-term memory.
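Applied per slot, the gating looks much like an LSTM cell update. In this sketch, `h` stands for the attended candidate produced for each slot; the gate weights and their shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d = 8
# Gate weights (illustrative stand-ins for learned parameters).
Wf, Wi, Wo = (rng.standard_normal((2 * d, d)) / np.sqrt(2 * d) for _ in range(3))

def gated_update(m_prev, h):
    """LSTM-style gates computed per slot from the previous slot
    contents and the attended candidate `h`."""
    z = np.hstack([m_prev, h])          # (N, 2d): gate inputs
    f = sigmoid(z @ Wf)                 # forget: how much old content to keep
    i = sigmoid(z @ Wi)                 # input: how much new content to admit
    o = sigmoid(z @ Wo)                 # output: what the slot exposes
    m_new = f * m_prev + i * np.tanh(h)
    return m_new, o * np.tanh(m_new)

m_prev = np.zeros((4, d))
h = rng.standard_normal((4, d))
m_new, out = gated_update(m_prev, h)
```

Because the forget gate can saturate near 1, a slot can carry its contents unchanged across many timesteps, which is what gives the architecture stable long-term storage alongside the attention-based interaction.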

Tasks and Results

The paper tests on tasks requiring multi-step reasoning: sorting sequences, solving algorithmic problems, and language modeling. Relational RNNs outperform LSTMs on tasks with clear relational structure.

The improvement is largest when the task explicitly requires comparing stored items. For simple sequence prediction, the benefit is smaller.

Connection to Transformers

Transformers use self-attention without recurrence. Relational RNNs use self-attention within recurrence. The approaches are complementary: transformers parallelize better; relational RNNs may handle very long sequences with bounded memory.
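The bounded-memory point is easy to see with rough back-of-envelope counts (per layer, ignoring parameters and constant factors; these are assumptions for illustration, not measurements from the paper):

```python
# State that must be kept around while processing a length-T sequence,
# with hidden size d (rough counts, constant factors ignored).
def transformer_state(T, d):
    return T * d          # cached keys/values grow with sequence length

def relational_rnn_state(N, d):
    return N * d          # N memory slots, fixed regardless of T

print(transformer_state(10_000, 512))    # 5_120_000
print(relational_rnn_state(8, 512))      # 4_096
```

The relational RNN pays for this with sequential processing: like any RNN, it cannot attend back to raw past inputs, only to what the slots chose to retain.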


Part of a series on Ilya Sutskever's recommended 30 papers.
