Pretraining Recurrent Networks without Recurrence
Akarsh Kumar,
Phillip Isola
Massachusetts Institute of Technology. Preprint 2026.
[arXiv]
[Code]
TLDR
We propose Supervised Memory Training (SMT), a replacement for backpropagation through time (BPTT) for training nonlinear RNNs.
SMT trains a time-parallel encoder to produce 'optimal' memory states: compressed representations of the past that are predictive of the future.
The RNN is trained with one-step supervised learning to mimic transitions between these optimal memory states.
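The two stages above can be illustrated with a toy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: `encode_prefix` is a fixed, hand-written stand-in for the learned time-parallel encoder, the cell is linear and scalar rather than a nonlinear RNN, and the one-step loss is a squared error. The point is only that each `(m_prev, x_t, m_t)` triple is an independent supervised example, so fitting the cell needs no unrolling through time.

```python
import random

def encode_prefix(xs, t, decay=0.5):
    """Hypothetical stand-in for the time-parallel encoder: maps the prefix
    xs[0..t] to a memory state directly, without any recurrence."""
    return sum(decay ** (t - s) * xs[s] for s in range(t + 1))

def make_transitions(xs):
    """Build independent (m_prev, x_t, m_t) training triples for one sequence."""
    targets = [encode_prefix(xs, t) for t in range(len(xs))]  # each t computed independently
    prevs = [0.0] + targets[:-1]  # assume the memory before the first token is zero
    return list(zip(prevs, xs, targets))

def train_cell(triples, lr=0.05, epochs=200, seed=0):
    """Fit a toy linear cell m_t = a * m_prev + b * x_t by one-step regression."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(triples)
        for m_prev, x, m_target in triples:
            err = (a * m_prev + b * x) - m_target  # one-step error, no unrolling
            a -= lr * err * m_prev  # SGD step on 0.5 * err**2
            b -= lr * err * x
    return a, b

def rollout(a, b, xs):
    """Run the trained cell recurrently, as it would be used at inference."""
    m, states = 0.0, []
    for x in xs:
        m = a * m + b * x
        states.append(m)
    return states

rng = random.Random(1)
train_xs = [rng.uniform(-1, 1) for _ in range(64)]
a, b = train_cell(make_transitions(train_xs))

# Compare the recurrent rollout with the encoder's states on a held-out sequence.
test_xs = [rng.uniform(-1, 1) for _ in range(16)]
recurrent = rollout(a, b, test_xs)
parallel = [encode_prefix(test_xs, t) for t in range(len(test_xs))]
max_gap = max(abs(r - p) for r, p in zip(recurrent, parallel))
```

In this toy setting the encoder's states happen to satisfy an exact linear recurrence, so the cell can match them closely. With a learned encoder and a nonlinear cell the match would presumably be approximate.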
Key Results
SMT achieves:
- Fully time-parallel RNN training, enabling scaling.
- O(1) gradient path length between tokens, solving vanishing gradients and enabling long-range memory.
- Constant time and constant memory inference (per token), unlike Transformers.
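The inference claim in the last bullet can be made concrete with a small sketch. The names `rnn_stream`, `toy_step`, and `cache_stream` are hypothetical and the cell is a placeholder, not the model from the paper. The sketch contrasts a recurrent model, which overwrites a fixed-size state at each token, with an attention-style cache that stores one entry per past token.

```python
def rnn_stream(tokens, step, state):
    """Consume tokens one at a time; only the fixed-size state is kept."""
    for x in tokens:
        state = step(state, x)  # the previous state is overwritten
    return state

def toy_step(state, x):
    # Hypothetical cell: any function mapping (state, token) to a new
    # state of the same size would do here.
    return [0.9 * s + 0.1 * x for s in state]

def cache_stream(tokens):
    """Contrast: an attention-style cache keeps one entry per past token."""
    cache = []
    for x in tokens:
        cache.append(x)
    return cache

short_state = rnn_stream(range(10), toy_step, [0.0] * 4)
long_state = rnn_stream(range(10_000), toy_step, [0.0] * 4)
cache = cache_stream(range(1_000))
```

The recurrent state has the same size after 10 tokens as after 10,000, while the cache grows linearly with the number of tokens seen.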
Applications
- SMT is applicable to sequence-to-sequence (seq2seq) problems, such as generative modeling and behavioral cloning, where the goal is to predict future values of a sequence.
- SMT is not directly applicable to reinforcement learning (RL) problems because it does not optimize a general reward function. This is a limitation compared to BPTT, which can optimize a general reward.
Citation
@article{kumar2026smt,
title = {Pretraining Recurrent Networks without Recurrence},
author = {Akarsh Kumar and Phillip Isola},
year = {2026},
url = {https://akarshkumar.com/smt},
}