Pretraining Recurrent Networks without Recurrence

Massachusetts Institute of Technology. Preprint 2026.

TLDR

We propose Supervised Memory Training (SMT), a replacement for backpropagation through time (BPTT) for training nonlinear RNNs. SMT trains a time-parallel encoder to produce 'optimal' memory states: compressed representations of the past that are predictive of the future. The RNN is then trained with one-step supervised learning to mimic the transitions between these optimal memory states.
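The one-step objective described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the "encoder" is a closed-form exponential moving average standing in for a learned time-parallel encoder, the memory state is a single scalar, and the names `encode_parallel`, `one_step_loss`, `tanh_cell`, and `linear_cell` are hypothetical.

```python
import math

def encode_parallel(xs, decay=0.9):
    """Stand-in for the time-parallel encoder (hypothetical, not the paper's model).

    Each target m*_t is computed directly from x_0..x_t, with no dependence
    on previously computed targets, so every t could be evaluated in parallel.
    """
    return [
        sum((1 - decay) * decay ** (t - k) * xs[k] for k in range(t + 1))
        for t in range(len(xs))
    ]

def one_step_loss(cell, xs, targets, m0=0.0):
    """Mean squared error between cell(m*_{t-1}, x_t) and m*_t.

    The cell always receives the target previous state, never its own
    output, so no loss term depends on another and nothing is unrolled.
    """
    prev = [m0] + targets[:-1]
    errors = [(cell(m, x) - tgt) ** 2 for m, x, tgt in zip(prev, xs, targets)]
    return sum(errors) / len(errors)

def tanh_cell(w_m, w_x):
    # A nonlinear recurrent cell with two illustrative scalar weights.
    return lambda m, x: math.tanh(w_m * m + w_x * x)

def linear_cell(w_m, w_x):
    # A linear cell; with (decay, 1 - decay) it reproduces the toy targets.
    return lambda m, x: w_m * m + w_x * x

xs = [0.5, -1.0, 0.25, 2.0, -0.75]
targets = encode_parallel(xs)
loss_nonlinear = one_step_loss(tanh_cell(0.9, 0.1), xs, targets)
loss_exact = one_step_loss(linear_cell(0.9, 0.1), xs, targets)
```

In this toy setting the linear cell matches the stand-in targets up to floating-point error, while the tanh cell leaves a small residual that training would reduce. The point of the sketch is only the shape of the loss: each term pairs a target state with the next target state, so the sum can be computed for all time steps at once.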

Key Results

SMT achieves:

Fully time-parallel RNN training, enabling scaling.
O(1) gradient path length between tokens, solving vanishing gradients and enabling long-range memory.
Constant-time, constant-memory inference per token, unlike Transformers, whose per-token cost grows with context length.
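The inference claim in the list above can be made concrete with a small sketch. It assumes a scalar state and placeholder weights; `rnn_step`, `rnn_inference`, and `cached_inference` are hypothetical names, and the growing list only loosely stands in for an attention key-value cache.

```python
import math

def rnn_step(state, x, w_m=0.9, w_x=0.1):
    """One recurrent update; the weights are illustrative placeholders."""
    return math.tanh(w_m * state + w_x * x)

def rnn_inference(xs):
    """Process a stream while holding only one fixed-size state."""
    state = 0.0
    floats_held = []
    for x in xs:
        state = rnn_step(state, x)      # the old state is overwritten
        floats_held.append(1)           # storage does not depend on t
    return state, floats_held

def cached_inference(xs):
    """Contrast case: a cache that keeps every past token."""
    cache = []
    floats_held = []
    for x in xs:
        cache.append(x)                 # storage grows by one per token
        floats_held.append(len(cache))
    return floats_held

xs = [0.5, -1.0, 0.25, 2.0, -0.75]
final_state, rnn_storage = rnn_inference(xs)
cache_storage = cached_inference(xs)
```

Here `rnn_storage` stays flat across the sequence while `cache_storage` grows linearly, which is the per-token contrast the list item refers to.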

Applications

SMT is applicable to sequence-to-sequence (seq2seq) problems, such as generative modeling and behavioral cloning, where the goal is to predict future values of a sequence.
SMT is not directly applicable to reinforcement learning (RL) problems because it does not optimize a general reward function. This is a limitation compared to BPTT, which can.

Citation

@article{kumar2026smt,
  title     = {Pretraining Recurrent Networks without Recurrence},
  author    = {Akarsh Kumar and Phillip Isola},
  year      = {2026},
  url       = {https://arxiv.org/abs/2606.06479},
  note      = {Project page: \url{https://akarshkumar.com/smt}},
}