Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets

TL;DR
This paper introduces Diagonal Batching, a run-time scheduling scheme that enables parallel inference in Recurrent Memory Transformers, significantly improving speed and efficiency for long-context processing without retraining.
Contribution
Diagonal Batching is a novel scheduling method that unlocks parallelism in RMTs, eliminating sequential constraints and enabling efficient GPU inference for long sequences.
Findings
3.3x speedup over standard full-attention LLaMA-1B
1.8x speedup over sequential RMT on 131,072 tokens
Reduces inference cost and latency for long-context models
Abstract
Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x…
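The reordering described in the abstract can be illustrated with a small scheduling sketch. It is not the authors' implementation. It assumes, hypothetically, that each (segment, layer) cell depends only on the previous layer of the same segment and on the same layer of the previous segment; under that assumption, all cells with the same segment + layer index are independent and can be grouped into one batched step. The names `diagonal_schedule` and `respects_recurrence` are illustrative.

```python
from collections import defaultdict


def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells into diagonals where segment + layer == d.

    Assumption (not taken from the paper text above): cell (s, l) needs
    only (s, l - 1) and (s - 1, l), so cells on one diagonal are independent.
    """
    diagonals = defaultdict(list)
    for s in range(num_segments):
        for l in range(num_layers):
            diagonals[s + l].append((s, l))
    return [diagonals[d] for d in range(num_segments + num_layers - 1)]


def respects_recurrence(schedule):
    """Check that every cell runs strictly after both of its assumed dependencies."""
    step_of = {cell: i for i, group in enumerate(schedule) for cell in group}
    for (s, l), step in step_of.items():
        if l > 0 and step_of[(s, l - 1)] >= step:
            return False
        if s > 0 and step_of[(s - 1, l)] >= step:
            return False
    return True


schedule = diagonal_schedule(num_segments=4, num_layers=3)
sequential_steps = 4 * 3        # one cell at a time
diagonal_steps = len(schedule)  # one batched group per diagonal
```

In this toy setting the number of dependent steps drops from num_segments * num_layers to num_segments + num_layers - 1, while each step batches at most min(num_segments, num_layers) cells. Whether that translates into wall-clock gains depends on how well the grouped cells fill the GPU, which is what the reported measurements address.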
Peer Reviews
Decision: Submitted to ICLR 2026
- good exposition of background information and framing of the contributions of the paper
- proposed method seems simple and can be applied quite generally across different models. If indeed true, the method provides a lot of latency gains basically for free.
- method section seems quite complete. The authors go through both the high-level motivations but also outline the implementation details.
- experiments section seems complete as well. The authors show comparisons with different scales, seq

- The method is actually slower for sequence lengths 4096 and 8192.
- The tables and figures mainly show latency. I would have wanted to see the effect also on other metrics like memory or GPU utilization.
- The paper reported error accumulation numbers, but I would have wanted to also see the actual effect on the produced tokens, or maybe just verify that scores on some common benchmarks like MMLU remain the same.

1. Practical relevance – The paper addresses a real bottleneck: GPU underutilization during long-context inference in memory-augmented transformers. The proposal is pragmatic, compatible with existing hardware, and doesn’t require custom CUDA, which makes it accessible.
2. Elegant scheduling insight – The diagonal reordering idea is conceptually simple yet powerful, exposing latent parallelism while preserving recurrence, an often tricky balance.
3. Strong empirical evidence – The results convi

1. Incremental nature – The main novelty lies in scheduling, not modeling or theory. While the implementation is clever, the conceptual leap from standard pipelining or grouped execution is limited. The paper frames this as a major innovation, but it’s more of an engineering optimization than a new algorithmic idea.
2. Limited empirical diversity – All experiments are on LLaMA-based ARMTs. There is no exploration of how the method generalizes to other PRMT architectures (e.g., RWKV or Mamba) be

- general method for sequence-recurrent architectures
- good speedups (1.1x-3.3x) for the pre-fill part (latency) of inference

- limited application to pre-filling (latency) optimization of recurrent LLMs
- no practical RMT model (e.g. an existing RWKV/xLSTM/Mamba-based model) shown where this is applied (e.g. for reasoning tasks, where long sequences would be strongly beneficial)
- RMTs in general break translational invariance in text, so some tokens would be "different" from others depending on their positions (especially at the segment border)
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
Methods: ADaptive gradient method with the OPTimal convergence rate
