Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Danil Sivtsov; Ivan Rodkin; Gleb Kuzmin; Yuri Kuratov; Ivan Oseledets

arXiv:2506.05229·cs.LG·June 6, 2025

TL;DR

This paper introduces Diagonal Batching, a run-time scheduling scheme that enables parallel inference in Recurrent Memory Transformers, significantly improving speed and efficiency for long-context processing without retraining.

Contribution

Diagonal Batching is a novel scheduling method that unlocks parallelism in RMTs, eliminating sequential constraints and enabling efficient GPU inference for long sequences.
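
The scheduling idea can be illustrated with a small sketch. It assumes a dependency structure that this page does not spell out: the computation for segment `s` at layer `l` needs only the output of `(s, l - 1)` and the memory from `(s - 1, l)`. Under that assumption, all cells on the same anti-diagonal `s + l` are independent and can be grouped into one step. The function names are hypothetical and this is not the authors' implementation.

```python
from collections import defaultdict


def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells into steps whose cells can run together.

    Assumption (illustrative only): cell (s, l) depends on (s, l - 1) and
    (s - 1, l), so every cell with s + l == d depends only on diagonal d - 1.
    """
    groups = defaultdict(list)
    for s in range(num_segments):
        for l in range(num_layers):
            groups[s + l].append((s, l))
    return [groups[d] for d in range(num_segments + num_layers - 1)]


def respects_dependencies(schedule):
    """Check that each cell runs strictly after both assumed dependencies."""
    done = set()
    for group in schedule:
        for s, l in group:
            if l > 0 and (s, l - 1) not in done:
                return False
            if s > 0 and (s - 1, l) not in done:
                return False
        done.update(group)
    return True


schedule = diagonal_schedule(num_segments=4, num_layers=3)
for step, group in enumerate(schedule):
    print(step, group)
```

With 4 segments and 3 layers, a strictly sequential pass visits 12 cells one at a time, while the diagonal grouping needs 6 steps, each holding at most 3 cells. The order of results within the recurrence is unchanged; only the execution order differs.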

Findings

01

3.3x speedup over standard full-attention LLaMA-1B

02

1.8x speedup over sequential RMT on 131,072 tokens

03

Reduces inference cost and latency for long-context models

Abstract

Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x…
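
To make the quadratic-versus-linear claim concrete, the rough count below compares the number of attention score pairs for full attention with a segmented model that attends only within fixed-size segments. The segment length of 1,024 is an assumption chosen for illustration, and the count ignores memory tokens, feed-forward layers, and constant factors, so it is not a prediction of the measured speedups.

```python
def full_attention_pairs(n_tokens):
    """Token-pair scores when every token attends over the whole input."""
    return n_tokens * n_tokens


def segmented_pairs(n_tokens, segment_len):
    """Token-pair scores when attention is restricted to each segment."""
    n_segments = -(-n_tokens // segment_len)  # ceiling division
    return n_segments * segment_len * segment_len


n_tokens = 131_072      # sequence length quoted in the findings
segment_len = 1_024     # hypothetical segment size, not taken from the paper

full = full_attention_pairs(n_tokens)
segmented = segmented_pairs(n_tokens, segment_len)
ratio = full // segmented
print(full, segmented, ratio)
```

Under these assumptions the ratio equals the number of segments (128 here), which is why the segmented cost grows linearly with input length while full attention grows quadratically.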

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- good exposition of background information and framing of the contributions of the paper
- proposed method seems simple and can be applied quite generally across different models. If indeed true, the method provides a lot of latency gains basically for free.
- method section seems quite complete. The authors go through both the high-level motivations but also outline the implementation details.
- experiments section seems complete as well. The authors show comparisons with different scales, seq

Weaknesses

- The method is actually slower for sequence length 4096 and 8192.
- The tables and figures mainly show latency. I would have wanted to see the effect also on other metrics like memory or GPU utilization.
- The paper reported error accumulation numbers, but I would have wanted to also see the actual effect on the produced tokens, or maybe just verify that scores on some common benchmarks like MMLU remain the same.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Practical relevance: The paper addresses a real bottleneck: GPU underutilization during long-context inference in memory-augmented transformers. The proposal is pragmatic, compatible with existing hardware, and doesn’t require custom CUDA, which makes it accessible.
2. Elegant scheduling insight: The diagonal reordering idea is conceptually simple yet powerful, exposing latent parallelism while preserving recurrence, an often tricky balance.
3. Strong empirical evidence: The results convi

Weaknesses

1. Incremental nature: The main novelty lies in scheduling, not modeling or theory. While the implementation is clever, the conceptual leap from standard pipelining or grouped execution is limited. The paper frames this as a major innovation, but it’s more of an engineering optimization than a new algorithmic idea.
2. Limited empirical diversity: All experiments are on LLaMA-based ARMTs. There is no exploration of how the method generalizes to other PRMT architectures (e.g., RWKV or Mamba) be

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- general method for sequence-recurrent architectures
- good speedups (1.1-3.3x) for the pre-fill part (latency) of inference

Weaknesses

- limited application to pre-filling (latency) optimization of recurrent LLMs
- no practical RMT model (e.g. an existing RWKV/xLSTM/Mamba-based model) shown where this is applied (e.g. for reasoning tasks, where long sequences would be strongly beneficial)
- RMTs in general break translational invariance in text, so some tokens would be "different" from others depending on their positions (especially at the segment border)

Code & Models

Repositories

svtdanny/diagonal-batching
PyTorch · Official

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis

Methods: ADaptive gradient method with the OPTimal convergence rate