MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

Yulong Huang; Xiang Liu; Hongxiang Huang; Xiaopeng Lin; Zunchang Liu; Xiaowen Chu; Zeke Xie; Bojun Cheng

arXiv:2605.05838 · cs.LG · May 8, 2026

TL;DR

This paper introduces MDN, a parallelized momentum-based linear attention model that improves training efficiency and performance for large language models on long sequences.

Contribution

The paper develops a chunkwise parallel algorithm for linear attention with a stepwise momentum rule, and analyzes the resulting momentum-based recurrence as a second-order dynamical system to guide the design of stable gating.
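As a rough intuition for why a stepwise momentum recurrence can be rewritten in a parallel-friendly form, the sketch below uses a simplified setting that is an assumption made here, not the paper's algorithm: the per-step gradients are fixed in advance and do not depend on the state (the delta rule's gradient does depend on the state, so the real case is harder). Under that assumption, unrolling the momentum buffer gives each gradient a geometric-series weight, and the final state becomes a single weighted sum. The function names, `beta`, and `mu` are illustrative.

```python
import numpy as np

def momentum_sequential(grads, beta, mu):
    """Step-by-step heavy-ball update: M_t = mu*M_{t-1} + G_t, S_t = S_{t-1} - beta*M_t."""
    S = np.zeros_like(grads[0])
    M = np.zeros_like(grads[0])
    for G in grads:
        M = mu * M + G
        S = S - beta * M
    return S

def momentum_closed_form(grads, beta, mu):
    """Same final state as one weighted sum (requires mu != 1).

    Gradient i (0-indexed) enters M_j with weight mu**(j - i) for every j >= i,
    so its total weight over T steps is (1 - mu**(T - i)) / (1 - mu).
    """
    T = len(grads)
    weights = (1.0 - mu ** (T - np.arange(T))) / (1.0 - mu)
    return -beta * np.tensordot(weights, np.stack(grads), axes=1)

rng = np.random.default_rng(0)
grads = [rng.standard_normal((4, 3)) for _ in range(6)]
S_seq = momentum_sequential(grads, beta=0.1, mu=0.9)
S_par = momentum_closed_form(grads, beta=0.1, mu=0.9)
```

Both functions return the same matrix up to floating-point error; the closed form replaces the sequential loop with one reduction over the time axis.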

Findings

01. MDN achieves comparable training throughput to leading models.

02. Extensive experiments show performance improvements over baselines.

03. MDN performs well on diverse downstream benchmarks.

Abstract

Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence in optimization. While momentum-based optimizers provide a natural remedy, they pose challenges in simultaneously achieving training efficiency and effectiveness. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical systems perspective, we analyze the momentum-based recurrence as a second-order system that introduces complex conjugate eigenvalues. This analysis guides the design of stable gating…
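The SGD reading of the recurrence and the second-order view of momentum can be made concrete with a small sketch. It assumes the standard delta rule (one SGD step on the loss 0.5 * ||S k - v||^2) and a generic heavy-ball momentum update; the exact MDN update and its gating are not given in this abstract, so `momentum_delta_step`, `transition_matrix`, `beta`, `mu`, and `lam` are illustrative names rather than the paper's definitions.

```python
import numpy as np

def delta_step(S, k, v, beta):
    """Delta rule: one SGD step on 0.5 * ||S k - v||^2 with step size beta."""
    err = S @ k - v
    return S - beta * np.outer(err, k)

def momentum_delta_step(S, M, k, v, beta, mu):
    """Assumed heavy-ball variant: accumulate the gradient in M, then step along M."""
    grad = np.outer(S @ k - v, k)
    M = mu * M + grad
    S = S - beta * M
    return S, M

def transition_matrix(beta, mu, lam=1.0):
    """Linear map on (error, momentum) along one direction with curvature lam.

    From m_t = mu*m_{t-1} + lam*s_{t-1} and s_t = s_{t-1} - beta*m_t.
    Its determinant is mu and its trace is 1 + mu - beta*lam.
    """
    return np.array([[1.0 - beta * lam, -beta * mu],
                     [lam, mu]])

# Eigenvalues are a complex-conjugate pair when (1 + mu - beta*lam)**2 < 4*mu.
eigs = np.linalg.eigvals(transition_matrix(beta=0.5, mu=0.9))
```

In this toy system the two-variable state (error, momentum) is what makes the recurrence second-order. When the eigenvalues are complex, their modulus is sqrt(mu), so the error oscillates while decaying, which is one reason the choice of gates and momentum coefficient matters for stability.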

Code & Models

Repositories

HuuYuLong/MomentumDeltaNet (GitHub)
