MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

Wenxuan Li; Chengruidong Zhang; Huiqiang Jiang; Yucheng Li; Yuqing Yang; Lili Qiu

arXiv:2510.18830·cs.CL·May 20, 2026

TL;DR

MTraining introduces a distributed dynamic sparse attention method that improves training efficiency for ultra-long-context LLMs, enabling larger context windows while maintaining accuracy.

Contribution

The paper presents a distributed training approach that combines a dynamic sparse training pattern with balanced and hierarchical sparse ring attention to address worker- and step-level imbalance in ultra-long-context LLM training.

Findings

01 Expanded Qwen2.5-3B context from 32K to 512K tokens

02 Achieved up to 6x higher training throughput

03 Maintained model accuracy across diverse tasks

Abstract

The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components…
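
The worker-level imbalance mentioned in the abstract can be illustrated with a toy block-sparse mask. The sketch below is not MTraining's algorithm: the mask pattern, the function names, and the zigzag assignment (a generic load-balancing trick for causal attention) are all assumptions chosen for illustration. It counts how many active attention blocks each worker would compute when query blocks are split contiguously versus in a zigzag order.

```python
def block_mask(n_blocks, stride=2):
    """Toy causal block mask (hypothetical pattern): each query block i
    attends to its own diagonal block plus every `stride`-th key block j <= i."""
    return [
        [j <= i and (j == i or j % stride == 0) for j in range(n_blocks)]
        for i in range(n_blocks)
    ]


def worker_loads(mask, assignment, n_workers):
    """Count active blocks per worker, given the worker id of each query block."""
    loads = [0] * n_workers
    for row, worker in zip(mask, assignment):
        loads[worker] += sum(row)  # True counts as 1
    return loads


def imbalance(loads):
    """Ratio of the busiest worker's load to the mean load (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


def contiguous(n_blocks, n_workers):
    """Worker w owns a consecutive run of query blocks."""
    per_worker = n_blocks // n_workers
    return [i // per_worker for i in range(n_blocks)]


def zigzag(n_blocks, n_workers):
    """Pair an early query block with a late one on the same worker.
    Assumes n_blocks == 2 * n_workers."""
    return [min(i, n_blocks - 1 - i) % n_workers for i in range(n_blocks)]


N_BLOCKS, N_WORKERS = 8, 4
mask = block_mask(N_BLOCKS)

loads_contiguous = worker_loads(mask, contiguous(N_BLOCKS, N_WORKERS), N_WORKERS)
loads_zigzag = worker_loads(mask, zigzag(N_BLOCKS, N_WORKERS), N_WORKERS)

print("contiguous:", loads_contiguous, "imbalance:", imbalance(loads_contiguous))
print("zigzag:    ", loads_zigzag, "imbalance:", imbalance(loads_zigzag))
```

With this toy mask, the contiguous split leaves the last worker with three times the work of the first, while the zigzag split evens the loads out. A dynamic sparse pattern changes from step to step, so a fixed assignment like this one would not be enough on its own; the example only shows why the imbalance arises.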

Code & Models

Repositories

GitHub: microsoft/MInference/tree/main/MTraining

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare