TL;DR
MTraining introduces a distributed dynamic sparse attention method that improves training efficiency for ultra-long context LLMs, enabling larger context windows while maintaining accuracy.
Contribution
It presents a distributed training approach that combines a dynamic sparse training pattern with balanced and hierarchical sparse ring attention to address worker- and step-level imbalance in ultra-long context LLM training.
Findings
Expanded Qwen2.5-3B context from 32K to 512K tokens
Achieved up to 6x higher training throughput
Maintained model accuracy across diverse tasks
Abstract
The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components…
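To make the worker-level imbalance mentioned in the abstract concrete, here is a minimal sketch under simplifying assumptions: plain causal (dense) attention, a sequence sharded by query position across workers, and work counted as the number of (query, key) pairs. The function names and the contiguous-versus-striped comparison are illustrative choices, not MTraining's balanced sparse ring attention, which the abstract does not describe in detail.

```python
def causal_work_per_worker(seq_len, num_workers, striped):
    """Count the (query, key) pairs each worker computes under a causal mask.

    Hypothetical helper for illustration only.
    striped=False: worker w owns one contiguous block of queries.
    striped=True:  worker w owns queries w, w + num_workers, ...
    """
    work = [0] * num_workers
    chunk = seq_len // num_workers
    for q in range(seq_len):
        owner = q % num_workers if striped else q // chunk
        work[owner] += q + 1  # query q attends to keys 0..q
    return work


def imbalance(work):
    """Ratio of the busiest worker's load to the mean load (1.0 is perfect)."""
    return max(work) / (sum(work) / len(work))


contiguous = causal_work_per_worker(1024, 8, striped=False)
striped = causal_work_per_worker(1024, 8, striped=True)

print("contiguous imbalance:", round(imbalance(contiguous), 3))
print("striped imbalance:   ", round(imbalance(striped), 3))
```

In this toy setting the contiguous split leaves the last worker with nearly twice the average load, while the striped split is close to even. With dynamic sparse attention the set of computed pairs depends on the input, so a fixed assignment like this one is only a starting point for reasoning about balance.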
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare
