Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Xiang Hu, Zhanchao Zhou, Ruiqi Liang, Zehuan Li, Wei Wu, Jianguo Li

TL;DR
This paper introduces HSA-UltraLong, a large language model with hierarchical sparse attention enabling efficient processing of ultra-long contexts up to 16 million tokens, advancing long-term memory capabilities.
Contribution
We propose Hierarchical Sparse Attention (HSA), a novel attention mechanism integrated into Transformers to handle ultra-long contexts efficiently, and demonstrate its effectiveness in an 8B-parameter MoE model trained on over 8 trillion tokens.
Findings
Achieves over 90% accuracy on most in-context retrieval tasks with contexts of up to 16M tokens
Performs comparably to full-attention baselines on in-domain lengths
Provides insights and open problems for ultra-long context modeling
Abstract
This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, which is an 8B-parameter MoE model trained on over 8 trillion tokens and is rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context…
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
