Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

Xiang Hu; Zhanchao Zhou; Ruiqi Liang; Zehuan Li; Wei Wu; Jianguo Li

arXiv:2511.23319·cs.CL·December 1, 2025

Open Access

TL;DR

This paper introduces HSA-UltraLong, a large language model with hierarchical sparse attention enabling efficient processing of ultra-long contexts up to 16 million tokens, advancing long-term memory capabilities.

Contribution

We integrate Hierarchical Sparse Attention (HSA), a novel attention mechanism, into Transformers to handle ultra-long contexts efficiently, and demonstrate its effectiveness with HSA-UltraLong, an 8B-parameter MoE model trained on over 8 trillion tokens.

Findings

01. Achieves over 90% accuracy on most in-context retrieval tasks with contexts of up to 16M tokens.

02. Performs comparably to full-attention baselines on in-domain context lengths.

03. Provides insights and open problems for ultra-long context modeling.

Abstract

This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, an 8B-parameter MoE model trained on over 8 trillion tokens and rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context…

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications