Curse of High Dimensionality Issue in Transformer for Long-context Modeling

Shuhai Zhang; Zeng You; Yaofo Chen; Zhiquan Wen; Qianyue Wang; Zhijie Qiu; Yuanqing Li; Mingkui Tan

arXiv:2505.22107·cs.CL·August 15, 2025

TL;DR

This paper identifies redundant attention computation in long-context transformers: attention weights are often sparse, yet every token consumes equal compute. It reformulates sequence modeling as a supervised learning task to analyze that redundancy and proposes Dynamic Group Attention (DGA) to reduce computational cost while maintaining performance.

Contribution

It introduces a novel reformulation of sequence modeling as supervised learning, develops a group coding strategy, and proposes Dynamic Group Attention to improve efficiency in long-context transformers.

Findings

01. DGA reduces computational costs significantly.

02. DGA maintains competitive performance.

03. Theoretical analysis confirms robustness and efficiency improvements.

Abstract

Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy,…
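The idea of sparse attention weights and grouped tokens can be illustrated with a small sketch. The code below is not the paper's Dynamic Group Attention algorithm, whose details are not given in this abstract. It is a hypothetical illustration under stated assumptions: the highest-scoring tokens are kept individually, and the remaining tokens are mean-pooled into fixed-size groups. The function names (`top_k_mass`, `grouped_attention`) and the parameters `top_k` and `group_size` are invented for this example. For simplicity the sketch scores every token before selecting, so it shows the aggregation step only and not a real cost saving.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vec(vectors):
    # Element-wise mean of a non-empty list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def full_attention(query, keys, values):
    # Standard scaled dot-product attention for a single query.
    d = len(query)
    weights = softmax([dot(query, k) / math.sqrt(d) for k in keys])
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights


def top_k_mass(weights, k):
    # Fraction of the attention mass held by the k largest weights.
    return sum(sorted(weights, reverse=True)[:k])


def grouped_attention(query, keys, values, top_k, group_size):
    # Hypothetical grouping scheme (an assumption, not the paper's method):
    # keep the top_k highest-scoring tokens as they are and replace the
    # rest with the mean key/value of each group of group_size tokens.
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:top_k])
    rest = sorted(order[top_k:])
    new_keys = [keys[i] for i in keep]
    new_values = [values[i] for i in keep]
    for start in range(0, len(rest), group_size):
        idx = rest[start:start + group_size]
        new_keys.append(mean_vec([keys[i] for i in idx]))
        new_values.append(mean_vec([values[i] for i in idx]))
    out, _ = full_attention(query, new_keys, new_values)
    return out, len(new_keys)


if __name__ == "__main__":
    random.seed(0)
    n, d = 64, 8
    keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    query = [random.gauss(0, 1) for _ in range(d)]
    _, weights = full_attention(query, keys, values)
    _, slots = grouped_attention(query, keys, values, top_k=8, group_size=8)
    print("mass in top 8 weights:", top_k_mass(weights, 8))
    print("attended slots:", slots, "of", n)
```

With `top_k` equal to the sequence length, no tokens are grouped and the result matches full attention. Smaller values of `top_k` trade accuracy for fewer attended slots.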

Code & Models

Repositories

bolixinyu/dynamicgroupattention (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Big Data and Digital Economy

Methods: Softmax · Attention Is All You Need