Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference
Yaohua Tang, Zhicheng Hu, Kun Cheng, Fan Mo, Qiheng Lv, Hua Wang, Zhi Chen

TL;DR
This paper introduces Round Attention, a mechanism that processes only the KV cache of the dialogue rounds most relevant to the current query. The paper's theoretical analysis puts the memory reduction at 54% to 82% while accuracy is maintained, which improves inference efficiency.
Contribution
The paper proposes a novel round-level attention mechanism that dynamically selects relevant dialogue rounds, reducing memory consumption during LLM inference without sacrificing performance.
Findings
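Reduces memory usage by 54% to 82%, according to the paper's theoretical analysis.
Maintains answer accuracy with a sparse KV cache.
Identifies a watershed layer in dialogue data, after which round-level attention distributions are notably similar.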
Abstract
The increasing context window size in large language models (LLMs) has improved their ability to handle complex, long-text tasks. However, as conversation rounds accumulate, a large amount of KV cache must be stored in GPU memory, which significantly affects the efficiency and even the availability of model serving systems. This paper analyzes dialogue data from real users at the granularity of rounds and finds that LLM inference exhibits a watershed layer, after which the distribution of round-level attention shows notable similarity. Based on this, we propose Round Attention, a novel round-level attention mechanism that selectively processes the KV cache of the top-k relevant rounds, where k is dynamically determined through the attention matrix in the watershed layer. Theoretical analysis demonstrates that our method reduces memory usage by 54% to 82%, while…
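The round-level selection described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not say how k is derived from the watershed-layer attention matrix, so the cumulative-mass threshold used here (`mass`) is an assumption, as are the function names and the choice to represent attention as a single head-averaged row of weights for the current query.

```python
from typing import List, Tuple


def round_scores(attn: List[float], round_spans: List[Tuple[int, int]]) -> List[float]:
    """Sum per-token attention weights into one score per dialogue round.

    attn: weights of the current query over all cached tokens, assumed to be
          taken at the watershed layer and already averaged over heads.
    round_spans: half-open (start, end) token index ranges, one per round.
    """
    return [sum(attn[start:end]) for start, end in round_spans]


def select_rounds(scores: List[float], mass: float = 0.9) -> List[int]:
    """Keep the highest-scoring rounds until they cover `mass` of the total.

    ASSUMPTION: the cumulative-mass rule is hypothetical. Here k is simply
    the number of rounds needed to reach the threshold.
    """
    total = sum(scores)
    if total <= 0:
        return list(range(len(scores)))  # nothing to rank, keep every round
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    covered = 0.0
    for i in ranked:
        kept.append(i)
        covered += scores[i]
        if covered / total >= mass:
            break
    return sorted(kept)


def kept_token_indices(round_spans: List[Tuple[int, int]], kept_rounds: List[int]) -> List[int]:
    """Token positions whose KV entries would stay in the sparse cache."""
    return [t for r in kept_rounds for t in range(*round_spans[r])]
```

With four rounds of two tokens each and attention weights `[0.05, 0.05, 0.4, 0.3, 0.02, 0.03, 0.1, 0.05]`, calling `select_rounds(round_scores(attn, spans), mass=0.8)` keeps rounds 1 and 3, so only token positions 2, 3, 6, and 7 would remain in the cache for the later layers.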
Taxonomy
Topics: Neural Networks and Applications · Seismology and Earthquake Studies · Advanced Electrical Measurement Techniques
Methods: Softmax · Attention Is All You Need
