Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

Yaohua Tang; Zhicheng Hu; Kun Cheng; Fan Mo; Qiheng Lv; Hua Wang; Zhi Chen

arXiv:2502.15294 · cs.CL · June 30, 2025

TL;DR

This paper introduces Round Attention, a mechanism that processes only the KV cache of the most relevant dialogue rounds during LLM inference. By the authors' theoretical analysis, it reduces memory usage by 54% to 82% while maintaining answer accuracy.

Contribution

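The paper proposes a round-level attention mechanism that selects the top-k relevant dialogue rounds, with k determined dynamically from the attention matrix in a watershed layer. Only the KV cache of the selected rounds is processed, which reduces memory consumption during LLM inference without sacrificing answer accuracy.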

Findings

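01 Reduces memory usage by 54% to 82%, according to the paper's theoretical analysis.

02 Maintains answer accuracy with a sparse KV cache.

03 Identifies a watershed layer in dialogue data, after which the distribution of round-level attention is notably similar across layers.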
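
To give the memory figures some scale, the sketch below computes the size of a dense KV cache using the standard formula (keys plus values, per layer, per head, per token) and shows what a 54% or 82% reduction would leave. The model dimensions are hypothetical and are not taken from the paper, and the sketch does not reproduce the paper's own analysis; it only illustrates the arithmetic.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Size of a dense KV cache: one key and one value vector
    per token, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def after_reduction(total_bytes, reduction):
    """Bytes remaining after cutting the cache by `reduction` (0.0 to 1.0)."""
    return total_bytes * (1.0 - reduction)


if __name__ == "__main__":
    # Hypothetical configuration (an assumption, not from the paper):
    # 32 layers, 32 KV heads, head dimension 128, fp16, 8192 cached tokens.
    total = kv_cache_bytes(8192, 32, 32, 128, 2)
    gib = 1024 ** 3
    print(f"dense cache: {total / gib:.2f} GiB")
    for reduction in (0.54, 0.82):
        left = after_reduction(total, reduction)
        print(f"after {reduction:.0%} reduction: {left / gib:.2f} GiB")
```

Because the cache grows linearly with the number of cached tokens, dropping whole rounds from it shrinks memory in proportion to the tokens those rounds contain.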

Abstract

The increasing context window size in large language models (LLMs) has improved their ability to handle complex, long-text tasks. However, as the conversation rounds continue, it is required to store a large amount of KV cache in GPU memory, which significantly affects the efficiency and even availability of the model serving systems. This paper analyzes dialogue data from real users on the granularity of round and discovers that the LLM inference manifests a watershed layer, after which the distribution of round-level attention shows notable similarity. Based on this, we propose Round Attention - a novel round-level attention mechanism that selectively processes the KV cache of top-k relevant rounds, where k is dynamically determined through the attention matrix in the watershed layer. Theoretical analysis demonstrates that our method reduces memory usage by 54% to 82%, while…
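
The round-selection idea in the abstract can be illustrated with a minimal sketch. It aggregates token-level attention weights into one score per dialogue round and keeps the highest-scoring rounds. The abstract does not say how k is derived from the watershed-layer attention matrix, so the rule used here (keep the smallest set of rounds whose cumulative attention mass reaches a threshold) is an assumption, as are the names `round_scores`, `select_rounds`, `attn_row`, `round_bounds`, and `mass`.

```python
import numpy as np


def round_scores(attn_row, round_bounds):
    """Sum the attention weight that the current query places on the
    tokens of each round. `round_bounds` holds (start, end) token
    index pairs, end exclusive."""
    return np.array([attn_row[s:e].sum() for s, e in round_bounds])


def select_rounds(attn_row, round_bounds, mass=0.9):
    """Return the indices of the rounds to keep, in dialogue order.

    Assumed rule (not from the paper): k is the smallest number of
    top-scoring rounds whose normalized scores sum to at least `mass`.
    """
    scores = round_scores(attn_row, round_bounds)
    probs = scores / scores.sum()
    order = np.argsort(-probs, kind="stable")  # highest score first
    cumulative = np.cumsum(probs[order])
    k = min(int(np.searchsorted(cumulative, mass)) + 1, len(order))
    return sorted(order[:k].tolist())


if __name__ == "__main__":
    # Hypothetical attention weights over six cached tokens, standing in
    # for a head-averaged row of the watershed-layer attention matrix.
    attn_row = np.array([0.0625, 0.0625, 0.375, 0.25, 0.125, 0.125])
    round_bounds = [(0, 2), (2, 4), (4, 6)]  # three rounds of two tokens
    print(select_rounds(attn_row, round_bounds, mass=0.8))
```

In this toy input the second round receives most of the attention, so a threshold of 0.8 keeps the second and third rounds and drops the first; only the KV entries of the kept rounds would then be processed.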

Taxonomy

Topics: Neural Networks and Applications · Seismology and Earthquake Studies · Advanced Electrical Measurement Techniques

Methods: Softmax · Attention Is All You Need