Inference-Friendly Models With MixAttention
Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley

TL;DR
MixAttention is a novel model architecture that combines sliding window attention with shared KV caches, significantly reducing memory use and increasing inference speed in language models without losing performance.
Contribution
This work studies MixAttention, an architecture modification that combines sliding window attention with KV cache sharing across layers to reduce inference memory consumption while maintaining model accuracy.
Findings
Reduces memory usage during inference
Speeds up inference without loss of model quality
Effective for both short and long-context tasks
Abstract
The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog post published by Character.AI. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of…
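As a rough illustration of why the two ideas in the abstract reduce memory, the sketch below estimates KV cache size for a decoder stack. It is a back-of-the-envelope model, not the authors' implementation: `LayerSpec`, `kv_cache_bytes`, the 24-layer layout, the window size of 1024, and the head dimensions are all hypothetical values chosen for the example, and the layout is not one of the configurations evaluated in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayerSpec:
    # Hypothetical per-layer description, for illustration only.
    window: Optional[int] = None          # None = full attention; int = sliding window size in tokens
    shares_kv_with: Optional[int] = None  # index of an earlier layer whose KV cache is reused


def kv_cache_bytes(layers: List[LayerSpec], seq_len: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Estimate the total KV cache size in bytes for one forward context."""
    total = 0
    for spec in layers:
        if spec.shares_kv_with is not None:
            # This layer reads another layer's cache and stores nothing itself.
            continue
        # A sliding-window layer keeps at most `window` recent tokens.
        tokens = seq_len if spec.window is None else min(seq_len, spec.window)
        # Factor 2 accounts for storing both keys and values.
        total += 2 * batch * tokens * n_kv_heads * head_dim * bytes_per_elem
    return total


# Baseline: 24 full-attention layers, each with its own cache.
baseline = [LayerSpec() for _ in range(24)]

# Assumed mixed layout: one full-attention cache, one sliding-window cache,
# and 22 sliding-window layers that reuse the cache of layer 1.
mixed = ([LayerSpec(), LayerSpec(window=1024)]
         + [LayerSpec(window=1024, shares_kv_with=1) for _ in range(22)])

cfg = dict(seq_len=32768, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
base_bytes = kv_cache_bytes(baseline, **cfg)
mixed_bytes = kv_cache_bytes(mixed, **cfg)

print(f"baseline: {base_bytes / 2**20:.0f} MiB")
print(f"mixed:    {mixed_bytes / 2**20:.0f} MiB")
print(f"ratio:    {base_bytes / mixed_bytes:.1f}x")
```

Under these assumed numbers, the full-attention cache dominates the mixed layout's footprint, which is why the number and placement of layers that keep a standard cache matters for the memory budget. The estimate ignores model quality entirely.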
Peer Reviews
Decision: Submitted to ICLR 2025
The idea is simple and clear, the experimental setup is also quite clear.
1. This paper lacks innovation; both the recent window and multi-layer attention are established techniques. The paper simply combines these two methods without any improvements.
2. The experimental results are presented solely as bar charts. I believe it would be beneficial to include a table with some precise values.
3. This paper resembles more of a technical report rather than an innovative and well-developed research paper, which does not meet the high standards of ICLR.
1. The combination of sparsifying the token of sequence and sharing the KV cache across layers seems to be a promising method to reduce the inference cost. This paper conducts some interesting experiments, from pre-training to evaluation, to give us some insights regarding the impact of different choices of the setups of such combination.
2. The experiment setup is reasonably designed.
1. The novelty is limited in two ways. Firstly, it is a straightforward combination of two existing techniques without many adjustments. Secondly, this combination has already been explicitly described in the blog of character.ai, as cited by the authors.
2. I can get that the value of this paper is to provide some empirical guidelines of this combination method, but still, the new information brought by this paper is also limited. For example, “…having the standard KV cache computed in the deep
- Cache sharing across layers has not been extensively studied and ablated over, and so this paper provides additional sample points that show the relationship between cache sharing approach and performance.
- The authors tested their results on RULER which is a long-context benchmark and more conventional evals such as MMLU and HellaSwag through the Gauntlet evals framework which unveils differences in performance between different KV-cache sharing approaches.
- Some of these KV-cache sharing
- Lack of insight or discussion as to why certain cache-sharing approaches perform better or worse.
- The paper lacks novelty, as it mostly relies on architectural configurations proposed by a blog by CharacterAI [1], and as a consequence, it lacks explanation as to why these configurations were selected in the first place.
- In general, the main critique is that the paper presents only surface level analysis of the observations and does not contribute much to a deeper understanding of why certa
Taxonomy
Topics: Machine Learning and Data Classification · Text and Document Classification Technologies · Machine Learning and Algorithms
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
