Inference-Friendly Models With MixAttention

Shashank Rajput; Ying Sheng; Sean Owen; Vitaliy Chiley

arXiv:2409.15012·cs.CL·September 24, 2024

TL;DR

MixAttention is a novel model architecture that combines sliding window attention with shared KV caches, significantly reducing memory use and increasing inference speed in language models without losing performance.

Contribution

This work studies MixAttention, an architecture modification that combines sliding window attention with KV cache sharing across layers, improving inference efficiency by reducing memory consumption while maintaining model accuracy.

Findings

1. Reduces memory usage during inference
2. Speeds up inference without performance loss
3. Effective for both short and long-context tasks

Abstract

The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog published by Character.AI. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of…
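The memory argument in the abstract can be made concrete with a rough back-of-the-envelope sketch. The code below is not the authors' implementation: `LayerSpec`, `kv_cache_bytes`, the 24-layer layout, the window of 1024 tokens, and the head counts are all hypothetical values chosen for illustration. It assumes a layer with standard attention caches every token, a sliding window layer caches at most `window` tokens, and a layer that reuses another layer's KV cache stores nothing of its own.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LayerSpec:
    """One attention layer in a hypothetical MixAttention-style stack."""
    window: Optional[int] = None             # None = standard attention; else window size in tokens
    shares_cache_with: Optional[int] = None  # index of an earlier layer whose KV cache is reused


def kv_cache_bytes(layers: Sequence[LayerSpec], seq_len: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2, batch_size: int = 1) -> int:
    """Estimate KV cache size in bytes for one batch of sequences."""
    cached_tokens = 0
    for i, layer in enumerate(layers):
        src = layer.shares_cache_with
        if src is not None:
            # A sharing layer must point at an earlier layer that owns its cache.
            if not 0 <= src < i or layers[src].shares_cache_with is not None:
                raise ValueError(f"layer {i} has an invalid cache source {src}")
            continue  # no extra memory: the cache is reused
        # Sliding window layers keep only the most recent `window` tokens.
        cached_tokens += seq_len if layer.window is None else min(seq_len, layer.window)
    # Factor 2 accounts for storing both keys and values.
    return 2 * batch_size * n_kv_heads * head_dim * bytes_per_elem * cached_tokens


# Baseline: 24 layers of standard attention, each with its own KV cache.
baseline = [LayerSpec() for _ in range(24)]

# Illustrative mixed layout (an assumption, not a configuration from the paper):
# 2 standard-attention layers and 4 sliding window layers own caches; the rest reuse them.
mixed = (
    [LayerSpec()]
    + [LayerSpec(window=1024) for _ in range(4)]
    + [LayerSpec(window=1024, shares_cache_with=1 + i % 4) for i in range(7)]
    + [LayerSpec()]
    + [LayerSpec(window=1024, shares_cache_with=1 + i % 4) for i in range(11)]
)

shape = dict(seq_len=32768, n_kv_heads=8, head_dim=128)  # 16-bit cache elements by default
baseline_bytes = kv_cache_bytes(baseline, **shape)
mixed_bytes = kv_cache_bytes(mixed, **shape)
print(f"baseline: {baseline_bytes / 2**20:.0f} MiB, mixed: {mixed_bytes / 2**20:.0f} MiB")
```

Under these assumed numbers, the standard-attention layers dominate the remaining cache, which is why the number and placement of such layers is the main lever on memory in a layout like this. The sketch says nothing about model quality, which the paper evaluates separately.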

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 1 · Confidence 4

Strengths

The idea is simple and clear, the experimental setup is also quite clear.

Weaknesses

1. This paper lacks innovation; both the recent window and multi-layer attention are established techniques. The paper simply combines these two methods without any improvements.
2. The experimental results are presented solely as bar charts. I believe it would be beneficial to include a table with some precise values.
3. This paper resembles more of a technical report rather than an innovative and well-developed research paper, which does not meet the high standards of ICLR.

Reviewer 02 · Rating 3 · Confidence 5

Strengths

1. The combination of sparsifying the token of sequence and sharing the KV cache across layers seems to be a promising method to reduce the inference cost. This paper conducts some interesting experiments, from pre-training to evaluation, to give us some insights regarding the impact of different choices of the setups of such combination.
2. The experiment setup is reasonably designed.

Weaknesses

1. The novelty is limited in two ways. Firstly, it is a straightforward combination of two existing techniques without many adjustments. Secondly, this combination has already been explicitly described in the blog of character.ai, as cited by the authors.
2. I can get that the value of this paper is to provide some empirical guidelines of this combination method, but still, the new information brought by this paper is also limited. For example, “…having the standard KV cache computed in the deep

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- Cache sharing across layers has not been extensively studied and ablated over, and so this paper provides additional sample points that show the relationship between cache sharing approach and performance.
- The authors tested their results on RULER which is a long-context benchmark and more conventional evals such as MMLU and HellaSwag through the Gauntlet evals framework which unveils differences in performance between different KV-cache sharing approaches.
- Some of these KV-cache sharing

Weaknesses

- Lack of insight or discussion as to why certain cache-sharing approaches perform better or worse.
- The paper lacks novelty, as it mostly relies on architectural configurations proposed by a blog by CharacterAI [1], and as a consequence, it lacks explanation as to why these configurations were selected in the first place.
- In general, the main critique is that the paper presents only surface level analysis of the observations and does not contribute much to a deeper understanding of why certa

Code & Models

Repositories

whyNLP/LCKV
pytorch

Taxonomy

Topics: Machine Learning and Data Classification · Text and Document Classification Technologies · Machine Learning and Algorithms

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings