Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Jiaming Tang; Yilong Zhao; Kan Zhu; Guangxuan Xiao; Baris Kasikci; Song Han

arXiv:2406.10774 · cs.CL · August 28, 2024 · 1 citation

TL;DR

Quest introduces a query-aware method to selectively load only the most critical KV cache pages during long-context LLM inference, significantly accelerating self-attention and reducing latency with minimal accuracy loss.

Contribution

The paper presents a novel query-aware KV cache selection algorithm that improves inference efficiency for long-context LLMs by focusing on critical tokens based on query information.

Findings

1. Achieves up to 2.23x speedup in self-attention
2. Reduces inference latency by 7.03x
3. Maintains accuracy on long-dependency tasks

Abstract

As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without…
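The page-selection idea in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation (the official code is in the mit-han-lab/quest repository). It assumes that a page's criticality is estimated as an upper bound on the query-key dot product, computed from the per-channel minimum and maximum Key values of that page. All function names, the fixed `page_size`, and the plain-list data layout are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def page_min_max(keys: Sequence[Vector], page_size: int) -> List[Tuple[List[float], List[float]]]:
    """Split keys into fixed-size pages and record per-channel (min, max) for each page."""
    pages = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        mins = [min(channel) for channel in zip(*page)]
        maxs = [max(channel) for channel in zip(*page)]
        pages.append((mins, maxs))
    return pages


def page_upper_bound(query: Vector, mins: Vector, maxs: Vector) -> float:
    """Upper bound on q.k for any key in the page (assumed criticality estimate)."""
    # Per channel, the largest possible product is q*min or q*max, depending on the sign of q.
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_top_k_pages(query: Vector, pages, k: int) -> List[int]:
    """Return the indices of the k pages with the highest estimated criticality."""
    scores = [page_upper_bound(query, lo, hi) for lo, hi in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query: Vector, keys, values, page_size: int, k: int) -> List[float]:
    """Softmax attention restricted to the tokens of the selected pages."""
    pages = page_min_max(keys, page_size)
    chosen = select_top_k_pages(query, pages, k)
    token_ids = [
        t
        for p in chosen
        for t in range(p * page_size, min((p + 1) * page_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[t])) for t in token_ids]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(dim)
    ]
```

In this sketch the same set of pages ranks differently for different queries, which is the query dependence the abstract describes. Setting `k` to the total number of pages reduces `sparse_attention` to ordinary dense attention over all tokens.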

Code & Models

Repositories

mit-han-lab/quest (PyTorch, official)
