ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

Youpeng Zhao; Di Wu; Jun Wang

arXiv:2403.17312·cs.AI·March 27, 2024·1 cites

TL;DR

ALISA is an algorithm-system co-design that improves large language model inference efficiency through sparse attention and dynamic scheduling, raising throughput by up to 3X on single GPU-CPU systems.

Contribution

ALISA's novel sparsity-aware KV caching and dynamic scheduling techniques reduce memory use and improve inference speed for LLMs on resource-constrained systems.

Findings

01. Up to 3X throughput improvement on single GPU-CPU systems.

02. Reduces memory footprint with negligible accuracy loss.

03. Effective in varying workload scenarios.

Abstract

The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenges in practical inference because of their compute- and memory-intensive nature. Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation with linear-complexity memory accesses. Yet this approach requires more memory as the sequences being processed grow longer. This overhead reduces throughput due to I/O bottlenecks and can even cause out-of-memory errors, particularly on resource-constrained systems like a single commodity GPU. In this…
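The trade-off the abstract describes can be illustrated with a minimal sketch. This is generic KV caching for a single attention head, not ALISA's sparsity-aware method. As simplifying assumptions, the query, key, and value of each token are taken to be the token vector itself (no learned projections), and the function names `attend`, `decode_without_cache`, `decode_with_cache`, and `kv_cache_bytes` are hypothetical. The sketch counts attention-score computations per decoding run and estimates cache size for standard multi-head attention, where each layer stores one key and one value vector of the hidden size per token.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode_without_cache(tokens):
    """At each step, recompute causal attention for every prefix position."""
    outputs, score_count = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        last = None
        for i in range(t):
            # position i attends to positions 0..i: i + 1 scores
            last = attend(prefix[i], prefix[:i + 1], prefix[:i + 1])
            score_count += i + 1
        outputs.append(last)  # only the newest position's output is needed
    return outputs, score_count


def decode_with_cache(tokens):
    """Keep past keys/values; each step computes attention for one new query."""
    k_cache, v_cache = [], []
    outputs, score_count = [], 0
    for x in tokens:
        k_cache.append(x)  # assumption: key = value = token vector
        v_cache.append(x)
        outputs.append(attend(x, k_cache, v_cache))
        score_count += len(k_cache)  # one score per cached key
    return outputs, score_count


def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    """Cache size: keys and values (factor 2) for every layer, token, and sequence."""
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    toks = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
    _, slow = decode_without_cache(toks)
    _, fast = decode_with_cache(toks)
    print("scores without cache:", slow, "| with cache:", fast)
    # Hypothetical configuration: 32 layers, hidden size 4096, 2048 tokens, FP16.
    print("cache bytes:", kv_cache_bytes(32, 4096, 2048, 1))
```

Both decoders produce the same outputs; the cached version trades recomputation for stored keys and values, and `kv_cache_bytes` shows that this storage grows linearly with sequence length and batch size.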

Taxonomy

Topics: Topic Modeling · Network Packet Processing and Optimization · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Softmax · Dropout