ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
Youpeng Zhao, Di Wu, Jun Wang

TL;DR
ALISA is a combined algorithm and system approach that enhances large language model inference efficiency by introducing sparse attention and dynamic scheduling, significantly boosting throughput on limited hardware.
Contribution
ALISA's novel sparsity-aware KV caching and dynamic scheduling techniques reduce memory use and improve inference speed for LLMs on resource-constrained systems.
Findings
Up to 3X throughput improvement on single GPU-CPU systems.
Reduces memory footprint with negligible accuracy loss.
Effective in varying workload scenarios.
Abstract
The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenges in practical inference, concerning the compute and memory-intensive nature. Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation with linear-complexity memory accesses. Yet, this approach requires increasing memory as demand grows for processing longer sequences. The overhead leads to reduced throughput due to I/O bottlenecks and even out-of-memory errors, particularly on resource-constrained systems like a single commodity GPU. In this…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Network Packet Processing and Optimization · Natural Language Processing Techniques
MethodsAttention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Softmax · Dropout
