RelayAttention for Efficient Large Language Model Serving with Long System Prompts

Lei Zhu; Xinjiang Wang; Wayne Zhang; Rynson W.H. Lau

arXiv:2402.14808·cs.CL·May 31, 2024·1 cites

TL;DR

RelayAttention is an attention algorithm that removes redundant memory accesses to the cached system prompt when serving batched LLM requests with long system prompts, improving serving efficiency without retraining the model.

Contribution

The paper introduces RelayAttention, a reformulation of causal attention that reduces redundant memory accesses and thereby improves the efficiency of LLM services with long system prompts.

Findings

01 Significant performance improvements in the vLLM system
02 Efficiency gains increase with longer system prompts
03 No model retraining required
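
To see why the gains would grow with the system prompt length, the sketch below counts how many cached key-value tokens are read from DRAM in one decoding step under a deliberately simplified model. The function names, the cost model (it ignores model weights, compute, and kernel overheads), and the example sizes are illustrative assumptions, not figures or code from the paper.

```python
def kv_tokens_read(batch_size, sys_len, req_len, shared):
    """Cached KV tokens read from DRAM in one decoding step (simplified model).

    Assumption: without sharing, every request in the batch re-reads the
    system prompt's KV cache; with sharing, it is read once per batch.
    """
    if shared:
        return sys_len + batch_size * req_len
    return batch_size * (sys_len + req_len)


def read_reduction(batch_size, sys_len, req_len):
    """Ratio of KV tokens read without sharing to KV tokens read with sharing."""
    baseline = kv_tokens_read(batch_size, sys_len, req_len, shared=False)
    relayed = kv_tokens_read(batch_size, sys_len, req_len, shared=True)
    return baseline / relayed


if __name__ == "__main__":
    # Hypothetical sizes: 32 requests, 128 request-specific tokens each.
    for sys_len in (512, 2048):
        print(sys_len, round(read_reduction(32, sys_len, 128), 2))
```

Under this model the ratio rises as `sys_len` grows relative to `req_len`, which is consistent with the second finding; actual speedups depend on factors the model leaves out.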

Abstract

A practical large language model (LLM) service may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across requests. However, the long system prompt causes throughput/latency bottlenecks as the cost of generating the next token grows w.r.t. the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires heavily redundant memory accesses in existing causal attention computation algorithms. Specifically, for batched requests, the cached hidden states (i.e., key-value pairs) of system prompts are transferred from off-chip DRAM to on-chip SRAM multiple times, each corresponding to an individual request. To eliminate such a redundancy, we propose RelayAttention, an attention algorithm that allows reading…
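
The abstract is cut off before it describes the algorithm, so the following is only a sketch of one way causal attention can be reformulated so that shared key-value pairs are handled separately. It assumes the attention for a decoding step is split into a system-prompt part and a request-specific part, which are then merged using their softmax normalizers. All function names are hypothetical, and this is not the authors' implementation.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention of one query over (keys, values).

    Returns the output vector and the log-sum-exp of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / denom
           for j in range(dim)]
    return out, m + math.log(denom)


def merged_attention(q, sys_kv, req_kv):
    """Attend to the shared and request-specific KV separately, then merge."""
    o_sys, lse_sys = attend(q, *sys_kv)
    o_req, lse_req = attend(q, *req_kv)
    # Share of the total softmax mass that falls on the system prompt.
    alpha = 1.0 / (1.0 + math.exp(lse_req - lse_sys))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(o_sys, o_req)]


def full_attention(q, sys_kv, req_kv):
    """Reference: attention over the concatenated KV sequence."""
    keys = sys_kv[0] + req_kv[0]
    values = sys_kv[1] + req_kv[1]
    return attend(q, keys, values)[0]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(seed=0, dim=8, sys_len=16, req_len=5):
    """Largest elementwise gap between merged and full attention outputs."""
    rng = random.Random(seed)
    sys_kv = (rand_matrix(rng, sys_len, dim), rand_matrix(rng, sys_len, dim))
    req_kv = (rand_matrix(rng, req_len, dim), rand_matrix(rng, req_len, dim))
    q = rand_matrix(rng, 1, dim)[0]
    merged = merged_attention(q, sys_kv, req_kv)
    full = full_attention(q, sys_kv, req_kv)
    return max(abs(a - b) for a, b in zip(merged, full))


if __name__ == "__main__":
    print(max_abs_diff())  # expected to be at floating-point noise level
```

Because the merge is exact up to rounding, the split changes only how the computation is organized, which fits the claim that no retraining is needed. In a batched kernel, the system-prompt part could in principle be computed for all requests together so that the shared key-value pairs are read once per batch; this pure-Python version checks only the arithmetic, not the memory behavior.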

Code & Models

Repositories

rayleizhu/vllm-ra
PyTorch · Official

Videos

RelayAttention for Efficient Large Language Model Serving with Long System Prompts · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques