Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

Bin Gao; Zhuomin He; Puru Sharma; Qingxuan Kang; Djordje Jevdjic; Junbo Deng; Xingkun Yang; Zhou Yu; Pengfei Zuo

arXiv:2403.19708·cs.CL·July 2, 2024·1 cites

Open Access

TL;DR

This paper introduces CachedAttention, an attention mechanism that reuses key-value (KV) caches across the turns of a multi-turn conversation, so that historical tokens are not recomputed on every turn. This cuts computation and serving cost while maintaining performance.

Contribution

CachedAttention provides a hierarchical caching system with layer-wise pre-loading, asynchronous saving, and scheduler-aware cache placement to improve efficiency in multi-turn LLM serving.
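The hierarchical caching idea can be illustrated with a minimal sketch. It assumes a small fast tier (for example host memory) backed by a larger slow tier (for example disk), and it uses plain LRU demotion as a stand-in for the paper's scheduler-aware placement, which is not reproduced here. The class `TieredKVCache` and its methods are hypothetical names, not part of any released implementation.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier store: a small fast tier backed by a larger slow tier.

    Assumption: LRU demotion replaces the paper's scheduler-aware placement.
    """

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # session_id -> KV data, oldest first
        self.slow = {}             # session_id -> KV data

    def save(self, session_id, kv):
        # New data lands in the fast tier; drop any stale slow-tier copy.
        self.slow.pop(session_id, None)
        self.fast[session_id] = kv
        self.fast.move_to_end(session_id)
        self._demote_if_full()

    def load(self, session_id):
        if session_id in self.fast:
            self.fast.move_to_end(session_id)  # mark as recently used
            return self.fast[session_id]
        if session_id in self.slow:
            # Promote to the fast tier on access.
            kv = self.slow.pop(session_id)
            self.fast[session_id] = kv
            self._demote_if_full()
            return kv
        return None  # miss: the caller has to recompute the KV cache

    def _demote_if_full(self):
        # Move least recently used sessions to the slow tier.
        while len(self.fast) > self.fast_capacity:
            victim, kv = self.fast.popitem(last=False)
            self.slow[victim] = kv
```

A session whose cache has been demoted is still served from the slow tier on its next turn rather than recomputed, which is the property that makes cheaper storage useful for conversations that pause between turns.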

Findings

01 Time to first token (TTFT) reduced by up to 87%

02 Prompt throughput increased by up to 7.8×

03 End-to-end inference cost reduced by up to 70%

Abstract

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs. To address the problem, this paper proposes CachedAttention, a new attention mechanism that enables reuse of KV caches across multi-turn conversations, significantly reducing the repetitive computation overheads. CachedAttention maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, CachedAttention employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV…
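The layer-wise pre-loading mentioned in the abstract can be sketched as a simple prefetch pipeline. This is only an illustration of the overlap idea, assuming a background thread stands in for an asynchronous transfer from slow storage. The names `run_layers`, `load_kv`, and `compute_layer` are hypothetical and do not come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers(num_layers, load_kv, compute_layer):
    """Run layers in order, fetching layer i+1's KV cache while layer i computes.

    load_kv(layer) and compute_layer(layer, kv) are caller-supplied
    callables (hypothetical stand-ins for storage reads and GPU work).
    """
    outputs = []
    if num_layers == 0:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_kv, 0)
        for layer in range(num_layers):
            # Blocks only when the load is slower than the previous compute.
            kv = pending.result()
            if layer + 1 < num_layers:
                # Start the next load before computing the current layer.
                pending = pool.submit(load_kv, layer + 1)
            outputs.append(compute_layer(layer, kv))
    return outputs
```

With this structure the load latency is hidden whenever fetching one layer's cache takes no longer than computing the previous layer. Otherwise the loop stalls at `pending.result()`.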

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks