Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation

Shubham Agarwal; Sai Sundaresan; Subrata Mitra; Debabrata Mahapatra; Archit Gupta; Rounak Sharma; Nirmal Joshua Kapu; Tong Yu; Shiv Saini

arXiv:2502.15734·cs.DC·February 25, 2025

Open Access

TL;DR

Cache-Craft introduces a system to manage and reuse key-value caches in retrieval-augmented generation, significantly reducing redundant computation and latency while maintaining output quality in large language models.

Contribution

The paper presents a novel system for identifying, managing, and efficiently reusing chunk-caches in RAG systems, improving computational efficiency and response times.

Findings

01 Reduces redundant computation by up to 75%.

02 Achieves 1.6X throughput speedup and 2X latency reduction.

03 Maintains output quality with cache management strategies.

Abstract

Retrieval-Augmented Generation (RAG) is often used with Large Language Models (LLMs) to infuse domain knowledge or user-specific information. In RAG, given a user query, a retriever extracts chunks of relevant text from a knowledge base. These chunks are sent to an LLM as part of the input prompt. Typically, any given chunk is repeatedly retrieved across user questions. However, currently, for every question, attention layers in LLMs fully compute the key values (KVs) repeatedly for the input chunks, as state-of-the-art methods cannot reuse KV-caches when chunks appear at arbitrary locations with arbitrary contexts. Naive reuse leads to output quality degradation. The repeated computation is potentially redundant work on expensive GPUs and increases latency. In this work, we propose Cache-Craft, a system for managing and reusing precomputed KVs corresponding to the text chunks (we call…
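The redundancy the abstract describes can be made concrete with a toy count. The sketch below is not Cache-Craft's method: it only tallies how many chunk tokens would need fresh KV computation with and without a chunk-keyed cache, assuming whitespace-split words stand in for tokens and that a cached chunk can be reused as-is. That second assumption is exactly the naive reuse the abstract says degrades output quality, so the numbers show the size of the opportunity, not an achievable result. The names `chunk_key` and `count_prefill_tokens` and the sample queries are hypothetical.

```python
import hashlib


def chunk_key(text: str) -> str:
    """Hypothetical cache key: a content hash of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def count_prefill_tokens(queries, reuse: bool) -> int:
    """Count chunk tokens that need fresh KV computation.

    queries: list of retrieved-chunk lists, one list per user question.
    reuse:   if True, a chunk seen in an earlier query is treated as cached
             (naive reuse, ignoring position and surrounding context).
    """
    seen = set()
    total = 0
    for chunks in queries:
        for chunk in chunks:
            key = chunk_key(chunk)
            if reuse and key in seen:
                continue  # assumed cache hit: no recomputation counted
            total += len(chunk.split())  # whitespace words as stand-in tokens
            seen.add(key)
    return total


# Illustrative data: the same chunks recur at different positions.
queries = [
    ["alpha beta gamma", "delta epsilon"],
    ["delta epsilon", "alpha beta gamma"],
    ["alpha beta gamma", "zeta eta theta iota"],
]

baseline = count_prefill_tokens(queries, reuse=False)
with_reuse = count_prefill_tokens(queries, reuse=True)
saved_fraction = 1 - with_reuse / baseline
print(baseline, with_reuse, round(saved_fraction, 2))
```

In this toy data the recurring chunks appear at different positions and next to different neighbours across queries, which is the situation in which, per the abstract, existing methods cannot reuse KV-caches directly.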

Taxonomy

Topics: Caching and Content Delivery · Algorithms and Data Compression · Advanced Data Storage Technologies

Methods: Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection