Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks

Brian J Chan; Chao-Ting Chen; Jui-Hung Cheng; Hen-Hsen Huang

arXiv:2412.15605·cs.CL·February 25, 2025

TL;DR

This paper introduces cache-augmented generation (CAG), an approach that preloads knowledge into a large language model's extended context window instead of retrieving it at query time. This removes retrieval latency and retrieval errors while maintaining performance on knowledge tasks.

Contribution

The paper proposes CAG as a retrieval-free alternative to RAG. It preloads knowledge into the LLM's context to improve efficiency and reduce system complexity in applications where the knowledge base is of a limited and manageable size.

Findings

01 CAG eliminates retrieval latency and retrieval errors.

02 CAG achieves performance comparable to or better than RAG.

03 CAG is effective when the knowledge base is limited in size.

Abstract

Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval. Our method involves preloading all relevant resources into the LLM's extended context and caching its runtime parameters, which is practical especially when the documents or knowledge for retrieval are of a limited and manageable size. During inference, the model utilizes these preloaded parameters to answer queries without additional retrieval steps. Comparative analyses reveal that CAG eliminates retrieval…
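The preload-then-reuse idea in the abstract can be illustrated with a toy sketch. It assumes that the cached "runtime parameters" correspond to attention key/value states, and it uses a single attention layer with one head in pure Python. The weight matrices, the token embeddings, and every function name below are hypothetical and chosen only for illustration; this is not the authors' implementation, which operates on a real LLM.

```python
import math

# Hypothetical 2x2 projection weights; a real LLM learns these per layer and head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.5], [-0.5, 0.5]]
W_V = [[1.0, 2.0], [0.0, 1.0]]

# Hypothetical embeddings standing in for the preloaded documents and one query.
DOC_TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
QUERY_TOKENS = [[0.5, -0.5], [0.2, 0.9]]


def project(weights, token):
    """Multiply a weight matrix by one token embedding."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def build_cache(doc_tokens):
    """One-time preload: compute keys and values for every document token."""
    keys = [project(W_K, t) for t in doc_tokens]
    values = [project(W_V, t) for t in doc_tokens]
    return keys, values


def answer_with_cache(cache, query_tokens):
    """Process only the query tokens, reusing the cached document states."""
    # Copy the lists so the same cache can serve later queries unchanged.
    keys, values = list(cache[0]), list(cache[1])
    outputs = []
    for token in query_tokens:
        keys.append(project(W_K, token))
        values.append(project(W_V, token))
        outputs.append(attend(project(W_Q, token), keys, values))
    return outputs


def full_causal_attention(tokens):
    """Baseline: recompute causal attention over documents plus query."""
    keys = [project(W_K, t) for t in tokens]
    values = [project(W_V, t) for t in tokens]
    return [attend(project(W_Q, t), keys[:i + 1], values[:i + 1])
            for i, t in enumerate(tokens)]


if __name__ == "__main__":
    cache = build_cache(DOC_TOKENS)  # paid once, before any query arrives
    cached = answer_with_cache(cache, QUERY_TOKENS)
    full = full_causal_attention(DOC_TOKENS + QUERY_TOKENS)[-len(QUERY_TOKENS):]
    print("cached:", cached)
    print("full:  ", full)
```

In this sketch the per-query work touches only the query tokens, while the document keys and values are computed once in `build_cache`. Under causal attention the document states do not depend on the query, which is why the cached path and the full recomputation should agree on the query positions.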

Code & Models

Repositories

hhhuang/cag (PyTorch, official)

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Intelligent Tutoring Systems and Adaptive Learning

Methods: Linear Layer · Residual Connection · Adam · Weight Decay · Multi-Head Attention · Layer Normalization · Heatmap · WordPiece · Dropout