Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, Hen-Hsen Huang

TL;DR
This paper introduces cache-augmented generation (CAG), an approach that leverages large language models' extended context to bypass retrieval, reducing latency and errors while maintaining performance in knowledge tasks.
Contribution
The paper proposes CAG as a retrieval-free alternative to RAG, utilizing preloaded knowledge in LLMs' context to improve efficiency and simplicity for certain applications.
Findings
CAG eliminates retrieval latency and document-selection errors.
CAG achieves performance comparable to or better than RAG.
CAG is effective when the knowledge base is of a limited, manageable size.
Abstract
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval. Our method involves preloading all relevant resources, especially when the documents or knowledge for retrieval are of a limited and manageable size, into the LLM's extended context and caching its runtime parameters. During inference, the model utilizes these preloaded parameters to answer queries without additional retrieval steps. Comparative analyses reveal that CAG eliminates retrieval…
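The preload-then-answer workflow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `encode`, `ToyKVCache`, `attend`, `preload`, and `answer` are hypothetical names, the hash-based `encode` stands in for a real model's key/value computation, and the truncation step that resets the cache after each query is an assumption about how the cache would be kept reusable. The sketch only shows that the knowledge base is encoded once and that each query reuses the cached state.

```python
import hashlib
import math


def encode(token):
    """Toy stand-in (assumption) for a model's per-token key/value computation."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return digest[0] / 255.0, digest[1] / 255.0  # (key, value)


class ToyKVCache:
    """Holds one (key, value) pair per context token, like a tiny KV cache."""

    def __init__(self):
        self.entries = []
        self.tokens_encoded = 0  # total encode() calls, to show the saved work

    def extend(self, text):
        for token in text.split():
            self.entries.append(encode(token))
            self.tokens_encoded += 1

    def truncate(self, length):
        del self.entries[length:]


def attend(cache):
    """Softmax attention of the last token's key over every cached entry."""
    query_key = cache.entries[-1][0]
    scores = [math.exp(query_key * key) for key, _ in cache.entries]
    total = sum(scores)
    return sum(s / total * value for s, (_, value) in zip(scores, cache.entries))


def preload(documents):
    """Offline step: encode the whole knowledge base once."""
    cache = ToyKVCache()
    for document in documents:
        cache.extend(document)
    return cache, len(cache.entries)


def answer(cache, knowledge_length, query):
    """Online step: append the query to the cached context, then reset."""
    cache.extend(query)
    output = attend(cache)
    cache.truncate(knowledge_length)  # drop query tokens so the cache is reusable
    return output


documents = [
    "CAG preloads documents into the context",
    "RAG retrieves documents per query",
]
cache, knowledge_length = preload(documents)
first = answer(cache, knowledge_length, "what does CAG preload")
second = answer(cache, knowledge_length, "what does RAG retrieve")
```

After both calls, the cache again holds only the knowledge tokens, and only the query tokens were encoded at answer time. With a real LLM, the cached state would be the model's key/value tensors for the preloaded documents rather than the toy pairs used here.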
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Intelligent Tutoring Systems and Adaptive Learning
Methods: Linear Layer · Residual Connection · Adam · Weight Decay · Multi-Head Attention · Layer Normalization · Heatmap · WordPiece · Dropout
