XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian

TL;DR
This paper introduces XC-Cache, a method that uses cross-attention to condition large language models on reference text without placing that text in the prompt. This shrinks the cache footprint and, on QA tasks, outperforms traditional in-context learning.
Contribution
The paper proposes a novel cross-attention based approach inspired by encoder-decoder models, enabling efficient conditioning of decoder-only models without prompt-based caching.
Findings
Outperforms traditional in-context learning in QA tasks
Reduces cache space by two orders of magnitude
Achieves performance comparable to fine-tuned models
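The cache-space finding can be put in perspective with rough arithmetic. The sketch below is illustrative only: the model dimensions (32 layers, hidden size 4096, 16-bit values, roughly 7 billion parameters) are hypothetical and not taken from the paper, and the "one vector per token" alternative is an assumed stand-in for a more compact cache, not a description of the authors' design.

```python
def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    """Size of a standard key/value cache: 2 tensors (K and V) per layer,
    each holding n_tokens x d_model values. Assumes full multi-head
    attention (no grouped-query sharing)."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_value


def single_vector_cache_bytes(n_tokens, d_model, bytes_per_value=2):
    """Hypothetical compact cache: one d_model vector per context token."""
    return n_tokens * d_model * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
N_LAYERS, D_MODEL, BYTES = 32, 4096, 2
PARAM_BYTES = 7_000_000_000 * BYTES  # ~7B parameters stored in 16 bits

per_token = kv_cache_bytes(1, N_LAYERS, D_MODEL, BYTES)
tokens_to_match_params = PARAM_BYTES // per_token
ratio = per_token // single_vector_cache_bytes(1, D_MODEL, BYTES)

print(f"KV cache per token: {per_token} bytes")
print(f"Cached tokens needed to match parameter size: {tokens_to_match_params}")
print(f"KV cache vs. one vector per token: {ratio}x larger")
```

Under these assumptions, a few tens of thousands of cached context tokens already occupy as much memory as the weights, and storing a single vector per token instead of per-layer keys and values saves a factor of 2 × (number of layers).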
Abstract
In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context isn't known in advance, caching ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they…
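The core mechanism the abstract names, cross-attention over a cached context, can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not the authors' implementation: it omits learned projections, multiple heads, and the added trainable layers, and all vectors are made-up toy values. The function name `cross_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, cached_keys, cached_values):
    """Scaled dot-product cross-attention.

    queries:       vectors from the text being generated (e.g. the question)
    cached_keys,
    cached_values: vectors computed once from the reference context
    Returns one output vector per query, each a weighted average of the
    cached values.
    """
    d = len(cached_keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in cached_keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, cached_values))
            for j in range(len(cached_values[0]))
        ])
    return outputs


# Toy cached context: 3 context tokens, dimension 2 (illustrative values).
cached_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cached_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Two different questions reuse the same cache; the context is not re-processed.
out_a = cross_attention([[1.0, 0.0]], cached_keys, cached_values)
out_b = cross_attention([[0.0, 1.0], [0.5, 0.5]], cached_keys, cached_values)
print(out_a)
print(out_b)
```

The cost of attending to the context here scales with (query length × context length) per call, and the context vectors are computed once, which is the property that makes caching attractive when many questions target the same reference text.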
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques
