XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference

João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian

arXiv:2404.15420·cs.CL·November 4, 2024

Open Access

TL;DR

This paper introduces XC-Cache, a method that conditions decoder-only language models on reference text through cross-attention instead of the prompt. On question answering, it outperforms traditional in-context learning while reducing cache space by two orders of magnitude.

Contribution

The paper proposes a cross-attention-based approach, inspired by encoder-decoder models, that conditions pre-trained decoder-only models on reference text without placing that text in the prompt. Only a small number of added layers are trained.
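
The conditioning mechanism named above can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention in plain Python. This is the generic textbook operation, not the paper's architecture: the function names, the vector shapes, and the absence of learned projections are assumptions made for illustration, and this page does not describe how the added layers are wired into the pre-trained decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context_keys, context_values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries:        one vector per decoder position (e.g. question tokens)
    context_keys:   one vector per cached reference-text position
    context_values: one vector per cached reference-text position
    Returns one output vector per query position.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every cached context position.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context_keys
        ]
        weights = softmax(scores)
        # Weighted average of the cached context values.
        out = [
            sum(w * v[j] for w, v in zip(weights, context_values))
            for j in range(len(context_values[0]))
        ]
        outputs.append(out)
    return outputs
```

In this sketch the context keys and values are computed once and reused for every query, which is the property that makes a precomputed context representation attractive; the queries come from the sequence being generated and never need the reference text to appear in the prompt.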

Findings

01. Outperforms traditional in-context learning in QA tasks

02. Reduces cache space by two orders of magnitude

03. Achieves performance comparable to fine-tuned models

Abstract

In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context isn't known in advance, caching ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they…
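
To make the space claim in the abstract concrete, here is a rough back-of-the-envelope sketch of how large a standard key-value cache gets. The model dimensions used below (32 layers, 32 key-value heads, head dimension 128, 16-bit values, 7 billion parameters) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size of a standard KV cache: one key and one value vector
    per layer, per head, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model configuration (assumed for illustration only).
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128
N_PARAMS = 7_000_000_000
BYTES_PER_VALUE = 2  # 16-bit floats

per_token = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, 1, BYTES_PER_VALUE)
param_bytes = N_PARAMS * BYTES_PER_VALUE

# Number of cached tokens at which the cache is as large as the weights.
tokens_to_match_weights = param_bytes // per_token

print(f"KV cache per token: {per_token} bytes")
print(f"Tokens needed to match parameter size: {tokens_to_match_weights}")
```

Under these assumed dimensions each cached token costs 512 KiB, so a cache holding roughly 27 thousand context tokens already occupies as much space as the weights themselves.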

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques