You Only Use Reactive Attention Slice For Long Context Retrieval
Yun Joon Soh, Hanxian Huang, Yuandong Tian, Jishen Zhao

TL;DR
This paper introduces YOURA, an attention-based retrieval method for long-context LLMs that uses a novel reaction score heuristic to improve inference throughput by up to 30% while keeping quality nearly identical to the truncate-middle baseline.
Contribution
The paper presents a new attention-based retrieval technique, YOURA, with a unique reaction score heuristic and embedding-agnostic sentence yield, addressing limitations of embedding-based retrieval for long contexts.
Findings
Achieves up to 30% inference throughput improvement.
Maintains nearly identical quality to the truncate-middle baseline.
Effective across multiple LLM models and datasets.
Abstract
Supporting longer contexts is a promising direction for advancing Large Language Models (LLMs). Because training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval, which falls short on long contexts. To address these challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called the reaction score to rank the relevance of each sentence in the input context to the query sentence. Intuitively, we measure how the per-token attention score "reacts" to the query and greedily retrieve the most reactive sentences. Internally, YOURA generates a token-indexed vector (called the reaction vector) for the whole input context. To map each…
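The ranking-and-greedy-selection idea in the abstract can be illustrated with a small sketch. The abstract shown here is truncated and does not define the reaction score, so everything below is an assumption made for illustration: the reaction vector is taken to be the per-token change in attention once the query is appended, a sentence's score is the mean reaction over its tokens, and selection stops at a token budget. The function names, the `sentence_spans` format, and `token_budget` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def reaction_vector(attn_without_query: List[float],
                    attn_with_query: List[float]) -> List[float]:
    """ASSUMPTION: 'reaction' = per-token attention change caused by the query."""
    return [w - b for b, w in zip(attn_without_query, attn_with_query)]


def greedy_retrieve(reaction: List[float],
                    sentence_spans: List[Tuple[int, int]],
                    token_budget: int) -> List[int]:
    """Rank sentences by mean reaction, then greedily keep those that fit.

    sentence_spans holds (start, end) token offsets, end exclusive.
    Returns the chosen sentence indices in original document order.
    """
    scored = []
    for idx, (start, end) in enumerate(sentence_spans):
        length = end - start
        if length <= 0:
            continue  # skip empty spans to avoid dividing by zero
        score = sum(reaction[start:end]) / length
        scored.append((score, idx, length))

    # Most reactive sentences first.
    scored.sort(key=lambda item: item[0], reverse=True)

    chosen, used = [], 0
    for _score, idx, length in scored:
        if used + length <= token_budget:
            chosen.append(idx)
            used += length
    return sorted(chosen)


# Toy example: 8 context tokens split into 3 sentences.
base = [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
with_query = [0.10, 0.12, 0.30, 0.35, 0.40, 0.11, 0.10, 0.20]
spans = [(0, 2), (2, 5), (5, 8)]

reaction = reaction_vector(base, with_query)
print(greedy_retrieve(reaction, spans, token_budget=6))  # [1, 2]
```

In this toy input the middle sentence reacts most strongly, so it is picked first; the remaining budget then decides which of the other sentences still fit. A real system would obtain the attention values from the model itself rather than from hand-written lists.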
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Multi-Head Attention · Linear Warmup With Linear Decay · Weight Decay · Adam · WordPiece
