Block-Attention for Efficient Prefilling

Dongyang Ma, Yan Wang, and Lan Tian

arXiv:2409.15355 · cs.LG · April 15, 2025

TL;DR

Block-Attention is a novel mechanism that reduces inference latency and computational costs in retrieval-augmented generation by dividing documents into blocks and reusing key-value states, maintaining performance while significantly improving efficiency.

Contribution

The paper introduces Block-Attention, a new attention mechanism that enables efficient inference in RAG scenarios by dividing documents into blocks and reusing KV states, with minimal performance loss.

Findings

1. Achieves up to a 98.7% reduction in time to first token (TTFT)
2. Reduces FLOPs by up to 99.8% compared to full-attention models
3. Maintains performance comparable to full-attention models after fine-tuning
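
The two headline reductions can be restated as speedup factors. The short derivation below is a back-of-the-envelope conversion of the figures above, not an additional result; the symbols $r$ and $S$ are introduced here for illustration only.

```latex
% r = fractional reduction, S = resulting speedup factor (notation assumed here)
S = \frac{1}{1 - r}
\qquad
S_{\mathrm{TTFT}} = \frac{1}{1 - 0.987} \approx 76.9
\qquad
S_{\mathrm{FLOPs}} = \frac{1}{1 - 0.998} = 500
```

In other words, a 98.7% TTFT reduction corresponds to roughly a 77x faster first token, and a 99.8% FLOPs reduction to 500x fewer operations, in the best reported case.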

Abstract

We introduce Block-attention, an attention mechanism designed to address the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios. Traditional approaches often encode the entire context in an auto-regressive manner. Instead, Block-attention divides retrieved documents into discrete blocks, with each block independently calculating key-value (KV) states except for the final block. In RAG scenarios, by defining each passage as a block, Block-attention enables us to reuse the KV states of passages that have been seen before, thereby significantly reducing the latency and the computation overhead during inference. The implementation of Block-attention involves block segmentation, position re-encoding, and fine-tuning the LLM to adapt to the Block-attention mechanism. Experiments on 11 diverse benchmarks, including RAG, ICL, and general domains, demonstrate…
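
The block segmentation described in the abstract can be pictured as an attention mask. The sketch below is a minimal illustration, not the authors' implementation: it assumes that every block except the last attends causally only within itself, and that the final block attends causally to the whole sequence. The function name `block_attention_mask` and the example block lengths are hypothetical.

```python
def block_attention_mask(block_lengths):
    """Return a square boolean mask; mask[q][k] is True if query q may attend to key k.

    Assumption: all blocks but the last are causal within themselves only,
    while the last block is causal over every preceding token.
    """
    total = sum(block_lengths)
    mask = [[False] * total for _ in range(total)]
    last = len(block_lengths) - 1
    start = 0
    for idx, length in enumerate(block_lengths):
        end = start + length
        # The final block may look back to position 0; others only to their own start.
        first_visible = 0 if idx == last else start
        for q in range(start, end):
            for k in range(first_visible, q + 1):
                mask[q][k] = True
        start = end
    return mask


# Two retrieved passages of 2 tokens each, followed by a 3-token final block.
mask = block_attention_mask([2, 2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the rows of a non-final block never reference another block, its KV states do not depend on which other passages were retrieved, which is what makes them reusable across requests.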

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Novel and practical idea
- Easy to follow
- Simplicity: The solution is relatively simple to implement and can be integrated into existing LLM architectures.
- Impressive performance: The method reduces Time to First Token (TTFT) by up to 98.7% and FLOPs by up to 99.8%.

Weaknesses

Need for fine-tuning: The method requires additional fine-tuning, which might be resource-intensive for larger models. It would be better to explore a training-free variant of the proposed method.

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The method is simple and easy to implement: It only involves limiting the attention computation to within each document in RAG scenarios and caching the KVs of each block for document reuse, as proposed in PromptCache [Gim et al., 2024]. The newly introduced RoPE rotation correction is also straightforward to implement.
- Local attention and prompt caching are sound techniques and suitable for RAG applications.
- The paper is clear and easy to follow.
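
The per-block caching mentioned in the first point can be sketched as a lookup keyed by passage content. This is an illustrative sketch only: `BlockKVCache`, `fake_encode`, and the SHA-256 keying are assumptions made for the example, and `fake_encode` is a stand-in for the forward pass that would produce real KV states.

```python
import hashlib


class BlockKVCache:
    """Hypothetical cache mapping a passage's text to its precomputed KV states."""

    def __init__(self, encode_fn):
        self._encode = encode_fn  # computes KV states for one block in isolation
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, passage):
        key = hashlib.sha256(passage.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode(passage)
        return self._store[key]


def fake_encode(passage):
    # Placeholder for a model forward pass; returns one dummy value per token.
    return [len(token) for token in passage.split()]


cache = BlockKVCache(fake_encode)
# Two requests whose retrieved passages overlap on "doc B".
for retrieved in (["doc A", "doc B"], ["doc B", "doc C"]):
    states = [cache.get(passage) for passage in retrieved]
```

Only the first occurrence of each passage triggers an encode call; the second request reuses the stored states for "doc B" and pays only for "doc C".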

Weaknesses

My primary concern is the paper's contribution relative to existing work, and the need for a stronger experimental analysis showing where this method stands among previous methods that share very similar ideas, or how it improves on them. The authors can find more detailed comments and suggestions below.

- Contribution and limitations compared to existing works: To me, the main contributions of the paper to improving the efficiency of RAG can be summarized as 1) limiting the attentio

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The idea is simple and reasonable: different retrieved passages are not necessarily related to each other, so the attention mask can be sparsified.
2. Positional re-encoding is an elegant way to solve inconsistencies in positional encodings.
3. Fine-tuning results show no loss of accuracy and a significant speedup over self-attention baselines.
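
The positional re-encoding praised in the second point relies on a property of RoPE: rotations compose additively, so a key encoded at position p can be moved to position p + offset by one further rotation. The sketch below checks that property numerically. It assumes the interleaved-pair RoPE layout and cached keys that were encoded as if their block started at position 0; `apply_rope` and `reencode_block` are hypothetical helper names, not functions from the paper's repository.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by position * base**(-i/dim)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def reencode_block(cached_keys, offset):
    """Shift already-rotated keys by `offset` positions with one extra rotation."""
    return [apply_rope(key, offset) for key in cached_keys]


raw_keys = [[0.5, -1.0, 2.0, 0.25], [1.5, 0.0, -0.75, 1.0], [-0.2, 0.9, 0.3, -1.1]]
# Cached as if the block occupied positions 0, 1, 2.
cached = [apply_rope(key, pos) for pos, key in enumerate(raw_keys)]
# Re-encode for a block that now starts at position 10.
shifted = reencode_block(cached, offset=10)
# Reference: encode the raw keys directly at positions 10, 11, 12.
direct = [apply_rope(key, pos + 10) for pos, key in enumerate(raw_keys)]
max_error = max(abs(a - b) for s, d in zip(shifted, direct) for a, b in zip(s, d))
```

Under these assumptions the shift costs one elementwise rotation per cached key, which is small next to recomputing the block's KV states from scratch.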

Weaknesses

1. The method requires fine-tuning, which limits its scalability.
2. The evaluation seems a bit weak.
   - For example, the method trains on the TQA and 2Wiki training sets and evaluates on their validation sets. The results on these two benchmarks are not zero-shot and are not very representative.
3. It is unclear what overhead is associated with positional re-encoding.
4. Can you construct some examples where different retrieved passages are related to each other? In this case, will the propo

Code & Models

Repositories

temporarylora/block-attention
PyTorch · Official

Models

Datasets

ldsjmdy/Tulu3-Block-FT-RAG
dataset · 17 downloads

Videos

Block-Attention for Efficient Prefilling · SlidesLive

Taxonomy

Topics: Fault Detection and Control Systems · CCD and CMOS Imaging Sensors · Target Tracking and Data Fusion in Sensor Networks

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Multi-Head Attention · Linear Warmup With Linear Decay · Weight Decay · Adam · WordPiece