APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding

Xinyu Yang; Tianqi Chen; Beidi Chen

arXiv:2502.05431 · cs.LG · February 13, 2025

TL;DR

The paper introduces Adaptive Parallel Encoding (APE), a method that speeds up context-augmented generation by pre-computing and caching each context's KV states independently, while retaining most of the accuracy of sequential encoding.

Contribution

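APE aligns the attention distribution of parallel encoding with that of sequential encoding, using a shared prefix and an attention temperature among its components. This enables faster and more scalable context-augmented generation with only a small performance loss.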

Findings

01. Achieves a 4.5× speedup in inference time.

02. Preserves 98% and 93% of sequential encoding performance on RAG and ICL tasks, respectively.

03. Scales to hundreds of contexts in parallel.

Abstract

Context-augmented generation (CAG) techniques, including RAG and ICL, require the efficient combination of multiple contexts to generate responses to user queries. Directly inputting these contexts as a sequence introduces a considerable computational burden by re-encoding the combined selection of contexts for every request. To address this, we explore the promising potential of parallel encoding to independently pre-compute and cache each context's KV states. This approach enables the direct loading of cached states during inference while accommodating more contexts through position reuse across contexts. However, due to misalignments in attention distribution, directly applying parallel encoding results in a significant performance drop. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding (APE), which brings shared prefix, attention temperature,…
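The caching idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_encode`, `KVCache`, and `attend` are hypothetical names, the "encoder" simply copies token embeddings, and there are no positional encodings, so position reuse and the alignment components of APE are not modeled. It only shows that each context is encoded once, cached, and then loaded and concatenated for every later query.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
# (keys, values), with one vector per context token
KVState = Tuple[List[Vector], List[Vector]]


def toy_encode(context: List[Vector]) -> KVState:
    """Hypothetical stand-in for a model's encoder: keys and values
    are plain copies of the token embeddings."""
    keys = [list(tok) for tok in context]
    values = [list(tok) for tok in context]
    return keys, values


class KVCache:
    """Caches the KV state of each context, encoded independently."""

    def __init__(self) -> None:
        self._store: Dict[str, KVState] = {}
        self.encode_calls = 0  # counts how often a context is actually encoded

    def get(self, context_id: str, context: List[Vector]) -> KVState:
        if context_id not in self._store:
            self._store[context_id] = toy_encode(context)
            self.encode_calls += 1
        return self._store[context_id]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, cached_states: List[KVState]) -> Vector:
    """Scaled dot-product attention over the concatenation of cached states."""
    keys = [k for ks, _ in cached_states for k in ks]
    values = [v for _, vs in cached_states for v in vs]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Two requests over the same contexts: each context is encoded only once.
cache = KVCache()
docs = {"doc_a": [[1.0, 0.0], [0.0, 1.0]], "doc_b": [[1.0, 1.0]]}
for request_query in ([1.0, 0.0], [0.0, 1.0]):
    states = [cache.get(cid, toks) for cid, toks in docs.items()]
    output = attend(request_query, states)
```

In a sequential setup the combined input would be re-encoded for each request. Here `cache.encode_calls` stays at the number of distinct contexts no matter how many queries arrive. The abstract notes that this naive form of parallel encoding loses accuracy because the attention distribution is misaligned, which is the gap APE targets.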

Code & Models

Repositories

infini-ai-lab/ape (official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Attention Dropout · Residual Connection · WordPiece · Linear Layer · Adam · Weight Decay · Dropout · Byte Pair Encoding