APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
Xinyu Yang, Tianqi Chen, Beidi Chen

TL;DR
The paper introduces Adaptive Parallel Encoding (APE), a method that speeds up context-augmented generation (a reported 4.5× at inference) by encoding contexts in parallel and caching their KV states, while preserving most of the performance of sequential encoding.
Contribution
APE aligns parallel encoding with sequential encoding using shared parameters, enabling faster and more scalable context-augmented generation with only a small performance loss.
Findings
Achieves 4.5× speedup in inference time.
Preserves 98% and 93% of sequential encoding performance on RAG and ICL tasks, respectively.
Scales to hundreds of contexts in parallel.
Abstract
Context-augmented generation (CAG) techniques, including RAG and ICL, require the efficient combination of multiple contexts to generate responses to user queries. Directly inputting these contexts as a sequence introduces a considerable computational burden by re-encoding the combined selection of contexts for every request. To address this, we explore the promising potential of parallel encoding to independently pre-compute and cache each context's KV states. This approach enables the direct loading of cached states during inference while accommodating more contexts through position reuse across contexts. However, due to misalignments in attention distribution, directly applying parallel encoding results in a significant performance drop. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding (APE), which brings shared prefix, attention temperature,…
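The parallel-encoding idea in the abstract (encode each context on its own, cache its KV states, reuse positions across contexts, then attend over the concatenated cache) can be illustrated with a toy single-layer sketch. This is not the authors' implementation: the projection matrices `W_K` and `W_V`, the hidden size `D`, the additive positional signal, and the helpers `encode_context` and `attend` are all hypothetical, and the `temperature` argument only shows where an attention temperature would enter the softmax, not how APE sets it.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                              # hypothetical hidden size
W_K = rng.normal(size=(D, D))      # hypothetical key projection
W_V = rng.normal(size=(D, D))      # hypothetical value projection

kv_cache = {}  # context id -> (K, V), computed once and reused across requests


def encode_context(ctx_id, tokens):
    """Encode one context in isolation and cache its KV states."""
    if ctx_id not in kv_cache:
        # Positions restart at 0 for every context (position reuse).
        positions = np.arange(len(tokens))[:, None]
        x = tokens + 0.01 * positions   # toy positional signal (assumption)
        kv_cache[ctx_id] = (x @ W_K, x @ W_V)
    return kv_cache[ctx_id]


def attend(query, ctx_ids, temperature=1.0):
    """Attend from one query vector over the concatenated cached KV states."""
    K = np.concatenate([kv_cache[c][0] for c in ctx_ids])
    V = np.concatenate([kv_cache[c][1] for c in ctx_ids])
    scores = (K @ query) / (np.sqrt(D) * temperature)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ V, weights


# Usage: pre-compute two contexts, then answer a query from the cache alone.
contexts = {"a": rng.normal(size=(3, D)), "b": rng.normal(size=(5, D))}
for cid, toks in contexts.items():
    encode_context(cid, toks)

q = rng.normal(size=D)
out, w = attend(q, ["a", "b"])
out_sharp, w_sharp = attend(q, ["a", "b"], temperature=0.5)
```

Because each context is encoded without seeing the others, its cached states can be reused for any later combination of contexts. The cost is that the resulting attention distribution differs from what sequential encoding would produce, which is the misalignment the abstract says APE is designed to correct.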
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Attention Dropout · Residual Connection · WordPiece · Linear Layer · Adam · Weight Decay · Dropout · Byte Pair Encoding
