FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference

Runheng Liu; Xingchen Xiao; Heyan Huang; Zewen Chi; Zhijing Wu

arXiv:2405.04065·cs.CL·June 16, 2025

TL;DR

FlashBack is a retrieval-augmented language model that appends retrieved documents at the end of the context instead of prepending them. This lets it reuse the KV cache, which significantly speeds up inference while maintaining decent generation quality.

Contribution

The paper introduces an appending context pattern with marking tokens, which allows the KV cache to be used more efficiently during RALM inference.

Findings

01 Up to 4x faster inference on a 7B LLM.

02 Maintains decent generation quality, as measured by perplexity.

03 Significantly reduces inference cost.

Abstract

Retrieval-Augmented Language Modeling (RALM), which integrates large language models (LLMs) with relevant documents from an external corpus, is a proven method for enabling an LLM to generate information beyond the scope of its pre-training corpus. Previous work uses retrieved content by simply prepending it to the input, which incurs a high runtime cost and degrades inference efficiency because the Key-Value (KV) cache cannot be used efficiently. In this paper, we propose FlashBack, a modular RALM designed to improve inference efficiency with an appending context pattern while maintaining decent performance after fine-tuning with Low-Rank Adaptation. Instead of prepending retrieved documents, FlashBack appends them at the end of the context so that the KV cache can be used efficiently. We also introduce Marking Tokens, two special prompt tokens for marking the boundary of…
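The KV cache argument in the abstract can be illustrated with a toy count of recomputed tokens. The sketch below is not the authors' implementation: it assumes a causal LM whose cached keys and values for a position stay valid only while every earlier token is unchanged, that a new document is retrieved every `stride` context tokens, and it ignores the Marking Tokens and fine-tuning entirely. The names `simulate`, `tokens_to_recompute`, and `stride` are hypothetical and chosen for illustration.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_to_recompute(cached, new):
    """Tokens in `new` whose K/V cannot be reused from `cached`.

    Assumption: in a causal LM, the cache entry at a position is valid
    only if all tokens up to and including that position are unchanged.
    """
    return len(new) - common_prefix_len(cached, new)


def simulate(pattern, context, docs, stride):
    """Total tokens encoded when the retrieved document changes each step.

    pattern: "prepend" places the document before the context,
             "append" places it after the context.
    """
    cached, total = [], 0
    for step, doc in enumerate(docs):
        visible = context[: (step + 1) * stride]  # context seen so far
        seq = doc + visible if pattern == "prepend" else visible + doc
        total += tokens_to_recompute(cached, seq)
        cached = seq
    return total


# Toy data: 12 context tokens, a new 3-token document every 4 tokens.
context = list(range(100, 112))
docs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print("prepend:", simulate("prepend", context, docs, stride=4))
print("append: ", simulate("append", context, docs, stride=4))
```

In this toy setting, prepending changes the first token at every retrieval step, so the whole sequence is re-encoded each time, while appending only re-encodes the new context tokens and the new document. The gap widens as the context grows, which is consistent with the long-context motivation stated in the abstract.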
