Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi

TL;DR
Superposition prompting is a novel method that enables large language models to process multiple input paths in parallel, reducing inference costs and improving accuracy in retrieval-augmented generation without fine-tuning.
Contribution
The paper introduces superposition prompting, a RAG technique that improves the efficiency and accuracy of pre-trained LLMs by processing prompt paths in parallel, without fine-tuning.
Findings
93x reduction in compute time on the NaturalQuestions-Open dataset
43% accuracy improvement over naive RAG with MPT-7B
Effective across various question-answering benchmarks
Abstract
Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive for deployment in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs also exhibit the "distraction phenomenon", where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, *superposition prompting*, which can be directly applied to pre-trained transformer-based LLMs *without the need for fine-tuning*. At a high level, superposition prompting allows the LLM to process input documents in parallel *prompt paths*, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously…
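The quadratic-cost argument in the abstract can be made concrete with a toy count. The sketch below is not the paper's implementation: it assumes that cost is proportional to the number of token pairs that full self-attention touches, ignores the shared preamble, the query, caching, and path pruning, and uses hypothetical function names and document lengths chosen only for illustration. It compares one concatenated prompt against documents processed as independent paths.

```python
# Toy cost model (an assumption for illustration, not the authors' method):
# full self-attention over n tokens touches n * n ordered token pairs.

def attention_pairs(num_tokens: int) -> int:
    """Number of ordered token pairs attended over for a sequence of num_tokens."""
    return num_tokens * num_tokens


def concatenated_cost(doc_lengths: list[int]) -> int:
    """Cost when all documents are concatenated into one long prompt."""
    return attention_pairs(sum(doc_lengths))


def parallel_paths_cost(doc_lengths: list[int]) -> int:
    """Cost when each document is processed as its own independent path."""
    return sum(attention_pairs(n) for n in doc_lengths)


# Hypothetical workload: 20 retrieved documents of 200 tokens each.
doc_lengths = [200] * 20

print(concatenated_cost(doc_lengths))    # 16000000
print(parallel_paths_cost(doc_lengths))  # 800000
```

Under these assumptions, k equal-length documents cost k times less when processed as separate paths than when concatenated (here, a factor of 20). This toy ratio only illustrates why splitting a long context helps; it is not meant to reproduce the speedup reported in the paper.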
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling
Methods: Attention Is All You Need · WordPiece · Byte Pair Encoding · Linear Layer · Layer Normalization · Weight Decay · Dense Connections · Attention Dropout · Residual Connection
