METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang

TL;DR
METIS is a novel RAG system that dynamically schedules queries and adapts configurations to optimize the tradeoff between response quality and delay, significantly reducing latency without quality loss.
Contribution
METIS introduces the first approach for RAG systems that jointly schedules queries and adapts per-query configurations to balance response quality against response delay.
Findings
Reduces RAG generation latency by 1.64-2.54× compared with state-of-the-art RAG optimization schemes.
Achieves these latency reductions without loss of response quality.
Evaluated on 4 popular RAG-QA datasets.
Abstract
RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, METIS reduces the generation latency…
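As a rough illustration of the quality-delay tradeoff the abstract describes, the sketch below picks, for a single query, the configuration with the highest estimated quality that still fits a delay budget. It is not METIS's algorithm and covers only the configuration half of the problem, not scheduling. The names `RagConfig`, `Estimate`, and `pick_config`, the synthesis labels, and all quality and delay numbers are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RagConfig:
    """One candidate configuration for a query (hypothetical structure)."""
    num_chunks: int   # number of retrieved text chunks
    synthesis: str    # synthesis method label, e.g. "stuff" or "map_reduce" (illustrative)


@dataclass(frozen=True)
class Estimate:
    """Assumed per-query estimate of what a configuration would cost and deliver."""
    config: RagConfig
    quality: float    # estimated quality score, higher is better (made-up scale)
    delay_s: float    # estimated response delay in seconds (made-up values)


def pick_config(estimates: Sequence[Estimate], delay_budget_s: float) -> Optional[RagConfig]:
    """Return the highest-quality config within the delay budget.

    If nothing fits the budget, fall back to the fastest config.
    Returns None when there are no candidates at all.
    """
    if not estimates:
        return None
    feasible = [e for e in estimates if e.delay_s <= delay_budget_s]
    if not feasible:
        return min(estimates, key=lambda e: e.delay_s).config
    # Prefer higher quality; break ties in favor of lower delay.
    return max(feasible, key=lambda e: (e.quality, -e.delay_s)).config


# Illustrative candidates: more chunks or heavier synthesis raise both quality and delay.
candidates = [
    Estimate(RagConfig(2, "stuff"), quality=0.70, delay_s=0.8),
    Estimate(RagConfig(8, "stuff"), quality=0.82, delay_s=1.9),
    Estimate(RagConfig(8, "map_reduce"), quality=0.88, delay_s=3.5),
]
chosen = pick_config(candidates, delay_budget_s=2.0)
```

With a 2.0-second budget the third candidate is excluded for being too slow, so the selection falls to the 8-chunk "stuff" configuration. A system like the one described would additionally have to account for how queries share serving resources, which this per-query sketch ignores.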
Taxonomy
Topics: Embedded Systems Design Techniques · Interconnection Networks and Systems · Fault Detection and Control Systems
Methods: Linear Layer · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay
