Patchwork: A Unified Framework for RAG Serving
Bodun Hu, Luis Pabon, Saurabh Agarwal, Aditya Akella

TL;DR
Patchwork is a comprehensive framework that optimizes the deployment and management of Retrieval Augmented Generation systems, significantly improving efficiency, scalability, and reliability through flexible design and dynamic scheduling.
Contribution
It introduces a unified, end-to-end RAG serving framework with customizable pipelines, distributed inference deployment, and online scheduling for improved performance and reliability.
Findings
Achieves over 48% throughput improvement
Reduces SLO violations by approximately 24%
Demonstrates effectiveness across four RAG implementations
Abstract
Retrieval Augmented Generation (RAG) has emerged as a new paradigm for enhancing Large Language Model reliability through integration with external knowledge sources. However, efficient deployment of these systems presents significant technical challenges due to their inherently heterogeneous computational pipelines comprising LLMs, databases, and specialized processing components. We introduce Patchwork, a comprehensive end-to-end RAG serving framework designed to address these efficiency bottlenecks. Patchwork's architecture offers three key innovations: First, it provides a flexible specification interface enabling users to implement custom RAG pipelines. Second, it deploys these pipelines as distributed inference systems while optimizing for the unique scalability characteristics of individual RAG components. Third, Patchwork incorporates an online scheduling mechanism that…
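To make the first innovation more concrete, a specification interface for custom RAG pipelines could look roughly like the sketch below. This is a hypothetical illustration only, not Patchwork's actual API: the `Stage` and `Pipeline` classes, the `retrieve` and `generate` functions, and the toy `DOCS` corpus are all assumptions introduced here. The retriever is a simple keyword-overlap ranker standing in for a vector database, and the generator is a string template standing in for an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical names throughout; none of this is taken from Patchwork itself.


@dataclass
class Stage:
    """One named step of a pipeline: a function from state dict to state dict."""
    name: str
    fn: Callable[[dict], dict]


@dataclass
class Pipeline:
    """An ordered list of stages that are run one after another."""
    stages: List[Stage] = field(default_factory=list)

    def add(self, name: str, fn: Callable[[dict], dict]) -> "Pipeline":
        self.stages.append(Stage(name, fn))
        return self  # returning self allows chained .add(...) calls

    def run(self, query: str) -> dict:
        state = {"query": query}
        for stage in self.stages:
            state = stage.fn(state)
        return state


# Toy corpus standing in for an external knowledge source.
DOCS = [
    "RAG pairs an LLM with a retriever.",
    "Schedulers assign work to replicas.",
    "Vector databases store embeddings.",
]


def retrieve(state: dict) -> dict:
    """Rank documents by word overlap with the query and keep the top two."""
    terms = set(state["query"].lower().split())

    def overlap(doc: str) -> int:
        return len(terms & set(doc.lower().rstrip(".").split()))

    ranked = sorted(DOCS, key=lambda d: -overlap(d))
    return {**state, "docs": ranked[:2]}


def generate(state: dict) -> dict:
    """Placeholder for an LLM call: echo the best retrieved document."""
    docs = state["docs"]
    return {**state, "answer": f"Based on {len(docs)} documents: {docs[0]}"}


pipeline = Pipeline().add("retrieve", retrieve).add("generate", generate)
result = pipeline.run("what is a retriever")
print(result["answer"])
```

Keeping each stage as an independent function is what would let a serving system place, replicate, and schedule the components separately, which is the property the abstract's second and third points rely on. How Patchwork itself expresses this is not stated in the text above.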
Taxonomy
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · WordPiece · Linear Layer · Weight Decay
