SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering
Michael Iannelli, Sneha Kuchipudi, and Vera Dvorak

TL;DR
This paper presents a systems approach to managing multi-agent Retrieval Augmented Generation (RAG) systems for question answering, enabling dynamic reconfiguration to meet diverse SLAs involving quality, cost, and latency.
Contribution
It introduces a method to incorporate non-functional requirements into multi-agent RAG systems, allowing for real-time reconfiguration to optimize performance under resource constraints.
Findings
The system effectively manages trade-offs between answer quality and cost.
Dynamic re-orchestration improves SLA compliance in QA systems.
The system adapts to different query types and operational conditions.
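As an illustration of what SLA-driven reconfiguration could look like, here is a minimal Python sketch. It is not the authors' method: the `Config` and `SLA` types, the `select_config` function, the configuration names, and all numbers are hypothetical placeholders invented for this example. It assumes each candidate orchestration has pre-estimated quality, cost, and latency, and picks the cheapest one that satisfies the SLA.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """A hypothetical orchestration option with estimated characteristics."""
    name: str
    quality: float    # estimated answer quality in [0, 1]
    cost: float       # estimated cost per query (arbitrary units)
    latency_s: float  # estimated latency per query, in seconds


@dataclass(frozen=True)
class SLA:
    """Hypothetical SLA: a quality floor plus cost and latency ceilings."""
    min_quality: float
    max_cost: float
    max_latency_s: float


def select_config(configs: Sequence[Config], sla: SLA) -> Optional[Config]:
    """Return the cheapest configuration that meets the SLA, or None."""
    feasible = [
        c for c in configs
        if c.quality >= sla.min_quality
        and c.cost <= sla.max_cost
        and c.latency_s <= sla.max_latency_s
    ]
    if not feasible:
        return None
    # Prefer lower cost; break ties by lower latency.
    return min(feasible, key=lambda c: (c.cost, c.latency_s))


# Illustrative values only; these are not results from the paper.
CONFIGS = [
    Config("single_agent", quality=0.70, cost=1.0, latency_s=0.8),
    Config("three_agents", quality=0.80, cost=3.0, latency_s=1.5),
    Config("five_agents_with_arbiter", quality=0.88, cost=6.0, latency_s=2.5),
]

if __name__ == "__main__":
    sla = SLA(min_quality=0.75, max_cost=5.0, max_latency_s=2.0)
    chosen = select_config(CONFIGS, sla)
    print(chosen.name if chosen else "no feasible configuration")
```

Re-running `select_config` whenever the SLA or the estimates change is one simple way to model re-orchestration; a real system would also need to decide what to do when no configuration is feasible.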
Abstract
Retrieval Augmented Generation (RAG) enables Large Language Models (LLMs) to generalize to new information by decoupling reasoning capabilities from static knowledge bases. Traditional RAG enhancements have explored vertical scaling (assigning subtasks to specialized modules) and horizontal scaling (replicating tasks across multiple agents) to improve performance. However, real-world applications impose diverse Service Level Agreements (SLAs) and Quality of Service (QoS) requirements, involving trade-offs among objectives such as reducing cost, ensuring answer quality, and adhering to specific operational constraints. In this work, we present a systems-oriented approach to multi-agent RAG tailored for real-world Question Answering (QA) applications. By integrating task-specific non-functional requirements, such as answer quality, cost, and latency, into the system, we enable dynamic…
Taxonomy
Topics: Service-Oriented Architecture and Web Services · Semantic Web and Ontologies · Speech and dialogue systems
Methods: Linear Layer · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Residual Connection · Multi-Head Attention · Weight Decay · WordPiece
