SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering

Michael Iannelli, Sneha Kuchipudi, and Vera Dvorak

arXiv:2412.06832 · cs.SE · April 30, 2025

TL;DR

This paper presents a systems approach to managing multi-agent Retrieval Augmented Generation (RAG) systems for question answering, enabling dynamic reconfiguration to meet diverse SLAs involving quality, cost, and latency.

Contribution

It introduces a method to incorporate non-functional requirements into multi-agent RAG systems, allowing for real-time reconfiguration to optimize performance under resource constraints.
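
The idea of reconfiguring under an SLA can be sketched as a selection problem: given several candidate system configurations with estimated quality, cost, and latency, pick the cheapest one that meets the requirements. The following is a minimal illustration under stated assumptions, not the authors' method. The names `Config`, `SLA`, and `select_config`, the cheapest-feasible rule, and all numeric values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """One hypothetical system configuration with estimated per-query metrics."""
    name: str
    num_agents: int    # horizontal scaling: number of replicated agents
    quality: float     # estimated answer quality in [0, 1]
    cost_usd: float    # estimated cost per query
    latency_s: float   # estimated latency per query


@dataclass(frozen=True)
class SLA:
    """Hypothetical non-functional requirements for one class of queries."""
    min_quality: float
    max_cost_usd: float
    max_latency_s: float


def satisfies(cfg: Config, sla: SLA) -> bool:
    """True if the configuration meets every bound in the SLA."""
    return (cfg.quality >= sla.min_quality
            and cfg.cost_usd <= sla.max_cost_usd
            and cfg.latency_s <= sla.max_latency_s)


def select_config(configs: Sequence[Config], sla: SLA) -> Optional[Config]:
    """Return the cheapest configuration meeting the SLA, or None if none does."""
    feasible = [c for c in configs if satisfies(c, sla)]
    return min(feasible, key=lambda c: c.cost_usd, default=None)


# Illustrative numbers only; they are not measurements from the paper.
CONFIGS = [
    Config("single-agent", 1, quality=0.70, cost_usd=0.002, latency_s=1.0),
    Config("three-agents", 3, quality=0.80, cost_usd=0.006, latency_s=1.8),
    Config("five-agents", 5, quality=0.85, cost_usd=0.010, latency_s=2.5),
]

if __name__ == "__main__":
    strict = SLA(min_quality=0.80, max_cost_usd=0.02, max_latency_s=3.0)
    print(select_config(CONFIGS, strict))
```

Returning `None` when no configuration is feasible leaves the fallback policy (for example, relaxing a bound or rejecting the query) to the caller, which is one possible design choice among several.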

Findings

01. Effective management of trade-offs between answer quality and cost.

02. Dynamic re-orchestration improves SLA compliance in QA systems.

03. The system demonstrates adaptability to different query types and operational conditions.

Abstract

Retrieval Augmented Generation (RAG) enables Large Language Models (LLMs) to generalize to new information by decoupling reasoning capabilities from static knowledge bases. Traditional RAG enhancements have explored vertical scaling (assigning subtasks to specialized modules) and horizontal scaling (replicating tasks across multiple agents) to improve performance. However, real-world applications impose diverse Service Level Agreements (SLAs) and Quality of Service (QoS) requirements, involving trade-offs among objectives such as reducing cost, ensuring answer quality, and adhering to specific operational constraints. In this work, we present a systems-oriented approach to multi-agent RAG tailored for real-world Question Answering (QA) applications. By integrating task-specific non-functional requirements, such as answer quality, cost, and latency, into the system, we enable dynamic…

Taxonomy

Topics: Service-Oriented Architecture and Web Services · Semantic Web and Ontologies · Speech and dialogue systems

Methods: Linear Layer · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Residual Connection · Multi-Head Attention · Weight Decay · WordPiece