CALM: A Self-Adaptive Orchestration Approach for QoS-Aware Routing in Small Language Model based Systems
Hemang Jain, Divyansh Pandey, Karthik Vaidhyanathan

TL;DR
CALM is a self-adaptive orchestration system that dynamically manages a fleet of Small Language Models (SLMs) to optimize QoS metrics such as latency and energy consumption in real time, addressing runtime uncertainties in LLM-based systems.
Contribution
This paper introduces CALM, a novel self-adaptive framework based on MAPE-K (Monitor, Analyze, Plan, Execute over shared Knowledge) for orchestrating multiple SLMs to improve QoS in LLM-enabled systems.
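To make the MAPE-K idea concrete, here is a minimal sketch of one adaptation pass over a small fleet of models. It is not the authors' implementation: the model names (`slm-a`, `slm-b`), the 200 ms latency threshold, and the "switch to the lowest mean latency" policy are all assumptions chosen for illustration, and a real system would also weigh energy use and response quality.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Knowledge:
    """Shared state (the K in MAPE-K): latency history and the active model."""
    latency_ms: dict = field(default_factory=dict)  # model name -> list of samples
    active: str = "slm-a"                           # hypothetical model name


def monitor(k: Knowledge, model: str, sample_ms: float) -> None:
    """Monitor: record one observed latency sample for a model."""
    k.latency_ms.setdefault(model, []).append(sample_ms)


def analyze(k: Knowledge, threshold_ms: float) -> bool:
    """Analyze: is the active model's mean latency above the threshold?"""
    samples = k.latency_ms.get(k.active, [])
    return bool(samples) and mean(samples) > threshold_ms


def plan(k: Knowledge) -> str:
    """Plan: pick the model with the lowest mean observed latency (assumed policy)."""
    return min(k.latency_ms, key=lambda m: mean(k.latency_ms[m]))


def execute(k: Knowledge, target: str) -> None:
    """Execute: switch routing to the planned model."""
    k.active = target


def mape_k_step(k: Knowledge, threshold_ms: float = 200.0) -> str:
    """Run one Analyze-Plan-Execute pass over the monitored data."""
    if analyze(k, threshold_ms):
        execute(k, plan(k))
    return k.active
```

In this sketch, calling `monitor` after each request and `mape_k_step` on a timer would keep routing on the current model until its mean latency crosses the threshold, then move traffic to the fastest observed alternative.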
Findings
Reduces latency by approximately 40%
Lowers energy consumption by 50%
Maintains task performance with dynamic SLM management
Abstract
AI-enabled systems are subject to various runtime uncertainties, such as dynamic workloads, changing resource requirements, and model drift. These uncertainties significantly affect the overall Quality of Service (QoS). This is particularly true of Language Model (LM) enabled systems, where the autoregressive nature of token generation introduces variability in latency, energy usage, and response quality. These systems, powered by LLMs, are either resource-intensive (if run on-prem) or raise privacy and cost concerns (if accessed through APIs). While deploying a Small Language Model (SLM) can be resource-efficient, it often falls short in addressing the diversity and scale of real-world requirements. To this end, we argue that, rather than relying on any one SLM, leveraging a coordinated fleet of SLMs, each with specialized strengths, can enable systems to dynamically adapt to…
Taxonomy
Topics: Software System Performance and Reliability · Scientific Computing and Data Management · Natural Language Processing Techniques
