iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, Neeraja J. Yadwadkar

TL;DR
iServe is an automated system that dynamically configures large language model deployments based on user intents, significantly improving performance and reducing profiling costs through efficient fingerprint-based estimations.
Contribution
The paper introduces iServe, a novel intent-based LLM serving system that automatically optimizes configurations using lightweight fingerprints, outperforming static approaches.
Findings
Reduces latency by 77.62%
Cuts profiling cost by 6.05x
Improves GPU throughput by 4.72x
Abstract
Large Language Models (LLMs) are becoming ubiquitous across industries, where applications demand they fulfill diverse user intents. However, developers currently face the challenge of manually exploring numerous deployment configurations - combinations of parallelism and compression techniques that impact resource usage, latency, cost, and accuracy - to meet these intents. Assessing the impact of these configurations on user metrics requires extensive, costly profiling for each model. Existing approaches avoid this expense by using fixed, static configurations, but this often leads to sub-optimal performance and higher costs. Moreover, none of these solutions dynamically adapts to changing user intents to balance latency and cost effectively. We present iServe, an automated, intent-based system for distributed LLM inference. Instead of manually selecting deployment configurations,…
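To make the idea of intent-based configuration selection concrete, here is a minimal Python sketch. It is not iServe's algorithm: the abstract says iServe relies on lightweight fingerprints, whose details are not given here. Every name (`Config`, `estimate`, `select`) is hypothetical, the metric formulas and numbers are invented stand-ins for profiled or estimated values, and accuracy is ignored for brevity. The sketch only shows how a stated intent (minimize latency or minimize cost) could pick a different combination of parallelism and compression from the same candidate set.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    """Hypothetical deployment configuration: parallelism plus compression."""
    tensor_parallel: int
    pipeline_parallel: int
    quantization: str  # "fp16" or "int8"

    @property
    def gpus(self) -> int:
        return self.tensor_parallel * self.pipeline_parallel


def estimate(cfg: Config) -> dict:
    """Toy metric model. The formulas are invented, not measured."""
    # Assumed sublinear speedup from tensor parallelism, plus a flat
    # assumed gain from int8 quantization.
    speedup = cfg.tensor_parallel ** 0.8
    if cfg.quantization == "int8":
        speedup *= 1.5
    # Assumed fixed per-stage overhead for pipeline parallelism.
    latency_ms = 100.0 / speedup + 5.0 * (cfg.pipeline_parallel - 1)
    # Cost proxy: GPU-milliseconds consumed per request.
    cost = cfg.gpus * latency_ms
    return {"latency": latency_ms, "cost": cost}


def select(intent: str, max_gpus: int) -> Config:
    """Return the feasible configuration that minimizes the intent metric."""
    candidates = [
        Config(tp, pp, q)
        for tp, pp, q in product((1, 2, 4), (1, 2), ("fp16", "int8"))
    ]
    feasible = [c for c in candidates if c.gpus <= max_gpus]
    return min(feasible, key=lambda c: estimate(c)[intent])


if __name__ == "__main__":
    print(select("latency", max_gpus=4))
    print(select("cost", max_gpus=4))
```

Under these invented numbers, a latency intent selects the widest tensor-parallel configuration that fits the GPU budget, while a cost intent selects a single quantized GPU. That is the kind of trade-off a fixed, static configuration cannot follow as intents change.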
Taxonomy
Topics: Service-Oriented Architecture and Web Services · Semantic Web and Ontologies
Methods: ALIGN
