iServe: An Intent-based Serving System for LLMs

Dimitrios Liakopoulos; Tianrui Hu; Prasoon Sinha; Neeraja J. Yadwadkar

arXiv:2501.13111 · cs.SE · January 24, 2025

TL;DR

iServe is an automated system that dynamically configures large language model deployments based on user intents, significantly improving performance and reducing profiling costs through efficient fingerprint-based estimations.

Contribution

The paper introduces iServe, a novel intent-based LLM serving system that automatically optimizes configurations using lightweight fingerprints, outperforming static approaches.

Findings

01. Reduces latency by 77.62%
02. Cuts profiling cost by 6.05x
03. Improves GPU throughput by 4.72x

Abstract

Large Language Models (LLMs) are becoming ubiquitous across industries, where applications demand they fulfill diverse user intents. However, developers currently face the challenge of manually exploring numerous deployment configurations - combinations of parallelism and compression techniques that impact resource usage, latency, cost, and accuracy - to meet these intents. Assessing the impact of these configurations on user metrics requires extensive, costly profiling for each model. Existing approaches avoid this expense by using fixed, static configurations, but this often leads to sub-optimal performance and higher costs. Moreover, none of these solutions dynamically adapt to changing user intents to balance latency and cost effectively. We present iServe, an automated, intent-based system for distributed LLM inference. Instead of manually selecting deployment configurations,…

Taxonomy

Topics: Service-Oriented Architecture and Web Services · Semantic Web and Ontologies

Methods: ALIGN