Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

Yuting Yang; Tiancheng Yuan; Jamal Hashim; Thiago Garrett; Jeffrey Qian; Ann Zhang; Yifan Wang; Weijia Song; Ken Birman

arXiv:2511.02062·cs.DB·November 5, 2025

Open Access

TL;DR

Vortex is an ML inference and knowledge retrieval platform designed to meet strict latency and throughput requirements. On identical tasks it outperforms TorchServe and Ray Serve, and its advantage grows when RDMA is available.

Contribution

Vortex introduces an SLO-first approach to ML serving that achieves lower, more stable latencies and higher request rates than platforms that rely on batching for throughput.

Findings

01. Vortex achieves significantly lower and more stable latencies than TorchServe and Ray Serve on identical tasks.

02. Vortex often meets a given SLO target at more than twice the request rate.

03. When RDMA is available, the Vortex advantage is even larger.

Abstract

There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
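To make the abstract's point about batching and tail latency concrete, the following is a minimal toy model, not Vortex's design and not how TorchServe or Ray Serve are implemented. It assumes a single server that starts work only once a full batch has arrived, so early arrivals in each batch wait for later ones. The function names (`batch_latencies`, `percentile`) and every number here (10 ms arrival spacing, batch sizes, service times) are hypothetical and chosen only for illustration.

```python
import math


def batch_latencies(arrivals, batch_size, service_time):
    """Per-request latency (ms) under a toy fill-the-batch policy.

    Assumption: a batch starts only when its last request has arrived
    and the server is idle; all requests in a batch finish together.
    """
    latencies = []
    server_free_at = 0
    for i in range(0, len(arrivals), batch_size):
        batch = arrivals[i:i + batch_size]
        start = max(batch[-1], server_free_at)  # wait for the batch to fill
        finish = start + service_time
        server_free_at = finish
        latencies.extend(finish - t for t in batch)
    return latencies


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical workload: 96 requests, one every 10 ms.
arrivals = [10 * i for i in range(96)]

# Assumed costs: 5 ms per single request, 8 ms per batch of 8.
unbatched = batch_latencies(arrivals, batch_size=1, service_time=5)
batched = batch_latencies(arrivals, batch_size=8, service_time=8)

print("p50 / p99 unbatched:", percentile(unbatched, 50), percentile(unbatched, 99))
print("p50 / p99 batch of 8:", percentile(batched, 50), percentile(batched, 99))
```

In this toy setting the batched server spends less compute per request (8 ms per 8 requests instead of 5 ms each), yet the first request in every batch waits 70 ms for the batch to fill, which dominates the tail. Real serving systems typically cap that wait with a timeout, so the model only illustrates the direction of the trade-off, not its size.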

Taxonomy

Topics: Scientific Computing and Data Management · Cloud Computing and Resource Management · Advanced Database Systems and Queries