The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Huamin Chen; Xunzhuo Liu; Bowei He; Fuyuan Lyu; Yankai Chen; Xue Liu; Yuhan Liu; Junchen Jiang

arXiv:2603.21354·cs.LG·April 10, 2026

TL;DR

This paper introduces the Workload-Router-Pool (WRP) architecture, a comprehensive framework for optimizing large language model inference by characterizing workloads, dispatching requests, and selecting inference pools.
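The three stages named above (characterize the workload, dispatch the request, select a pool) can be pictured with a minimal sketch. Everything in it is a hypothetical illustration and not the project's API: the `Pool` and `Workload` types, the pool names and context limits, and the whitespace-based token estimate are all assumptions. The routing rule is a simplified stand-in for the context-length pool routing mentioned in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical inference pool, described only by its context limit."""
    name: str
    max_context_tokens: int


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload signal extracted from one request."""
    prompt_tokens: int
    max_output_tokens: int


def characterize(prompt: str, max_output_tokens: int) -> Workload:
    # Assumption: a whitespace split stands in for a real tokenizer.
    return Workload(prompt_tokens=len(prompt.split()),
                    max_output_tokens=max_output_tokens)


def route(workload: Workload, pools: list[Pool]) -> Pool:
    """Pick the smallest pool whose context window fits the request."""
    needed = workload.prompt_tokens + workload.max_output_tokens
    candidates = [p for p in pools if p.max_context_tokens >= needed]
    if not candidates:
        raise ValueError(f"no pool can serve {needed} tokens")
    return min(candidates, key=lambda p: p.max_context_tokens)


# Hypothetical fleet with two pools of made-up sizes.
POOLS = [Pool("short-context", 4096), Pool("long-context", 32768)]
```

Calling `route(characterize(prompt, 100), POOLS)` returns the short-context pool for a brief prompt and the long-context pool once the estimated total exceeds 4096 tokens. A real router would weigh further signals, such as the safety, caching, and feedback signals the abstract lists.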

Contribution

The paper formalizes the WRP architecture, maps prior research onto a 3x3 matrix, and proposes twenty-one research directions for LLM inference optimization.

Findings

01 Mapped prior work onto a 3x3 WRP matrix

02 Identified open research directions at each intersection

03 Tiered research directions from engineering to open research

Abstract

Over the past year, the vLLM Semantic Router project has released a series of papers spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the…
