98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Xunzhuo Liu; Bowei He; Xue Liu; Andy Luo; Haichen Zhang; Huamin Chen

arXiv:2603.12646 · cs.CL · March 16, 2026

Open Access

TL;DR

This paper presents three staged optimizations for the vLLM Semantic Router that together give a 98× speedup over the baseline and reduce attention memory from O(n²) to O(n), enabling long-context classification (8K–32K tokens) without a dedicated GPU.

Contribution

The paper presents three staged optimizations for LLM routing: a custom CK Flash Attention operator for ONNX Runtime on ROCm, prompt compression, and near-streaming processing. Together they reduce both routing latency and attention memory.

Findings

01

38.7× latency reduction with Flash Attention

02

Prompt compression reduces input tokens to 512

03

Total 98× speedup from baseline

Abstract

System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU, an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^{2})$ memory makes long-context classification (8K–32K tokens) impossible: at 8K tokens, three concurrent classifiers need $\sim$4.5 GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. Stage 1: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^{2})$ to $O(n)$ and…
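The quadratic memory claim in the abstract can be made concrete with a rough back-of-the-envelope sketch. The abstract does not say how the 4.5 GB figure is derived, so the head count (12) and element size (2 bytes, as in fp16) below are assumptions chosen for illustration; they are not taken from the paper, and its actual accounting may differ. The helper name `dense_attention_bytes` is hypothetical.

```python
GIB = 1024 ** 3


def dense_attention_bytes(seq_len, num_heads, bytes_per_element, num_models=1):
    """Bytes needed to hold one dense seq_len x seq_len tensor per head, per model."""
    return num_models * num_heads * seq_len * seq_len * bytes_per_element


# ASSUMPTIONS (not from the paper): 12 attention heads, 2-byte elements.
# The three concurrent classifiers are stated in the abstract.
ASSUMED_HEADS = 12
ASSUMED_BYTES = 2
CLASSIFIERS = 3

for seq_len in (8192, 16384, 32768):
    total = dense_attention_bytes(seq_len, ASSUMED_HEADS, ASSUMED_BYTES, CLASSIFIERS)
    print(f"{seq_len:>6} tokens: {total / GIB:.1f} GiB")
```

Under these assumed values the 8K case comes to 4.5 GiB, and doubling the context length quadruples the footprint, which is the behavior an $O(n)$ attention operator avoids.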

Taxonomy

Topics: Network Packet Processing and Optimization · Software-Defined Networks and 5G · Graph Theory and Algorithms