98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router
Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

TL;DR
This paper introduces an optimized semantic router for large language models that achieves a 98× speedup and a much smaller attention memory footprint without requiring a dedicated GPU, enabling efficient long-context classification.
Contribution
The paper presents three staged optimizations (a custom Flash Attention operator, prompt compression, and near-streaming processing) that together improve speed and reduce memory for LLM routing.
Findings
38.7× latency reduction with Flash Attention
Prompt compression reduces input tokens to 512
Total 98× speedup from baseline
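As a rough reading of these figures, the short sketch below estimates how much of the total speedup is left for the stages after Flash Attention. It assumes the per-stage speedups multiply against a common baseline, which this summary does not state explicitly, so the result is only indicative.

```python
# Figures reported in the findings above.
total_speedup = 98.0
flash_attention_speedup = 38.7

# Assumption (not stated in the source): stage speedups compose
# multiplicatively, so the later stages account for the remaining ratio.
remaining_speedup = total_speedup / flash_attention_speedup

print(f"Implied speedup from the later stages: {remaining_speedup:.2f}x")
```

Under that assumption, prompt compression and near-streaming together would contribute roughly a further 2.5× on top of Flash Attention.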
Abstract
System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU, an expensive resource better used for LLM inference itself. When the router is co-located on the same GPU as vLLM serving instances, standard attention's O(n²) memory makes long-context classification (8K to 32K tokens) impossible: at 8K tokens, three concurrent classifiers need 4.5 GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. Stage 1: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from O(n²) to O(n) and…
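The 4.5 GB figure can be checked with back-of-the-envelope arithmetic on the quadratic attention footprint. The sketch below is illustrative only: the head count (12) and entry size (2 bytes) are assumptions chosen because they reproduce the stated number, and the abstract excerpt does not give the actual model configuration. The helper name `quadratic_attention_bytes` is hypothetical.

```python
def quadratic_attention_bytes(seq_len, num_heads, bytes_per_entry, num_models):
    """Bytes for one (seq_len x seq_len) matrix per head, per model."""
    return num_models * num_heads * seq_len * seq_len * bytes_per_entry

GIB = 2**30

# Assumed configuration (not from the source): 12 heads, 2-byte entries.
# The three concurrent classifiers come from the abstract.
total_8k = quadratic_attention_bytes(8 * 1024, num_heads=12,
                                     bytes_per_entry=2, num_models=3)
total_32k = quadratic_attention_bytes(32 * 1024, num_heads=12,
                                      bytes_per_entry=2, num_models=3)

print(total_8k / GIB)   # 4.5
print(total_32k / GIB)  # 72.0
```

Because the footprint grows with the square of the sequence length, moving from 8K to 32K tokens multiplies it by 16 under the same assumptions, which is why a linear-memory attention kernel matters for a router sharing a GPU with vLLM.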
Taxonomy
Topics: Network Packet Processing and Optimization · Software-Defined Networks and 5G · Graph Theory and Algorithms
