OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency

Jun Wang; Yunxiang Yao; Wenwei Kuang; Runze Mao; Zhenhao Sun; Zhuang Tao; Ziyang Zhang; Dengyu Li; Jiajun Chen; Zhili Wang; Kai Cui; Congzhi Cai; Longwen Lan; Ken Zhang

arXiv:2511.22481·cs.DC·December 1, 2025

Open Access

TL;DR

OmniInfer is a system-level framework, built atop vLLM, that optimizes large language model serving by integrating load-aware Mixture-of-Experts scheduling (OmniPlacement), sparse attention acceleration (OmniAttn), and disaggregation-aware request scheduling (OmniProxy), raising throughput and lowering latency.

Contribution

The paper introduces a unified acceleration framework that combines three complementary components (OmniPlacement, OmniAttn, and OmniProxy) to optimize end-to-end LLM serving, targeting intensive computation, strict latency constraints, and throughput bottlenecks.

Findings

01

Achieves 616 QPM (queries per minute) on DeepSeek-R1 with a 36% reduction in TPOT (time per output token).

02

Reduces TTFT (time to first token) by 38% through integrated system optimizations.

03

Reports these gains for DeepSeek-R1 served on a 10-node Ascend 910C cluster.

Abstract

Large Language Models drive a wide range of modern AI applications but impose substantial challenges on large-scale serving systems due to intensive computation, strict latency constraints, and throughput bottlenecks. We introduce OmniInfer, a unified system-level acceleration framework designed to maximize end-to-end serving efficiency through fine-grained optimization of expert placement, cache compression, and scheduling. OmniInfer integrates three complementary components: OmniPlacement for load-aware Mixture-of-Experts scheduling, OmniAttn for sparse attention acceleration, and OmniProxy for disaggregation-aware request scheduling. Built atop vLLM, OmniInfer delivers system-wide performance gains through adaptive resource disaggregation, efficient sparsity exploitation, and global coordination across prefill and decode phases. Evaluated on DeepSeek-R1 within a 10-node Ascend 910C…
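To make the idea of load-aware expert placement concrete, the following is a minimal sketch of one generic way to balance Mixture-of-Experts load across devices: a greedy heuristic that assigns the heaviest remaining expert to the currently least-loaded device. This is not OmniPlacement's algorithm, which the abstract does not describe; the function names `place_experts` and `device_loads`, the load values, and the device count are all hypothetical and chosen only for illustration.

```python
import heapq
from typing import Dict, List


def place_experts(expert_load: Dict[int, float], num_devices: int) -> List[List[int]]:
    """Greedy balancing sketch (hypothetical helper, not OmniPlacement itself).

    Assigns each expert, heaviest first, to the device with the smallest
    accumulated load so far.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")

    # Min-heap of (accumulated load, device index).
    heap = [(0.0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement: List[List[int]] = [[] for _ in range(num_devices)]

    # Heaviest experts first; ties broken by expert id for determinism.
    ordered = sorted(expert_load.items(), key=lambda kv: (-kv[1], kv[0]))
    for expert, load in ordered:
        total, device = heapq.heappop(heap)
        placement[device].append(expert)
        heapq.heappush(heap, (total + load, device))
    return placement


def device_loads(expert_load: Dict[int, float], placement: List[List[int]]) -> List[float]:
    """Sum the assumed per-expert load on each device."""
    return [sum(expert_load[e] for e in experts) for experts in placement]


# Assumed, made-up per-expert token loads for six experts on two devices.
loads = {0: 9.0, 1: 7.0, 2: 6.0, 3: 5.0, 4: 2.0, 5: 1.0}
placement = place_experts(loads, num_devices=2)
balanced = device_loads(loads, placement)
print(placement, balanced)
```

In a real serving system the load estimates would come from runtime routing statistics and placement would also have to account for memory limits and the cost of moving expert weights; the sketch ignores both.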

Taxonomy

Topics: Big Data and Digital Economy · IoT and Edge/Fog Computing · Cloud Computing and Resource Management