Equinox: Holistic Fair Scheduling in Serving Large Language Models

Zhixiang Wei; James Yen; Jingyi Chen; Ziyang Zhang; Zhibai Huang; Chen Chen; Xingzi Yu; Yicheng Gu; Chenggang Wu; Yun Wang; Mingyuan Xia; Jie Wu; Hao Wang; Zhengwei Qi

arXiv:2508.16646·cs.DC·August 26, 2025

TL;DR

Equinox introduces a holistic fairness-aware scheduling system for large language model serving, predicting key metrics to optimize throughput, latency, and fairness simultaneously across heterogeneous hardware.

Contribution

The paper presents a novel deterministic Mixture of Prediction Experts framework and an open-source system, Equinox, for proactive fairness-aware scheduling in LLM serving.

Findings

01. Up to 1.3x higher throughput compared to VTC

02. 60% lower time-to-first-token latency

03. 13% higher fairness while maintaining 94% GPU utilization

Abstract

We address the limitations of current LLM serving with a dual-counter framework that separates the user and operator perspectives. The User Fairness Counter measures quality of service via weighted tokens and latency; the Resource Fairness Counter measures operational efficiency through throughput and GPU utilization. Because these metrics are available only after execution, which creates a scheduling paradox, we introduce a deterministic Mixture of Prediction Experts (MoPE) framework to predict user-perceived latency, output tokens, throughput, and GPU utilization. These predictions enable calculation of a unified Holistic Fairness score that balances both counters through tunable parameters for proactive fairness-aware scheduling. We implement this in Equinox, an open-source system with further optimizations such as adaptive batching and stall-free scheduling. Evaluations on production traces (ShareGPT,…
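To make the dual-counter idea concrete, here is a minimal Python sketch of how predicted metrics could feed two per-client counters and a combined score. It is an illustration only, not the Equinox implementation: the abstract does not give the formulas, so the names `Prediction`, `Client`, `ufc_delta`, `rfc_delta`, `holistic_score`, and `pick_next`, the weights `w_in`, `w_out`, `w_lat`, `alpha`, `beta`, and the specific arithmetic are all assumptions. The `Prediction` values stand in for what MoPE would predict before execution.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Predicted metrics for one request (stand-in for MoPE output)."""
    latency_s: float       # predicted user-perceived latency
    output_tokens: int     # predicted output length
    throughput_tps: float  # predicted tokens per second
    gpu_util: float        # predicted GPU utilization in [0, 1]


@dataclass
class Client:
    name: str
    ufc: float = 0.0  # User Fairness Counter (assumed: service received)
    rfc: float = 0.0  # Resource Fairness Counter (assumed: resources used)


def ufc_delta(input_tokens: int, pred: Prediction,
              w_in: float = 1.0, w_out: float = 2.0,
              w_lat: float = 1.0) -> float:
    # Assumed form: weighted tokens, discounted when latency is high.
    weighted = w_in * input_tokens + w_out * pred.output_tokens
    return weighted / (1.0 + w_lat * pred.latency_s)


def rfc_delta(pred: Prediction) -> float:
    # Assumed form: busy GPU-seconds, from throughput and utilization.
    return (pred.output_tokens / pred.throughput_tps) * pred.gpu_util


def holistic_score(c: Client, alpha: float, beta: float) -> float:
    # Assumed form: linear blend of both counters via tunable weights.
    return alpha * c.ufc + beta * c.rfc


def pick_next(clients: list[Client], alpha: float, beta: float) -> Client:
    # Serve the client with the lowest combined counter first.
    return min(clients, key=lambda c: holistic_score(c, alpha, beta))


def charge(c: Client, input_tokens: int, pred: Prediction) -> None:
    # Update counters proactively from predictions, before execution.
    c.ufc += ufc_delta(input_tokens, pred)
    c.rfc += rfc_delta(pred)
```

In this sketch, raising `beta` relative to `alpha` makes the scheduler favor clients that have consumed fewer GPU resources, while raising `alpha` favors clients that have received less service. The actual balance in Equinox is set by its own tunable parameters, which the abstract does not specify.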
