Hyperion: Hierarchical Scheduling for Parallel LLM Acceleration in Multi-tier Networks

Mulei Ma; Xinyi Xu; Minrui Xu; Zihan Chen; Yang Yang; Tony Q.S. Quek

arXiv:2511.14450 · cs.DC · December 2, 2025

Open Access

TL;DR

Hyperion is a hierarchical framework that optimizes model partitioning and request scheduling for parallel LLM inference in multi-tier networks, significantly reducing latency and improving scalability without retraining.

Contribution

It introduces a novel two-stage optimization framework combining offline partitioning with online scheduling to efficiently manage resources for edge-based LLM deployment.

Findings

01

Reduces latency by up to 52.1% compared to GPipe.

02

Achieves 44.5% lower latency and higher GPU utilization in long-sequence generation.

03

Demonstrates scalability benefits in heterogeneous multi-tier networks.

Abstract

LLMs are increasingly executed at the edge, where limited GPU memory and heterogeneous computation jointly constrain deployment, which motivates model partitioning and request scheduling. In this setting, minimizing latency requires addressing the tight coupling between model placement and request scheduling across heterogeneous nodes, as suboptimal decisions in one domain can negate benefits in the other. In this paper, we propose Hyperion, a hierarchical two-stage framework that jointly optimizes partitioning and scheduling for pipelined LLM inference. Hyperion minimizes latency by balancing resources across tiers without requiring model retraining or incurring significant runtime overhead. Leveraging the timescale difference between partitioning and request arrivals, Stage 1 performs offline, inter-tier partitioning via a Hyperion Split with Dynamic Programming (HypSplit-DP) procedure to…
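The abstract names a dynamic-programming procedure for Stage 1 but is cut off before describing it. To make the general idea concrete, the following is a minimal sketch of a generic pipeline-partitioning DP, not the authors' HypSplit-DP. It assumes a simplified cost model that the abstract does not state: layers are assigned to ordered tiers in contiguous blocks, each tier has a relative compute speed and a memory cap, communication cost is ignored, and the objective is the slowest pipeline stage. The function `partition_layers` and all of its parameters are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def partition_layers(
    layer_cost: List[float],   # assumed per-layer compute cost (arbitrary units)
    layer_mem: List[float],    # assumed per-layer memory footprint
    tier_speed: List[float],   # assumed relative compute speed of each tier
    tier_mem: List[float],     # assumed memory capacity of each tier
) -> Tuple[float, List[Tuple[int, int]]]:
    """Assign contiguous layer blocks to ordered tiers, minimizing the slowest stage.

    Returns (bottleneck_time, blocks), where blocks[t] = (start, end) is the
    half-open layer range placed on tier t. A tier may receive an empty block.
    """
    n, k = len(layer_cost), len(tier_speed)
    inf = float("inf")

    # Prefix sums so any block's cost and memory are O(1) lookups.
    pc, pm = [0.0], [0.0]
    for c, m in zip(layer_cost, layer_mem):
        pc.append(pc[-1] + c)
        pm.append(pm[-1] + m)

    # best[t][i]: smallest achievable bottleneck when the first i layers
    # are placed on the first t tiers; cut[t][i] records where tier t starts.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for t in range(1, k + 1):
        for i in range(n + 1):
            for j in range(i + 1):  # tier t holds layers j .. i-1
                if best[t - 1][j] == inf:
                    continue
                if pm[i] - pm[j] > tier_mem[t - 1]:
                    continue  # block does not fit in this tier's memory
                stage_time = (pc[i] - pc[j]) / tier_speed[t - 1]
                candidate = max(best[t - 1][j], stage_time)
                if candidate < best[t][i]:
                    best[t][i] = candidate
                    cut[t][i] = j

    if best[k][n] == inf:
        raise ValueError("no feasible partition under the memory limits")

    # Walk the recorded cut points backwards to recover each tier's block.
    blocks, i = [], n
    for t in range(k, 0, -1):
        j = cut[t][i]
        blocks.append((j, i))
        i = j
    blocks.reverse()
    return best[k][n], blocks
```

For example, with four equal-cost layers and a second tier three times faster than the first, `partition_layers([4, 4, 4, 4], [1, 1, 1, 1], [1.0, 3.0], [4, 4])` places one layer on the slow tier and three on the fast one. Because this kind of search runs once offline, its cubic cost in the number of layers is usually tolerable. A real system would also need to account for inter-tier transfer time, which this sketch omits.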

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems