Hyperion: Hierarchical Scheduling for Parallel LLM Acceleration in Multi-tier Networks
Mulei Ma, Xinyi Xu, Minrui Xu, Zihan Chen, Yang Yang, Tony Q. S. Quek

TL;DR
Hyperion is a hierarchical framework that optimizes model partitioning and request scheduling for parallel LLM inference in multi-tier networks, significantly reducing latency and improving scalability without retraining.
Contribution
It introduces a novel two-stage optimization framework combining offline partitioning with online scheduling to efficiently manage resources for edge-based LLM deployment.
Findings
Reduces latency by up to 52.1% compared to GPipe.
Achieves 44.5% lower latency and higher GPU utilization in long-sequence generation.
Demonstrates scalability benefits in heterogeneous multi-tier networks.
Abstract
LLMs are increasingly executed at the edge, where limited GPU memory and heterogeneous computation jointly constrain deployment, motivating model partitioning and request scheduling. In this setting, minimizing latency requires addressing the tight coupling between model placement and request scheduling across heterogeneous nodes, since a suboptimal decision in one domain can negate the benefits gained in the other. In this paper, we propose Hyperion, a hierarchical two-stage framework that jointly optimizes partitioning and scheduling for pipelined LLM inference. Hyperion minimizes latency by balancing resources across tiers without requiring model retraining or incurring significant runtime overhead. Leveraging the timescale difference between partitioning and request arrivals, Stage 1 performs offline, inter-tier partitioning via a Hyperion Split with Dynamic Programming (HypSplit-DP) procedure to…
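The abstract is cut off before it describes HypSplit-DP in detail, so the following is only a minimal sketch of the general idea of dynamic-programming partitioning, not the authors' procedure. It assumes a model whose layers are split into contiguous blocks, one block per tier in a fixed order, and it minimizes the slowest pipeline stage subject to per-tier memory limits. The function name `partition_layers`, its parameters, and the cost model (layer cost divided by tier speed, with inter-tier communication ignored) are all hypothetical.

```python
from math import inf

def partition_layers(layer_cost, layer_mem, tier_speed, tier_mem):
    """Illustrative DP (not HypSplit-DP): assign contiguous layer blocks to
    tiers in order, minimizing the bottleneck stage time under memory caps.
    Returns (bottleneck_time, [(start, end), ...]) or None if infeasible."""
    n, k = len(layer_cost), len(tier_speed)

    # Prefix sums so any block's total cost/memory is an O(1) lookup.
    pc = [0.0] * (n + 1)
    pm = [0.0] * (n + 1)
    for i in range(n):
        pc[i + 1] = pc[i] + layer_cost[i]
        pm[i + 1] = pm[i] + layer_mem[i]

    # best[t][i]: smallest bottleneck when the first t tiers hold the first i layers.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for t in range(1, k + 1):
        for i in range(n + 1):
            for j in range(i + 1):  # tier t-1 takes layers j..i-1 (may be empty)
                if best[t - 1][j] == inf:
                    continue
                if pm[i] - pm[j] > tier_mem[t - 1]:
                    continue  # block does not fit in this tier's memory
                stage = (pc[i] - pc[j]) / tier_speed[t - 1]
                cand = max(best[t - 1][j], stage)
                if cand < best[t][i]:
                    best[t][i] = cand
                    cut[t][i] = j

    if best[k][n] == inf:
        return None

    # Walk the recorded cut points backwards to recover each tier's block.
    bounds, i = [], n
    for t in range(k, 0, -1):
        j = cut[t][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return best[k][n], bounds

# Hypothetical input: four equal layers, a slow tier followed by a 3x faster tier.
result = partition_layers([4, 4, 4, 4], [1, 1, 1, 1], [1, 3], [4, 4])
print(result)
```

In this toy input the faster tier ends up with three of the four layers, which is the balancing behavior the abstract attributes to partitioning across heterogeneous tiers. A realistic cost model would also account for activation transfer between tiers and for the online scheduling stage, both of which this sketch omits.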
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
