Power Aware Dynamic Reallocation For Inference

Yiwei Jiang; Sangeeta Chowdhary; Nathaniel Morris; Rutwik Jain; Srilatha Manne; Sam Bayliss

arXiv:2601.12241·cs.DC·January 21, 2026

TL;DR

This paper introduces RAPID, a power-aware framework for disaggregated LLM inference that jointly manages GPU roles and power budgets to sustain goodput within strict power caps, with up to a 2x improvement in SLO attainment at peak load.

Contribution

RAPID is the first framework to jointly optimize GPU roles and power budgets for disaggregated inference, achieving better performance under power constraints.

Findings

01. Up to 2x improvement in SLO attainment at peak load.

02. Significant performance gains over static power assignment.

03. Enhanced application consistency under power caps.

Abstract

Disaggregation has emerged as a powerful strategy for optimizing large language model (LLM) inference by separating compute-intensive prefill and memory-bound decode phases across specialized GPUs. This separation improves utilization and throughput under fixed hardware capacity. However, as model and cluster scales grow, power, rather than compute, has become the dominant limiter of overall performance and cost efficiency. In this paper, we propose RAPID, a power-aware disaggregated inference framework that jointly manages GPU roles and power budgets to sustain goodput within strict power caps. RAPID utilizes static and dynamic power reallocation in addition to GPU reallocation to improve performance under fixed power bounds. RAPID improves overall performance and application consistency beyond what is achievable in current disaggregation solutions, resulting in up to a 2x improvement…
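The abstract does not spell out RAPID's reallocation policy, so the following is only an illustrative sketch of the general idea of dividing a fixed power cap between prefill and decode pools. Everything in it is an assumption made for illustration: the function name `split_power`, the `PoolPlan` type, the load signals, the proportional rule, and the 200 W and 700 W per-GPU limits are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PoolPlan:
    gpus: int             # number of GPUs assigned to this role
    watts_per_gpu: float  # per-GPU power cap in watts


def split_power(cap_w, n_prefill, n_decode, prefill_load, decode_load,
                min_w=200.0, max_w=700.0):
    """Split a cluster power cap between prefill and decode pools.

    Hypothetical policy, not the paper's: every GPU first receives
    min_w, the remaining headroom is divided between the two pools in
    proportion to their load signals, and each per-GPU cap is clamped
    to max_w.
    """
    floor = (n_prefill + n_decode) * min_w
    if floor > cap_w:
        raise ValueError("power cap is below the minimum per-GPU floor")
    headroom = cap_w - floor
    total_load = prefill_load + decode_load
    # With no load signal at all, fall back to an even split.
    share = 0.5 if total_load == 0 else prefill_load / total_load
    prefill_w = min_w + (headroom * share / n_prefill if n_prefill else 0.0)
    decode_w = min_w + (headroom * (1 - share) / n_decode if n_decode else 0.0)
    return (PoolPlan(n_prefill, min(prefill_w, max_w)),
            PoolPlan(n_decode, min(decode_w, max_w)))


if __name__ == "__main__":
    # 8 GPUs under a 4000 W cap, with prefill three times as loaded as decode.
    prefill, decode = split_power(4000, 4, 4, prefill_load=3.0, decode_load=1.0)
    print(prefill, decode)
```

A dynamic scheme in this spirit would re-run such a split whenever the load signals change, and a role-reallocation step would additionally change `n_prefill` and `n_decode`. How RAPID actually makes either decision is not described in the text shown here.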

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques