Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training
Ruofan Wu, Jae-Won Chung, Mosharaf Chowdhury

TL;DR
Kareus is a training system that jointly optimizes dynamic and static energy consumption in large model training. It reduces either training energy at a fixed training time or training time at a fixed energy.
Contribution
Kareus introduces a joint optimization approach for dynamic and static energy. It decomposes the intractable joint problem into local, partition-based subproblems and solves them with a multi-pass multi-objective optimization algorithm.
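The decomposition idea can be illustrated with a generic Pareto-frontier sketch. This is not Kareus's multi-pass algorithm. It assumes that each partition already has a list of candidate schedules profiled as (time, energy) pairs, and that partitions run back to back so their time and energy simply add. The names `pareto_front` and `combine` are hypothetical.

```python
from typing import List, Tuple

# A schedule is summarized by (time, energy); lower is better for both.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated (time, energy) points, sorted by time."""
    front: List[Point] = []
    best_energy = float("inf")
    for t, e in sorted(set(points)):  # ascending time, ties by energy
        if e < best_energy:  # keep only if it beats every faster point on energy
            front.append((t, e))
            best_energy = e
    return front


def combine(fronts: List[List[Point]]) -> List[Point]:
    """Combine per-partition candidates into a whole-iteration frontier.

    Assumption (simplification): partitions execute sequentially and
    independently, so total time and total energy are sums.
    """
    combined: List[Point] = [(0.0, 0.0)]
    for f in fronts:
        candidates = [(t1 + t2, e1 + e2) for (t1, e1) in combined for (t2, e2) in f]
        combined = pareto_front(candidates)  # prune after each partition
    return combined


partition_a = [(1.0, 10.0), (2.0, 6.0), (3.0, 7.0)]  # (3.0, 7.0) is dominated
partition_b = [(1.0, 5.0), (1.5, 4.0)]
print(combine([partition_a, partition_b]))
```

Pruning after each partition keeps the candidate set small, which is the main reason to solve local subproblems instead of enumerating every global schedule.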
Findings
Reduces training energy by up to 28.3% at the same training time
Reduces training time by up to 27.5% at the same energy
Outperforms state-of-the-art methods
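The two headline numbers come from reading the time–energy frontier in two ways: fix the time and compare energy, or fix the energy and compare time. The minimal sketch below uses hypothetical numbers, not measurements from the paper, and hypothetical helper names.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time, energy) of one execution schedule


def min_energy_within_time(front: List[Point], time_budget: float) -> float:
    """Lowest energy among schedules no slower than time_budget (raises if none)."""
    return min(e for t, e in front if t <= time_budget)


def min_time_within_energy(front: List[Point], energy_budget: float) -> float:
    """Shortest time among schedules using at most energy_budget (raises if none)."""
    return min(t for t, e in front if e <= energy_budget)


def pct_reduction(baseline: float, new: float) -> float:
    return 100 * (baseline - new) / baseline


# Hypothetical values for illustration only.
baseline_time, baseline_energy = 10.0, 100.0
front = [(8.0, 90.0), (10.0, 75.0), (12.0, 70.0)]

iso_time = pct_reduction(baseline_energy, min_energy_within_time(front, baseline_time))
iso_energy = pct_reduction(baseline_time, min_time_within_energy(front, baseline_energy))
print(f"energy saved at same time: {iso_time:.1f}%")
print(f"time saved at same energy: {iso_energy:.1f}%")
```

The two percentages are different operating points on the same frontier, so they cannot both be achieved at once.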
Abstract
The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive, contended resource that requires explicit management and optimization. Although recent works have made significant progress in large model training optimization, they focus only on a single aspect of energy consumption: dynamic or static energy. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently impact both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time–energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems. It then uses a multi-pass multi-objective optimization algorithm to find execution schedules that push the time–energy…
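A toy model shows why frequency scaling affects dynamic and static energy in opposite directions. It assumes that runtime scales as W/f, that dynamic power grows as c·f³ (a common DVFS approximation), and that static power P_s is constant. Under those assumptions, dynamic energy is c·W·f² and static energy is P_s·W/f. Lowering f cuts the first term but stretches runtime and raises the second. The `ToyGPU` class and all parameter values are hypothetical, and the model ignores kernel scheduling, which is the other lever the abstract names.

```python
from dataclasses import dataclass


@dataclass
class ToyGPU:
    """Hypothetical toy model; every parameter is an illustrative assumption."""
    work: float = 1.0          # normalized work W
    dyn_coeff: float = 1.0     # c in dynamic power = c * f**3
    static_power: float = 2.0  # frequency-independent static power P_s

    def time(self, f: float) -> float:
        return self.work / f  # assumes runtime scales inversely with clock

    def dynamic_energy(self, f: float) -> float:
        return self.dyn_coeff * f ** 3 * self.time(f)  # = c * W * f**2

    def static_energy(self, f: float) -> float:
        return self.static_power * self.time(f)  # = P_s * W / f

    def total_energy(self, f: float) -> float:
        return self.dynamic_energy(f) + self.static_energy(f)


gpu = ToyGPU()
freqs = [0.6, 0.8, 1.0, 1.2, 1.4]
for f in freqs:
    print(f"f={f:.1f} dyn={gpu.dynamic_energy(f):.3f} "
          f"static={gpu.static_energy(f):.3f} total={gpu.total_energy(f):.3f}")
best = min(freqs, key=gpu.total_energy)
print("lowest-energy frequency:", best)
```

In this toy setting the total is minimized where 2cWf = P_sW/f², that is, f = (P_s / 2c)^(1/3) = 1.0. Optimizing only dynamic or only static energy would push f to opposite extremes.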
Taxonomy
Topics: Big Data and Digital Economy · Parallel Computing and Optimization Techniques · Green IT and Sustainability
