Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips
Mahmoud Ahmed, Sameh Abdulah, Olatunji Ruwase, Sam Ade Jacobs, Mathis Bode, Mohamed Elhoseiny, David E. Keyes

TL;DR
This paper analyzes energy and performance trade-offs in multimodal training on NVIDIA Grace Hopper (GH200) superchips, focusing on how data movement and hardware-software interactions shape energy efficiency.
Contribution
The paper provides a cross-layer analysis of energy and performance in multimodal training on GH200 and offers guidelines for balancing offloading, parallelism, and scheduling.
Findings
Energy efficiency is mainly influenced by data movement and overlap.
Runtime-optimized configurations may not be energy-optimal.
High-bandwidth interconnects enable effective offloading and parallelism.
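The second finding follows from the fact that energy is average power multiplied by time, so a slightly slower run that draws less power can consume less energy overall. The sketch below illustrates that arithmetic only. The `RunProfile` class, the configuration names, and every number are hypothetical placeholders chosen for illustration; they are not measurements or configurations from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunProfile:
    """Hypothetical summary of one training configuration."""
    name: str
    runtime_s: float    # wall-clock time for a fixed amount of training work
    avg_power_w: float  # average power draw over that run

    @property
    def energy_j(self) -> float:
        # Energy (joules) = average power (watts) * time (seconds)
        return self.runtime_s * self.avg_power_w


# Illustrative values only, NOT results from the paper.
profiles = [
    RunProfile("config_a", runtime_s=100.0, avg_power_w=900.0),
    RunProfile("config_b", runtime_s=110.0, avg_power_w=700.0),
]

# Pick the best configuration under each objective.
fastest = min(profiles, key=lambda p: p.runtime_s)
most_efficient = min(profiles, key=lambda p: p.energy_j)

for p in profiles:
    print(f"{p.name}: {p.runtime_s:.0f} s, {p.energy_j / 1000:.0f} kJ")
print("runtime-optimal:", fastest.name)
print("energy-optimal:", most_efficient.name)
```

With these placeholder values the two objectives select different configurations, which is the situation the finding describes. In practice the average power would come from measured samples rather than a fixed constant.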
Abstract
Multimodal deep learning models enable joint learning across heterogeneous data sources, including text, images, and video, but their rapid scaling introduces significant memory and communication bottlenecks. As model sizes and sequence lengths increase, training performance becomes increasingly impacted by data movement rather than computation. Frameworks such as DeepSpeed mitigate these challenges through CPU offloading, activation checkpointing, and communication optimizations. However, these techniques introduce additional system activity, which may affect energy efficiency. Meanwhile, tightly integrated heterogeneous architectures, such as the NVIDIA Grace Hopper (GH200) superchip, provide high-bandwidth CPU-GPU interconnects and unified memory, thereby reducing data transfer overhead. In this work, we present a cross-layer analysis of energy and performance trade-offs in…
