Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC
Ashna Nawar Ahmed, Banooqa Banday, Terry Jones, Tanzima Z. Islam

TL;DR
This paper introduces a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework that uses attention-based embeddings of job telemetry to navigate power-performance trade-offs in HPC scheduling, reporting higher-quality Pareto fronts with less training data.
Contribution
The paper is presented as the first to apply embedding-informed surrogates within a multi-objective Bayesian optimization (MOBO) framework for HPC scheduling, improving Pareto front quality and reducing training costs.
Findings
Higher-quality Pareto fronts for runtime-power trade-offs.
Significant reduction in training costs.
Improved stability of optimization results.
Abstract
High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task boils down to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the…
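To make the notion of a runtime-power Pareto front concrete, here is a minimal Python sketch. It is not the authors' method: it contains no surrogate model, embedding, or acquisition strategy, and only shows how non-dominated node counts could be picked out once runtime and power are known for each candidate. The function name `pareto_front` and all numbers are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points when minimizing both objectives."""
    front = []
    for i, (r_i, p_i) in enumerate(points):
        # Point j dominates point i if it is no worse in both objectives
        # and strictly better in at least one.
        dominated = any(
            (r_j <= r_i and p_j <= p_i) and (r_j < r_i or p_j < p_i)
            for j, (r_j, p_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical (runtime in seconds, power in watts) per node count.
# These values are made up for illustration and do not come from the paper.
candidates = {
    1: (100.0, 300.0),
    2: (60.0, 550.0),
    4: (40.0, 1000.0),
    8: (45.0, 2100.0),  # slower and more power-hungry than 4 nodes
}

node_counts = list(candidates)
front = [node_counts[i] for i in pareto_front(list(candidates.values()))]
print(front)
```

In this toy data the 8-node configuration is dominated by the 4-node one, so only the remaining node counts represent genuine trade-offs between runtime and power. In a MOBO setting, the measured values would be replaced by surrogate predictions for configurations that have not yet been run.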
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
