COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC
Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, Mohammad Zaeed, Tanzima Z. Islam

TL;DR
COMPASS is a decision engine that applies machine learning to HPC operational traces to recommend configurations, improve performance, and back its recommendations with evidence and uncertainty estimates.
Contribution
It formalizes HPC configuration questions into query patterns and develops an interactive ML-based engine that guides tuning with evidence and uncertainty measures.
Findings
Cuts average job turnaround time by 65.93%
Reduces node usage by 80.93%
Achieves up to 100x faster training and 80x faster inference than existing methods
Abstract
HPC systems expose many configuration parameters that jointly drive competing objectives. Existing tools such as autotuners recommend good configurations but do not identify minimal changes for a near-miss configuration to meet a performance objective, and they often ignore domain-specific constraints. To address this gap, we introduce COMPASS -- a modular, programmable engine that uses operational traces to generate HPC configuration recommendations and guide tuning decisions. This paper: (1) formalizes configuration questions into query patterns; (2) develops an interactive decision-making engine that formulates these queries as Machine Learning (ML) tasks; and (3) quantifies the trustworthiness of its recommendations by providing evidence and quantifying uncertainty, and -- when confidence is low -- provides guidance on which configurations to run next. We validate COMPASS using…
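To make the "minimal change for a near-miss configuration" idea concrete, here is a toy sketch in Python. It is not the COMPASS implementation. The trace table, the parameter names (nodes, threads per rank), and the functions `predict`, `num_changes`, and `minimal_change` are all hypothetical. A lookup over repeated runs stands in for a learned model, and run-to-run spread serves as a crude proxy for uncertainty.

```python
from statistics import mean, pstdev

# Hypothetical operational traces (illustrative values, not from the paper):
# (nodes, threads_per_rank) -> observed runtimes in seconds.
TRACES = {
    (2, 4): [120.0, 124.0],
    (2, 8): [95.0, 97.0],
    (4, 4): [70.0, 90.0],
    (4, 8): [55.0, 57.0],
}

def predict(config):
    """Return (mean runtime, spread) for a config; a stand-in for an ML model."""
    runs = TRACES[config]
    return mean(runs), pstdev(runs)

def num_changes(a, b):
    """Count how many parameters differ between two configurations."""
    return sum(x != y for x, y in zip(a, b))

def minimal_change(current, target_runtime, max_spread):
    """Find the config with the fewest parameter changes from `current`
    whose predicted runtime meets the target; ties go to the faster one."""
    candidates = []
    for config in TRACES:
        est, spread = predict(config)
        if est <= target_runtime:
            candidates.append((num_changes(current, config), est, spread, config))
    if not candidates:
        return None
    changes, est, spread, config = min(candidates)
    return {
        "config": config,
        "changes": changes,
        "predicted_runtime": est,
        "spread": spread,
        # Low confidence signals that this config should be run again.
        "confident": spread <= max_spread,
    }

result = minimal_change(current=(2, 4), target_runtime=100.0, max_spread=5.0)
print(result)
```

In this toy data, the recommended single-parameter change has a wide spread across its recorded runs, so the result is flagged as not confident. That is the kind of case in which a system would suggest gathering more runs before trusting the recommendation.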
