A Unified Metric Architecture for AI Infrastructure: A Cross-Layer Taxonomy Integrating Performance, Efficiency, and Cost
Qi He

TL;DR
This paper introduces a comprehensive framework that unifies diverse metrics across physical, computational, and economic domains to optimize AI infrastructure performance, efficiency, and cost.
Contribution
It develops a cross-layer taxonomy and a Metric Propagation Graph (MPG) to systematically analyze and optimize the constraints on AI infrastructure and the interactions among them.
Findings
Unified 6x3 taxonomy of AI infrastructure metrics
Formalization of cross-layer dependencies via MPG
Enhanced benchmarking and optimization capabilities
Abstract
The growth of large-scale AI systems is increasingly constrained by infrastructure limits: power availability, thermal and water constraints, interconnect scaling, memory pressure, data-pipeline throughput, and rapidly escalating lifecycle cost. Across hyperscale clusters, these constraints interact, yet the main metrics remain fragmented. Existing metrics, ranging from facility measures (PUE) and rack power density to network metrics (all-reduce latency), data-pipeline measures, and financial metrics (TCO series), each capture only their own domain and provide no integrated view of how physical, computational, and economic constraints interact. This fragmentation obscures the structural relationships among energy, computation, and cost, preventing coherent optimization across sectors and hiding how bottlenecks emerge, propagate, and jointly determine the efficiency frontier of AI…
Taxonomy
Topics: Cloud Computing and Resource Management · Big Data and Digital Economy · Software-Defined Networks and 5G
