HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models
Rhea Sanjay Sukthanker, Arber Zela, Benedikt Staffler, Aaron Klein, Lennart Purucker, Joerg K.H. Franke, Frank Hutter

TL;DR
HW-GPT-Bench is a hardware-aware benchmark that uses surrogate models to efficiently evaluate and optimize GPT-2-based language models across multiple hardware metrics and devices.
Contribution
It introduces a surrogate-based benchmarking framework for rapid hardware metric estimation of GPT-2 architectures on diverse devices.
Findings
Accurately models latency and energy with calibrated surrogates.
Enables fast simulation of multi-objective optimization trajectories.
Supports evaluation of models up to 1.55B parameters.
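The multi-objective setting mentioned in the findings can be illustrated with a small sketch. It assumes two objectives that are both minimized (latency and perplexity) and keeps only the configurations that no other configuration dominates. The function name `pareto_front` and the numbers are hypothetical and chosen for illustration; they are not benchmark results and this is not the benchmark's own API.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latency_ms, perplexity), both to be minimized


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by the first objective."""
    front = []
    for p in points:
        # p is dominated if some other point is at least as good in both objectives.
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (latency in ms, perplexity) pairs, for illustration only.
candidates = [(10.0, 30.0), (12.0, 25.0), (15.0, 26.0), (20.0, 22.0), (11.0, 31.0)]
front = pareto_front(candidates)
print(front)
```

With surrogate predictions in place of real measurements, a filter like this can be applied to many candidate architectures without training or profiling each one.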
Abstract
The increasing size of language models necessitates a thorough analysis across multiple dimensions to assess trade-offs among crucial hardware metrics such as latency, energy consumption, GPU memory usage, and performance. Identifying optimal model configurations under specific hardware constraints is becoming essential but remains challenging due to the computational load of exhaustive training and evaluation on multiple devices. To address this, we introduce HW-GPT-Bench, a hardware-aware benchmark that utilizes surrogate predictions to approximate various hardware metrics across 13 devices of architectures in the GPT-2 family, with architectures containing up to 1.55B parameters. Our surrogates, via calibrated predictions and reliable uncertainty estimates, faithfully model the heteroscedastic noise inherent in the energy and latency measurements. To estimate perplexity, we employ…
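As a rough illustration of what modeling heteroscedastic noise can mean, the sketch below assumes a surrogate that outputs a mean and a standard deviation per architecture and scores an observation with the Gaussian negative log-likelihood. This is a generic formulation, not necessarily the surrogate used by the authors; the helper names `gaussian_nll` and `sample_latency` are hypothetical.

```python
import math
import random


def gaussian_nll(y: float, mean: float, std: float) -> float:
    """Negative log-likelihood of y under a Normal(mean, std^2) prediction."""
    return 0.5 * math.log(2 * math.pi * std ** 2) + (y - mean) ** 2 / (2 * std ** 2)


def sample_latency(mean: float, std: float, rng: random.Random) -> float:
    """Draw one simulated measurement from the predicted distribution."""
    return rng.gauss(mean, std)


# Hypothetical predictions: the same mean, but different noise levels per input.
rng = random.Random(0)
quiet = [sample_latency(10.0, 0.1, rng) for _ in range(5)]
noisy = [sample_latency(10.0, 2.0, rng) for _ in range(5)]

# An observation far from the mean is penalized less when the predicted std is larger.
nll_narrow = gaussian_nll(3.0, 0.0, 1.0)
nll_wide = gaussian_nll(3.0, 0.0, 2.0)
```

Because the predicted standard deviation may differ from one architecture to another, a model scored this way can assign more uncertainty to configurations whose measurements are noisier.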
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Cosine Annealing · Residual Connection · Discriminative Fine-Tuning · Weight Decay · Softmax · Layer Normalization · Byte Pair Encoding · Attention Dropout · Dropout
