RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis
Zhen Bi, Xueshu Chen, Luoyang Sun, Yuhang Yao, Qing Shen, Jungang Lou, Cheng Deng

TL;DR
This paper introduces RooflineBench, a benchmarking framework based on the Roofline model, to evaluate and compare the performance and efficiency of on-device Large Language Models across heterogeneous hardware platforms.
Contribution
It proposes a systematic Roofline-based framework and a new metric, Relative Inference Potential, for analyzing LLM performance on resource-constrained hardware.
Findings
Sequence length significantly influences both performance and operational intensity.
Operational intensity regresses as model depth increases.
Structural refinements like Multi-head Latent Attention improve inference efficiency.
Abstract
The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware. However, objectively measuring the theoretical performance ceilings of diverse architectures across heterogeneous platforms remains a formidable challenge. In this work, we propose a systematic framework based on the Roofline model that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI). By defining an inference-potential region, we introduce the Relative Inference Potential as a novel metric to compare efficiency differences between Large Language Models (LLMs) on the same hardware substrate. Extensive empirical analysis across diverse compute tiers reveals that variations in performance and OI are significantly influenced by sequence length. We…
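The abstract refers to the Roofline model and operational intensity without stating the bound itself. As a minimal sketch of the standard Roofline formulation (not the paper's own code), attainable throughput is the smaller of peak compute and OI times peak memory bandwidth. The `Device` class, the function names, and all hardware and workload numbers below are hypothetical and chosen only for illustration; the Relative Inference Potential metric is not reproduced here because its definition is not given in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical hardware description (illustrative values, not measurements)."""
    peak_flops: float      # peak compute throughput in FLOP/s
    peak_bandwidth: float  # peak memory bandwidth in bytes/s

    @property
    def ridge_point(self) -> float:
        """OI (FLOP/byte) at which the memory roof meets the compute roof."""
        return self.peak_flops / self.peak_bandwidth


def operational_intensity(flops: float, bytes_moved: float) -> float:
    """Operational intensity: FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_flops(device: Device, oi: float) -> float:
    """Standard Roofline bound: min(peak compute, OI * peak bandwidth)."""
    return min(device.peak_flops, oi * device.peak_bandwidth)


def is_memory_bound(device: Device, oi: float) -> bool:
    """A workload left of the ridge point is limited by memory bandwidth."""
    return oi < device.ridge_point


# Assumed edge device: 1 TFLOP/s peak compute, 50 GB/s peak bandwidth.
edge = Device(peak_flops=1e12, peak_bandwidth=50e9)

# Assumed workloads: a low-OI decode-like step and a high-OI prefill-like step.
decode_oi = operational_intensity(flops=2e9, bytes_moved=1e9)
prefill_oi = operational_intensity(flops=1e11, bytes_moved=1e9)

for name, oi in [("decode", decode_oi), ("prefill", prefill_oi)]:
    bound = attainable_flops(edge, oi)
    regime = "memory-bound" if is_memory_bound(edge, oi) else "compute-bound"
    print(f"{name}: OI={oi:.1f} FLOP/byte, bound={bound:.2e} FLOP/s, {regime}")
```

Under these assumed numbers the low-OI workload sits on the bandwidth-limited slope of the roofline and the high-OI workload reaches the compute ceiling, which is the kind of shift the abstract attributes to changes in sequence length.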
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning in Materials Science · Parallel Computing and Optimization Techniques
