LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load
Pranay Tummalapalli, Sahil Arayakandy, Ritam Pal, Kautuk Kundan

TL;DR
This paper benchmarks large language model inference on four edge platforms, showing how thermal, power, and memory constraints affect sustained performance and efficiency.
Contribution
It provides a detailed performance analysis of 4-bit quantised Qwen 2.5 1.5B across mobile, NPU, and GPU platforms under sustained load, highlighting hardware-specific bottlenecks.
Findings
Thermal management is the main constraint on mobile devices.
Dedicated hardware shows distinct bottlenecks like power and memory bandwidth.
Energy efficiency varies greatly across platforms, with some hardware achieving near-zero variance at low throughput.
Abstract
Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is…
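The measurement protocol described in the abstract (one fixed prompt, repeated back-to-back iterations, per-iteration throughput) can be illustrated with a minimal sketch. This is not the authors' harness: `generate`, `run_sustained_benchmark`, and `summarize` are hypothetical names, `generate` is assumed to be any callable that runs one inference pass and returns the number of tokens it produced, and the "retention" metric is only one possible way to quantify throttling. Power and thermal sampling are platform-specific and are left out.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_sustained_benchmark(
    generate: Callable[[str], int],   # hypothetical: runs one pass, returns tokens produced
    prompt: str,
    iterations: int = 20,             # the abstract reports 20 iterations per device
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run back-to-back iterations and return tokens/second for each one."""
    throughputs = []
    for _ in range(iterations):
        start = clock()
        n_tokens = generate(prompt)
        elapsed = clock() - start
        if elapsed <= 0:
            raise ValueError("non-positive elapsed time; clock resolution too coarse")
        throughputs.append(n_tokens / elapsed)
    return throughputs


def summarize(throughputs: List[float]) -> Dict[str, float]:
    """Summarise sustained behaviour from per-iteration throughput."""
    mean_tps = statistics.mean(throughputs)
    return {
        "mean_tps": mean_tps,
        # Coefficient of variation: run-to-run spread relative to the mean.
        "cv": statistics.pstdev(throughputs) / mean_tps,
        # Worst iteration relative to the first: values well below 1.0
        # suggest throttling under sustained load.
        "retention": min(throughputs) / throughputs[0],
    }
```

A retention near 0.5 would correspond to a device that loses about half its initial throughput, while a coefficient of variation near zero would indicate steady throughput across iterations.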
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications
