LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load

Pranay Tummalapalli; Sahil Arayakandy; Ritam Pal; Kautuk Kundan

arXiv:2603.23640·cs.DC·March 26, 2026

Open Access

TL;DR

This paper benchmarks 4-bit quantised Qwen 2.5 1.5B inference on four edge platforms under sustained load, showing how thermal, power, and memory constraints limit sustained performance and efficiency.

Contribution

It provides a detailed performance analysis of Qwen 2.5 1.5B across mobile, NPU, and GPU platforms under sustained load, highlighting the bottleneck specific to each class of hardware.

Findings

01 On mobile devices, thermal management rather than peak compute is the primary constraint.

02 Dedicated hardware shows distinct bottlenecks, such as power limits and memory bandwidth.

03 Energy efficiency varies greatly across platforms, with some hardware achieving near-zero variance at low throughput.

Abstract

Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is…
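The measurement protocol described in the abstract (one fixed prompt, repeated back-to-back so that thermal effects accumulate) can be illustrated with a minimal sketch. This is not the authors' harness. The names `run_sustained`, `summarize`, and the `generate` callable are hypothetical, and the sketch assumes `generate` runs one full inference and returns the number of tokens it produced.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_sustained(
    generate: Callable[[], int],
    iterations: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run `generate` back-to-back and return tokens/s for each iteration.

    `generate` is a hypothetical callable that performs one inference on the
    fixed prompt and returns the number of tokens generated. No cooldown is
    inserted between iterations, so sustained-load effects show up as a
    declining rate.
    """
    rates = []
    for _ in range(iterations):
        start = clock()
        n_tokens = generate()
        elapsed = clock() - start
        if elapsed <= 0:
            raise ValueError("non-positive elapsed time; clock resolution too coarse")
        rates.append(n_tokens / elapsed)
    return rates


def summarize(rates: List[float]) -> Dict[str, float]:
    """Summarise sustained throughput: mean, variability, and worst-case drop."""
    mean = statistics.fmean(rates)
    return {
        "mean_tps": mean,
        # Coefficient of variation: low values mean stable throughput.
        "cv": statistics.pstdev(rates) / mean,
        # Fraction of first-iteration throughput lost at the slowest iteration.
        "drop_vs_first": 1.0 - min(rates) / rates[0],
    }
```

To use it, wrap the inference runtime of choice in a zero-argument function and pass it as `generate`. A `drop_vs_first` close to 0.5 would correspond to losing about half the initial throughput, and a `cv` near zero to stable throughput across iterations. Power and temperature are not covered here, since reading them depends on platform-specific tooling.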

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications