Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
Harri Renney, Fouad Trad, Michael Mattarock, Zena Wood

TL;DR
This paper introduces a multi-dimensional benchmarking methodology for evaluating LLM inference on hardware-accelerated single-board computers, addressing deployment challenges in privacy-sensitive and resource-limited environments.
Contribution
It presents a comprehensive evaluation framework that assesses inference performance and hardware efficiency across various edge platforms with accelerators.
Findings
Hardware accelerators like NPUs and GPUs improve inference speed.
Multi-dimensional metrics reveal trade-offs between power, size, and throughput.
The paper provides guidance for deploying LLMs in constrained environments.
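The trade-off between power and throughput noted above can be made concrete with a small sketch. The metric names (`tokens_per_second`, `joules_per_token`, `tokens_per_second_per_watt`), the `RunMeasurement` type, and the numbers below are illustrative assumptions, not the metrics or results reported in the paper; they show only how several dimensions might be derived from one measured inference run.

```python
from dataclasses import dataclass


@dataclass
class RunMeasurement:
    """One inference run on one board (hypothetical record layout)."""
    platform: str
    tokens_generated: int
    wall_time_s: float   # end-to-end generation time in seconds
    mean_power_w: float  # average board power draw during the run, in watts

    @property
    def tokens_per_second(self) -> float:
        # Raw throughput.
        return self.tokens_generated / self.wall_time_s

    @property
    def joules_per_token(self) -> float:
        # Energy cost: power (W) x time (s) = energy (J), divided by tokens.
        return self.mean_power_w * self.wall_time_s / self.tokens_generated

    @property
    def tokens_per_second_per_watt(self) -> float:
        # Throughput normalised by power draw.
        return self.tokens_per_second / self.mean_power_w


# Placeholder values chosen for easy arithmetic, NOT measurements from the paper.
cpu_only = RunMeasurement("board_a_cpu", tokens_generated=200,
                          wall_time_s=40.0, mean_power_w=5.0)
accelerated = RunMeasurement("board_b_npu", tokens_generated=200,
                             wall_time_s=10.0, mean_power_w=10.0)

for run in (cpu_only, accelerated):
    print(f"{run.platform}: {run.tokens_per_second:.1f} tok/s, "
          f"{run.joules_per_token:.2f} J/token, "
          f"{run.tokens_per_second_per_watt:.2f} tok/s/W")
```

In this made-up example the accelerated board draws twice the power but finishes four times faster, so it uses less energy per token. Raw throughput or power draw alone would not reveal that.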
Abstract
Large language models (LLMs) are becoming increasingly capable at small parameter scales. At the same time, conventional cloud-centric deployment introduces challenges around data privacy, latency, and cost that are acute in operational technology and defence environments. Advances in model distillation, quantisation, and affordable edge accelerators now make local LLM inference on single-board computers feasible, but the high dimensionality of the configuration space makes identifying optimal deployments difficult without structured evaluation. Existing LLM-specific edge benchmarking efforts rely on CPU-only inference, offer poor coverage of genuine single-board computers, and use generic evaluation tasks that lack multi-dimensional assessment of hardware effectiveness. This paper proposes a multi-dimensional benchmarking methodology that jointly evaluates inference performance and hardware…
