Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends

Pablo Prieto; Pablo Abad

arXiv:2511.22334·cs.PF·December 10, 2025

TL;DR

This paper compares CPU, GPU, and NPU hardware for running small language models at the edge, highlighting that NPUs offer the best performance and energy efficiency for resource-constrained environments.

Contribution

The paper evaluates CPU, GPU, and NPU backends for SLM inference and reports that NPUs lead in both performance and energy efficiency.

Findings

01

NPUs outperform CPUs and GPUs in inference speed and energy efficiency.

02

Bandwidth normalization is crucial for fair cross-architecture comparison.

03

NPUs are the most suitable hardware for edge SLM deployment.
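
The findings above rest on two kinds of metric: throughput normalized by memory bandwidth, and energy efficiency. The sketch below shows one plausible way to compute them. It is not taken from the paper: the formulas (tokens per second divided by peak memory bandwidth, and tokens per second divided by average power) are assumptions, the `Measurement` type and function names are hypothetical, and the device figures are made-up placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One benchmark run on one backend (hypothetical record layout)."""
    name: str
    tokens_per_second: float       # raw decode throughput
    memory_bandwidth_gb_s: float   # peak memory bandwidth of the device
    average_power_w: float         # mean power draw during the run


def bandwidth_normalized_throughput(m: Measurement) -> float:
    """Tokens/s per GB/s of memory bandwidth (assumed normalization)."""
    return m.tokens_per_second / m.memory_bandwidth_gb_s


def tokens_per_joule(m: Measurement) -> float:
    """Energy efficiency: (tokens/s) / (J/s) = tokens per joule."""
    return m.tokens_per_second / m.average_power_w


# Placeholder numbers for illustration only; NOT results from the paper.
device_a = Measurement("device-a", tokens_per_second=20.0,
                       memory_bandwidth_gb_s=50.0, average_power_w=10.0)
device_b = Measurement("device-b", tokens_per_second=60.0,
                       memory_bandwidth_gb_s=300.0, average_power_w=60.0)

for m in (device_a, device_b):
    print(m.name,
          bandwidth_normalized_throughput(m),
          tokens_per_joule(m))
```

With these placeholder values, device-b has the higher raw throughput, but device-a extracts more tokens per unit of available bandwidth and per joule. This is the kind of reversal that normalization is meant to expose when the devices being compared have very different memory systems.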

Abstract

Edge computing processes data where it is generated, enabling faster decisions, lower bandwidth usage, and improved privacy. However, edge devices typically operate under strict constraints on processing power, memory, and energy consumption, making them unsuitable for large language models (LLMs). Fortunately, Small Language Models (SLMs) offer lightweight alternatives that bring AI inference to resource-constrained environments by significantly reducing computational cost while remaining suitable for specialization and customization. In this scenario, selecting the hardware platform that best balances performance and efficiency for SLM inference is challenging due to strict resource limitations. To address this issue, this study evaluates the inference performance and energy efficiency of commercial CPUs (Intel and ARM), GPUs (NVIDIA), and NPUs (RaiderChip) for running SLMs. GPUs, the…

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques