Understanding the Performance and Power of LLM Inferencing on Edge Accelerators

Mayank Arya; Yogesh Simmhan

arXiv:2506.09554 · cs.DC · June 13, 2025

Open Access

TL;DR

This paper evaluates the performance and power consumption of large language model inference on NVIDIA Jetson Orin AGX edge accelerators, analyzing trade-offs in latency, throughput, and energy use for various models and configurations.

Contribution

It provides a comprehensive analysis of LLM inference on edge hardware, exploring the effects of batch size, sequence length, and quantization on performance and energy efficiency.

Findings

01. Increasing sequence length reduces token throughput.

02. Quantization can slow down smaller LLMs.

03. Trade-offs exist between efficiency, speed, and resource use.
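
The quantities these findings trade off against each other can be made concrete with a short sketch. It is not the authors' measurement harness: the class name, field names, and sample values below are hypothetical placeholders, and the only things it assumes are the standard definitions of throughput (tokens generated per unit time) and energy (mean power multiplied by elapsed time).

```python
from dataclasses import dataclass


@dataclass
class RunMeasurement:
    """One hypothetical inference run; all field names are illustrative."""
    batch_size: int            # sequences processed together
    tokens_per_sequence: int   # tokens generated for each sequence
    latency_s: float           # wall-clock time for the whole batch, in seconds
    mean_power_w: float        # average power drawn during the run, in watts

    @property
    def total_tokens(self) -> int:
        return self.batch_size * self.tokens_per_sequence


def throughput_tokens_per_s(run: RunMeasurement) -> float:
    """Tokens generated per second across the whole batch."""
    return run.total_tokens / run.latency_s


def energy_joules(run: RunMeasurement) -> float:
    """Energy for the run: mean power multiplied by elapsed time."""
    return run.mean_power_w * run.latency_s


def energy_per_token_j(run: RunMeasurement) -> float:
    """Energy cost attributed to each generated token."""
    return energy_joules(run) / run.total_tokens


# Placeholder numbers chosen for easy arithmetic; they are NOT from the paper.
example = RunMeasurement(
    batch_size=4, tokens_per_sequence=128, latency_s=16.0, mean_power_w=30.0
)
print(throughput_tokens_per_s(example))  # 32.0 tokens/s
print(energy_joules(example))            # 480.0 J
print(energy_per_token_j(example))       # 0.9375 J/token
```

Keeping latency, throughput, and energy per token as separate functions makes it easy to see that a configuration can improve one metric while worsening another, which is the kind of trade-off the third finding describes.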

Abstract

Large Language Models (LLMs) have demonstrated exceptional benefits across a wide range of domains, for tasks as diverse as code generation and robot navigation. While LLMs are usually served from cloud data centers, mission-critical and privacy-sensitive applications may require local hosting of open LLMs. Given the large GPU memory footprint needed for LLMs, edge accelerators such as the NVIDIA Jetson Orin AGX, with 64GB of shared GPU-CPU RAM, are a compelling choice. However, the feasibility and performance of LLM inference on edge accelerators are under-explored. This study presents a detailed evaluation of LLM inference on the NVIDIA Jetson Orin AGX, on four SOTA models ranging from 2.7B to 32.8B parameters, including Meta Llama3.1, Microsoft-Phi2, and Deepseek-R1-Qwen. We investigate the impact of varying batch sizes, sequence lengths, and quantization levels on latency, throughput, and…

Taxonomy

Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling