Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
Mayank Arya, Yogesh Simmhan

TL;DR
This paper evaluates the performance and power consumption of large language model inference on NVIDIA Jetson Orin AGX edge accelerators, analyzing trade-offs in latency, throughput, and energy use for various models and configurations.
Contribution
It provides a comprehensive analysis of LLM inference on edge hardware, exploring the effects of batch size, sequence length, and quantization on performance and energy efficiency.
Findings
Increasing sequence length reduces token throughput.
Quantization can slow down smaller LLMs.
Trade-offs exist among latency, throughput, and energy use across models and configurations.
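The metrics behind these findings can be related with simple arithmetic. The sketch below is illustrative only: the `InferenceRun` fields, the helper names, and the sample numbers are hypothetical and are not taken from the paper. It assumes throughput is counted as generated tokens per second over the whole batch and that energy is average power multiplied by wall-clock latency.

```python
from dataclasses import dataclass


@dataclass
class InferenceRun:
    """One measured inference run. All field names are hypothetical."""
    batch_size: int      # prompts processed together
    output_tokens: int   # tokens generated per prompt
    latency_s: float     # wall-clock time for the whole batch, in seconds
    avg_power_w: float   # mean board power during the run, in watts


def throughput_tokens_per_s(run: InferenceRun) -> float:
    """Generated tokens per second across the whole batch."""
    return run.batch_size * run.output_tokens / run.latency_s


def energy_joules(run: InferenceRun) -> float:
    """Total energy for the run: average power times duration."""
    return run.avg_power_w * run.latency_s


def energy_per_token_j(run: InferenceRun) -> float:
    """Energy cost of each generated token, in joules."""
    return energy_joules(run) / (run.batch_size * run.output_tokens)


# Made-up numbers purely to exercise the formulas (not measurements).
sample = InferenceRun(batch_size=4, output_tokens=100, latency_s=20.0, avg_power_w=30.0)
print(throughput_tokens_per_s(sample))  # tokens/s
print(energy_joules(sample))            # J
print(energy_per_token_j(sample))       # J/token
```

Comparing these three numbers across batch sizes, sequence lengths, and quantization levels is one way to make the latency, throughput, and energy trade-offs concrete.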
Abstract
Large Language Models (LLMs) have demonstrated exceptional benefits across a wide range of domains, for tasks as diverse as code generation and robot navigation. While LLMs are usually served from cloud data centers, mission-critical and privacy-sensitive applications may require local hosting of open LLM models. Given the large GPU memory footprint needed for LLMs, edge accelerators such as the NVIDIA Jetson Orin AGX with 64GB of shared GPU-CPU RAM are a compelling choice. However, the feasibility and performance of LLM inference on edge accelerators are under-explored. This study presents a detailed evaluation of LLM inference on the NVIDIA Jetson Orin AGX, on four SOTA models ranging from 2.7B to 32.8B parameters, such as Meta Llama3.1, Microsoft-Phi2, and Deepseek-R1-Qwen. We investigate the impact of varying batch sizes, sequence lengths, and quantization levels on latency, throughput, and…
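A rough back-of-the-envelope estimate shows why memory footprint and quantization matter on a 64GB device. The sketch below is an assumption-laden illustration, not the paper's method: it counts model weights only (ignoring the KV cache, activations, and the OS share of the shared RAM), treats 1 GB as 10^9 bytes, and uses FP16, INT8, and INT4 as example precisions, which may not match the quantization levels the authors evaluated.

```python
# Bytes needed to store one parameter at each example precision (assumed levels).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

DEVICE_RAM_GB = 64.0  # shared GPU-CPU RAM stated in the abstract


def weight_footprint_gb(params_billion: float, precision: str) -> float:
    """Weights-only footprint in GB: billions of parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_in_memory(params_billion: float, precision: str) -> bool:
    """True if the weights alone are smaller than the device RAM (optimistic bound)."""
    return weight_footprint_gb(params_billion, precision) < DEVICE_RAM_GB


# The smallest and largest model sizes mentioned in the abstract.
for size in (2.7, 32.8):
    for precision in BYTES_PER_PARAM:
        gb = weight_footprint_gb(size, precision)
        print(f"{size}B @ {precision}: {gb:.1f} GB, fits={fits_in_memory(size, precision)}")
```

Under these assumptions, a 32.8B-parameter model needs about 65.6 GB for FP16 weights alone, which exceeds 64GB, while an 8-bit version needs about 32.8 GB. Real headroom is smaller, since the same RAM also holds the KV cache and is shared with the CPU.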
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling
