EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs
Benjamin Kubwimana, Qijing Huang

TL;DR
EdgeReasoning provides a comprehensive analysis of deploying reasoning large language models on edge GPUs, balancing latency, accuracy, and resource constraints to guide optimal deployment strategies.
Contribution
The paper systematically characterizes latency-accuracy tradeoffs and evaluates techniques for optimizing reasoning LLM deployment on edge GPUs, addressing the scarcity of guidance on how to combine these design choices.
Findings
Mapped the Pareto frontier of accuracy and latency configurations.
Evaluated prompt and tuning techniques for token reduction.
Profiled test-time scaling methods for latency optimization.
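As an illustration of what mapping a Pareto frontier over accuracy and latency involves, here is a minimal Python sketch. It is not the authors' code: the configuration names and all latency and accuracy numbers are made-up placeholders, and the helper names (Config, pareto_frontier, best_under_budget) are hypothetical. A configuration is kept only if no other configuration is both at least as fast and at least as accurate.

```python
from typing import List, NamedTuple, Optional


class Config(NamedTuple):
    name: str          # hypothetical label for a deployment configuration
    latency_s: float   # end-to-end latency in seconds (placeholder value)
    accuracy: float    # task accuracy in [0, 1] (placeholder value)


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the non-dominated configs, ordered by increasing latency."""
    frontier = []
    best_acc = float("-inf")
    # Sweep from fastest to slowest; on equal latency, try the most accurate first.
    for c in sorted(configs, key=lambda c: (c.latency_s, -c.accuracy)):
        # A slower config earns a place only if it is strictly more accurate
        # than every faster (or equally fast) config seen so far.
        if c.accuracy > best_acc:
            frontier.append(c)
            best_acc = c.accuracy
    return frontier


def best_under_budget(configs: List[Config], budget_s: float) -> Optional[Config]:
    """Pick the most accurate frontier config that meets the latency budget."""
    feasible = [c for c in pareto_frontier(configs) if c.latency_s <= budget_s]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None


# Illustrative, invented measurements; not results from the paper.
CONFIGS = [
    Config("small-nonreasoning", 1.0, 0.40),
    Config("small-reasoning", 4.0, 0.55),
    Config("large-nonreasoning", 5.0, 0.50),
    Config("large-reasoning", 12.0, 0.70),
    Config("large-reasoning-long", 20.0, 0.68),
]

if __name__ == "__main__":
    for c in pareto_frontier(CONFIGS):
        print(c)
    print(best_under_budget(CONFIGS, budget_s=6.0))
```

With these placeholder numbers, a 6-second budget selects the small reasoning configuration, because the larger non-reasoning one is both slower and less accurate. Real measurements would of course change which configurations survive.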
Abstract
The edge intelligence paradigm is increasingly in demand for emerging autonomous systems such as robotics. Beyond ensuring privacy-preserving operation and resilience in connectivity-limited environments, edge deployment offers significant energy and cost advantages over cloud-based solutions. However, deploying large language models (LLMs) for reasoning tasks on edge GPUs faces critical challenges from strict latency constraints and limited computational resources. To navigate these constraints, developers must balance multiple design factors (choosing reasoning versus non-reasoning architectures, selecting appropriate model sizes, allocating token budgets, and applying test-time scaling strategies) to meet target latency and optimize accuracy. Yet guidance on optimal combinations of these variables remains scarce. In this work, we present EdgeReasoning, a comprehensive study…
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Ferroelectric and Negative Capacitance Devices
