Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding

Jihoon Park, Seungeun Oh, and Seong-Lyun Kim

arXiv:2508.12590 · cs.LG · August 19, 2025

TL;DR

This paper introduces an energy-efficient hybrid language model (HLM) inference method that uploads only the tokens flagged as informative by their uncertainty and importance, cutting energy consumption by 40.7% compared to standard HLM and lowering communication costs in resource-limited settings.

Contribution

It presents a novel token filtering mechanism leveraging uncertainty and importance to optimize hybrid LLM inference for energy and communication efficiency.

Findings

01 Achieves up to 87.5% BERT Score

02 Reduces energy consumption by 40.7% compared to standard HLM

03 Reaches a token throughput of 0.37 tokens/sec

Abstract

To address the growing demand for on-device LLM inference in resource-constrained environments, hybrid language models (HLMs) have emerged, combining lightweight local models with powerful cloud-based LLMs. Recent studies on HLMs have primarily focused on improving accuracy and latency, while often overlooking communication and energy efficiency. We propose a token-level filtering mechanism for energy-efficient, importance- and uncertainty-aware HLM inference that leverages both epistemic uncertainty and attention-based importance. Our method opportunistically uploads only informative tokens, reducing LLM usage and communication costs. Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves up to 87.5% BERT Score and a token throughput of 0.37 tokens/sec while reducing energy consumption by 40.7% compared to standard HLM. Furthermore, compared to our previous…
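The token-level filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that epistemic uncertainty is estimated as the mutual information across several sampled predictive distributions from the local model, that importance is a single attention-derived score per token, and that a token is uploaded only when both scores exceed their thresholds. The function names and the thresholds `u_th` and `i_th` are hypothetical.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def epistemic_uncertainty(samples: Sequence[Sequence[float]]) -> float:
    """Mutual-information estimate: H(mean of samples) - mean of H(sample).

    `samples` holds several next-token distributions for the same draft
    token (how they are obtained is an assumption of this sketch).
    """
    n = len(samples)
    vocab = len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(vocab)]
    return entropy(mean) - sum(entropy(s) for s in samples) / n


def should_upload(
    samples: Sequence[Sequence[float]],
    importance: float,
    u_th: float,
    i_th: float,
) -> bool:
    """Assumed rule: upload only tokens that are both uncertain and important."""
    return epistemic_uncertainty(samples) > u_th and importance > i_th


# Two sampled distributions that disagree strongly on a 2-word vocabulary.
disagreeing = [[0.9, 0.1], [0.1, 0.9]]
# Two sampled distributions that agree exactly.
agreeing = [[0.5, 0.5], [0.5, 0.5]]

upload_uncertain_important = should_upload(disagreeing, importance=0.5, u_th=0.1, i_th=0.2)
upload_uncertain_unimportant = should_upload(disagreeing, importance=0.05, u_th=0.1, i_th=0.2)
upload_certain_important = should_upload(agreeing, importance=0.5, u_th=0.1, i_th=0.2)
```

Under these assumptions, only the first token would be sent to the cloud LLM; the other two would be accepted locally, which is where the savings in communication and LLM usage would come from.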

Taxonomy

Topics: Cooperative Communication and Network Coding · Advanced MIMO Systems Optimization · Wireless Communication Security Techniques