Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding
Jihoon Park, Seungeun Oh, and Seong-Lyun Kim

TL;DR
This paper introduces an energy-efficient hybrid language model inference method that selectively uploads tokens to the cloud LLM based on uncertainty and importance, reducing energy consumption by 40.7% compared to standard HLM and lowering communication costs in resource-limited settings.
Contribution
It presents a novel token filtering mechanism leveraging uncertainty and importance to optimize hybrid LLM inference for energy and communication efficiency.
Findings
Achieves up to 87.5% BERT Score
Reduces energy consumption by 40.7% compared to standard HLM
Achieves token throughput of 0.37 tokens/sec
Abstract
To address the growing demand for on-device LLM inference in resource-constrained environments, hybrid language models (HLMs) have emerged, combining lightweight local models with powerful cloud-based LLMs. Recent studies on HLMs have primarily focused on improving accuracy and latency, while often overlooking communication and energy efficiency. We propose a token-level filtering mechanism for energy-efficient, importance- and uncertainty-aware HLM inference that leverages both epistemic uncertainty and attention-based importance. Our method opportunistically uploads only informative tokens, reducing LLM usage and communication costs. Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves up to 87.5% BERT Score and a token throughput of 0.37 tokens/sec while reducing energy consumption by 40.7% compared to standard HLM. Furthermore, compared to our previous…
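The token-level filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not give the exact decision rule, so the sketch assumes each draft token already carries an uncertainty score and an importance score in [0, 1], and that a token is uploaded to the cloud LLM only when both exceed a threshold. The names `DraftToken`, `should_upload`, `filter_tokens`, and both threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DraftToken:
    """A token drafted by the small on-device model (hypothetical structure)."""
    text: str
    uncertainty: float  # assumed epistemic-uncertainty score in [0, 1]
    importance: float   # assumed attention-based importance score in [0, 1]


def should_upload(token: DraftToken,
                  u_thresh: float = 0.5,
                  i_thresh: float = 0.5) -> bool:
    """Assumed rule: upload only tokens that are both uncertain and important."""
    return token.uncertainty > u_thresh and token.importance > i_thresh


def filter_tokens(tokens: List[DraftToken],
                  u_thresh: float = 0.5,
                  i_thresh: float = 0.5) -> Tuple[List[DraftToken], List[DraftToken]]:
    """Split draft tokens into (uploaded for cloud verification, kept locally)."""
    uploaded, kept = [], []
    for tok in tokens:
        if should_upload(tok, u_thresh, i_thresh):
            uploaded.append(tok)
        else:
            kept.append(tok)
    return uploaded, kept


if __name__ == "__main__":
    draft = [
        DraftToken("the", uncertainty=0.9, importance=0.1),      # uncertain, unimportant
        DraftToken("Paris", uncertainty=0.8, importance=0.9),    # uncertain and important
        DraftToken("is", uncertainty=0.1, importance=0.2),       # confident, unimportant
        DraftToken("capital", uncertainty=0.2, importance=0.95), # confident, important
    ]
    uploaded, kept = filter_tokens(draft)
    print("uploaded:", [t.text for t in uploaded])
    print("upload ratio:", len(uploaded) / len(draft))
```

Under this assumed rule, every token that stays local skips one uplink transmission and one cloud LLM verification, which is where communication and energy savings would come from. Raising either threshold trades fewer uploads against less cloud-side checking of the draft output.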
Taxonomy
Topics: Cooperative Communication and Network Coding · Advanced MIMO Systems Optimization · Wireless Communication Security Techniques
