Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models

Seungeun Oh; Jinhyuk Kim; Jihong Park; Seung-Woo Ko; Tony Q. S. Quek; Seong-Lyun Kim

arXiv:2412.12687 · cs.LG · March 19, 2025

Open Access

TL;DR

This paper introduces an uncertainty-aware hybrid inference framework that combines small on-device models with remote large models, significantly reducing communication and computation costs while maintaining high accuracy.

Contribution

The paper proposes U-HLM, a novel structure that uses uncertainty measurement to skip unnecessary LLM inferences, improving efficiency in hybrid language models.

Findings

01

U-HLM reduces uplink transmissions and LLM computation by 45.93%.

02

U-HLM achieves up to 97.54% of the LLM's accuracy.

03

U-HLM doubles token throughput compared with the non-skipping HLM.

Abstract

This paper studies a hybrid language model (HLM) architecture that integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network. The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects it, with rejected tokens being resampled by the LLM. While this approach ensures alignment between the vocabulary distributions of the SLM and LLM, it suffers from low token throughput due to uplink transmission and the computation costs of running both language models. To address this, we propose a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM), wherein the SLM locally measures its output uncertainty and skips both uplink transmissions and LLM operations for tokens that are…
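The token generation loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The accept/reject rule is the standard speculative sampling rule (accept the draft token with probability min(1, y/x), otherwise resample from the normalized positive part of y - x, where x and y are the SLM and LLM probabilities). The uncertainty measure (one minus the SLM's top probability), the `threshold` parameter, and the names `uncertainty`, `hybrid_step`, and `llm_probs_fn` are all hypothetical stand-ins chosen for this example; the abstract as shown here does not specify how uncertainty is measured.

```python
import random


def uncertainty(slm_probs):
    # Hypothetical uncertainty measure: one minus the top probability.
    # An illustrative assumption, not necessarily the paper's measure.
    return 1.0 - max(slm_probs)


def sample(weights, rng):
    # Draw one token index in proportion to the given (unnormalized) weights.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def hybrid_step(slm_probs, llm_probs_fn, threshold, rng):
    """Generate one token; return (token, status).

    llm_probs_fn stands in for the uplink transmission plus the LLM forward
    pass, so it is called only when the token is not skipped.
    """
    draft = sample(slm_probs, rng)

    # Opportunistic skipping: a confident SLM keeps its draft token locally.
    if uncertainty(slm_probs) <= threshold:
        return draft, "skipped"

    # Otherwise the SLM distribution is uploaded and the LLM verifies it.
    llm_probs = llm_probs_fn()
    accept_prob = min(1.0, llm_probs[draft] / slm_probs[draft])
    if rng.random() < accept_prob:
        return draft, "accepted"

    # Rejected: the LLM resamples from the positive part of (LLM - SLM).
    residual = [max(0.0, q - p) for p, q in zip(slm_probs, llm_probs)]
    return sample(residual, rng), "resampled"


if __name__ == "__main__":
    rng = random.Random(0)
    slm = [0.6, 0.3, 0.1]
    llm = [0.2, 0.7, 0.1]
    for threshold in (0.1, 0.5):
        print(threshold, hybrid_step(slm, lambda: llm, threshold, rng))
```

With the assumed measure, a larger `threshold` skips more tokens and so saves more uplink and LLM computation, at the cost of keeping more SLM tokens that the LLM might have rejected.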

Taxonomy

Topics: Topic Modeling

Methods: Balanced Selection