Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference

Toby Simonds

arXiv:2502.06833 · cs.LG · February 12, 2025

Open Access

TL;DR

Entropy Adaptive Decoding (EAD) switches between a smaller and a larger language model during generation based on prediction uncertainty, reducing computational cost while retaining most of the larger model's performance across several model sizes.

Contribution

This paper introduces EAD, a novel method that adaptively switches models during inference based on entropy, enabling efficient computation with controlled output divergence.

Findings

01. Retains 96.7% of the 11B model's performance on MATH (50.4% vs 52.1%) while using the 11B model for only 43% of tokens.

02. Reduces computational cost by 41.5% with LLaMA models.

03. Maintains high performance with controlled output divergence across model pairs.

Abstract

We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference that dynamically switches between different-sized models based on prediction uncertainty. By monitoring rolling entropy in model logit distributions, our method identifies text regions where a smaller model suffices and switches to a larger model only when prediction uncertainty exceeds a threshold. Unlike speculative decoding approaches that maintain perfect output fidelity through verification, EAD accepts controlled output divergence in exchange for computational efficiency. Our experiments on the MATH benchmark demonstrate remarkable efficiency gains across different model families. Using the LLaMA family, we maintain 96.7% of the 11B model's performance (50.4% vs 52.1%) while using it for only 43% of tokens, decreasing computational cost by 41.5%. These gains become more…
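The switching rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `shannon_entropy` and `route_by_entropy`, the window size, the threshold value, and the choice of a simple mean over the window are all assumptions made for illustration. The sketch takes a sequence of per-token logit vectors, tracks the rolling mean entropy of their softmax distributions, and marks each position as one where the larger model would be used (rolling entropy above the threshold) or where the smaller model would suffice.

```python
import math
from collections import deque


def shannon_entropy(logits):
    """Entropy (in nats) of the softmax distribution over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_by_entropy(logit_stream, threshold, window=3):
    """Label each step "large" or "small" from rolling mean entropy.

    Hypothetical helper: `threshold` and `window` are illustrative
    parameters, not values taken from the paper.
    """
    recent = deque(maxlen=window)  # entropies of the last `window` steps
    choices = []
    for logits in logit_stream:
        recent.append(shannon_entropy(logits))
        rolling = sum(recent) / len(recent)
        choices.append("large" if rolling > threshold else "small")
    return choices


# Toy stream: confident (peaked) steps surround two uncertain (flat) steps.
peaked = [10.0, 0.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0, 0.0]
stream = [peaked, peaked, flat, flat, peaked, peaked, peaked]
choices = route_by_entropy(stream, threshold=0.5, window=2)
large_fraction = choices.count("large") / len(choices)
```

In this toy stream the rolling average keeps the larger model selected for one step after the uncertain region ends, which is the smoothing effect a rolling window is meant to provide. In a real decoder the logits at each step would come from whichever model is currently active, so the routing decision and generation would be interleaved rather than computed over a fixed stream.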

Taxonomy

Topics: Neural Networks and Applications

Methods: LLaMA