Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference
Toby Simonds

TL;DR
Entropy Adaptive Decoding (EAD) switches between a smaller and a larger language model during generation based on prediction uncertainty, reducing computational cost while retaining most of the larger model's performance across various model sizes.
Contribution
This paper introduces EAD, a novel method that adaptively switches models during inference based on entropy, enabling efficient computation with controlled output divergence.
Findings
Achieves 96.7% of the 11B model's performance on the MATH benchmark while using the 11B model for only 43% of tokens.
Reduces computational cost by 41.5% with LLaMA models.
Maintains high performance with minimal divergence across model pairs.
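The 96.7% figure can be checked against the two MATH scores quoted in the abstract (50.4% for EAD, 52.1% for the 11B model alone). A minimal sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Scores as quoted in the abstract; the variable names are illustrative only.
ead_score = 50.4      # MATH accuracy (%) with EAD switching
large_score = 52.1    # MATH accuracy (%) with the 11B model alone

# Fraction of the large model's performance that EAD retains, in percent.
retained_pct = round(ead_score / large_score * 100, 1)
print(retained_pct)
```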
Abstract
We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference that dynamically switches between different-sized models based on prediction uncertainty. By monitoring rolling entropy in model logit distributions, our method identifies text regions where a smaller model suffices and switches to a larger model only when prediction uncertainty exceeds a threshold. Unlike speculative decoding approaches that maintain perfect output fidelity through verification, EAD accepts controlled output divergence in exchange for computational efficiency. Our experiments on the MATH benchmark demonstrate remarkable efficiency gains across different model families. Using the LLaMA family, we maintain 96.7% of the 11B model's performance (50.4% vs 52.1%) while using it for only 43% of tokens, decreasing computational cost by 41.5%. These gains become more…
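The switching rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the window size, the threshold value, the use of a rolling mean, and the function names (`softmax`, `entropy`, `choose_models`) are all assumptions made here for illustration. The sketch only shows how a rolling entropy over per-token logits could drive a small/large decision.

```python
import math
from collections import deque


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by the logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def choose_models(logit_stream, threshold=1.0, window=2):
    """For each step, pick 'large' when the rolling mean entropy exceeds
    the threshold, else 'small'. threshold and window are hypothetical
    values, not taken from the paper."""
    recent = deque(maxlen=window)  # keeps only the last `window` entropies
    choices = []
    for logits in logit_stream:
        recent.append(entropy(logits))
        rolling = sum(recent) / len(recent)
        choices.append("large" if rolling > threshold else "small")
    return choices


# Toy logits: a peaked distribution (low entropy) and a flat one (high entropy).
confident = [10.0, 0.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0, 0.0]
stream = [confident, confident, uncertain, uncertain, uncertain, confident]
choices = choose_models(stream)
print(choices)
```

In this toy setup the averaging window delays the switch by one step in each direction, which is one way to avoid flipping models on a single noisy token. A real system would also need to decide which model's logits are monitored and how state is shared between the two models, and the text above does not specify either.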
Taxonomy
Topics: Neural Networks and Applications
Methods: LLaMA
