Dynamic Vocabulary Pruning in Early-Exit LLMs
Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec

TL;DR
This paper introduces a method for dynamically pruning the vocabulary at inference time in early-exit large language models, making the confidence estimation behind exit decisions cheaper while maintaining competitive performance.
Contribution
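The paper proposes post-hoc dynamic vocabulary pruning for early-exit LLMs: for each token, the vocabulary is pruned at one of the initial layers, and the smaller vocabulary is reused for the rest of the forward pass so that confidence estimation at later layers is cheaper.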
Findings
Improved inference efficiency in early-exit LLMs
Maintained competitive performance with vocabulary pruning
Reduced computational cost of confidence estimation
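As a rough back-of-the-envelope illustration of the last finding: assume the confidence estimate at each candidate exit layer projects a hidden state of width d onto the vocabulary, which costs on the order of d·|V| multiply-adds for the full vocabulary and d·k for a pruned vocabulary of k tokens. The symbols d and k and the numeric values below are hypothetical and are not taken from the paper. The full projection is still needed once, at the layer where pruning happens.

```latex
\[
\frac{\text{cost}_{\text{full}}}{\text{cost}_{\text{pruned}}}
  = \frac{d\,|V|}{d\,k}
  = \frac{|V|}{k},
\qquad
\text{e.g. } |V| = 32{,}000,\; k = 64
\;\Rightarrow\; \frac{32{,}000}{64} = 500 .
\]
```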
Abstract
Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of LLM inference by enabling next token prediction at intermediate layers. Yet, the large vocabulary size in modern LLMs makes the confidence estimation required for exit decisions computationally expensive, diminishing the efficiency gains. To address this, we propose dynamically pruning the vocabulary at test time for each token. Specifically, the vocabulary is pruned at one of the initial layers, and the smaller vocabulary is then used throughout the rest of the forward pass. Our experiments demonstrate that such post-hoc dynamic vocabulary pruning improves the efficiency of confidence estimation in early-exit LLMs while maintaining competitive…
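The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `early_exit_with_pruning`, the top-k pruning rule, the use of maximum softmax probability as the confidence score, and the parameters `prune_layer`, `k`, and `threshold` are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_with_pruning(hidden_states, W_unembed, prune_layer=0, k=3, threshold=0.9):
    """Return (token_id, exit_layer) for one token position.

    hidden_states: list of (d,) arrays, one per layer (hypothetical input).
    W_unembed:     (V, d) unembedding matrix shared by all exits (assumption).
    """
    kept_ids, W_small = None, None
    last = len(hidden_states) - 1
    for layer, h in enumerate(hidden_states):
        if layer < prune_layer:
            continue  # assumption: no exit check before the pruning layer
        if kept_ids is None:
            # Pruning layer: one full-vocabulary projection, keep the top-k tokens.
            full_logits = W_unembed @ h
            kept_ids = np.argpartition(full_logits, -k)[-k:]
            W_small = W_unembed[kept_ids]      # (k, d) pruned unembedding
            logits = full_logits[kept_ids]
        else:
            # Later layers: project onto the k retained tokens only.
            logits = W_small @ h
        probs = softmax(logits)
        best = int(np.argmax(probs))
        # Exit when confident enough, or at the final layer as a fallback.
        if probs[best] >= threshold or layer == last:
            return int(kept_ids[best]), layer

# Toy example: 10 tokens, identity unembedding, 3 layers.
W = np.eye(10)
h0 = np.zeros(10); h0[7], h0[2], h0[5] = 1.0, 0.9, 0.8   # top-3 = {7, 2, 5}
h1 = np.zeros(10); h1[7], h1[3] = 10.0, 20.0             # token 3 was pruned away
h2 = np.zeros(10); h2[7] = 5.0
token, exit_layer = early_exit_with_pruning([h0, h1, h2], W, k=3, threshold=0.9)
print(token, exit_layer)
```

In the toy example, token 3 has the largest logit at the second layer but cannot be predicted because it was dropped at the pruning layer. This shows the trade-off the approach accepts: cheaper confidence checks in exchange for committing to a candidate set early.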
Taxonomy
Topics: Natural Language Processing Techniques · Second Language Acquisition and Learning · Text Readability and Simplification
Methods: Pruning
