TL;DR
This paper presents a lightweight, entropy-guided framework for accurately predicting the output length of large language models, reducing computational waste and improving inference efficiency.
Contribution
It introduces a novel entropy-guided token pooling method and a dynamic length prediction approach that leverage internal model states, outperforming existing static prediction methods.
Findings
Entropy-Guided Token Pooling (EGTP) achieves 29.16% lower mean absolute error (MAE) than baselines.
The framework significantly improves end-to-end throughput in LLM inference.
ForeLen benchmark covers long-sequence, Chain-of-Thought, and RL data.
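The throughput finding rests on cutting padding in batched inference. As a rough illustration only, the toy calculation below compares padding waste when requests are batched in arrival order versus grouped by output length. The lengths are made-up values, the helper names `padded_tokens` and `padding_waste` are hypothetical, and sorting by true length stands in for a perfect length predictor; none of this is taken from the paper.

```python
def padded_tokens(lengths, batch_size):
    """Total token slots computed when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


def padding_waste(lengths, batch_size):
    """Fraction of computed token slots that are padding rather than real tokens."""
    return 1 - sum(lengths) / padded_tokens(lengths, batch_size)


# Hypothetical long-tailed output lengths (illustrative, not from ForeLen).
lengths = [12, 900, 15, 20, 850, 18, 25, 1000]

arrival_waste = padding_waste(lengths, batch_size=4)
# Grouping by length; an oracle here, a length predictor in practice.
grouped_waste = padding_waste(sorted(lengths), batch_size=4)

print(f"arrival order: {arrival_waste:.1%} padding")
print(f"length-grouped: {grouped_waste:.1%} padding")
```

With an imperfect predictor the grouping would be noisier, so the gap between the two figures would shrink as prediction error grows.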
Abstract
The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but they incur high overhead, generalize poorly, and fail in stochastic "one-to-many" sampling scenarios. We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction with negligible cost, and 2) Progressive Length Prediction (PLP), which dynamically estimates the remaining length at each decoding step to handle stochastic generation. To validate our approach, we build and release ForeLen, a…
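The abstract describes EGTP only at a high level, so the following is a minimal sketch of one way entropy-weighted pooling over hidden states could look, not the authors' implementation. It assumes that per-token weights are a softmax over token entropies, and the function names, the `temperature` parameter, and the array shapes are all hypothetical. The pooled vector would then feed a small regression head that predicts length, which is omitted here.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy of the next-token distribution at each position.

    logits: array of shape (seq_len, vocab_size).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)


def entropy_guided_pool(hidden, logits, temperature=1.0):
    """Pool hidden states into one vector, weighting tokens by their entropy.

    Assumption: weights are softmax(entropy / temperature); the paper's exact
    weighting scheme is not given in the abstract.
    hidden: (seq_len, hidden_dim), logits: (seq_len, vocab_size).
    """
    ent = token_entropy(logits)
    weights = np.exp((ent - ent.max()) / temperature)
    weights /= weights.sum()
    return weights @ hidden, weights


# Random stand-ins for a model's activations and logits.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8))
logits = rng.normal(size=(6, 32))

pooled, weights = entropy_guided_pool(hidden, logits)
print(pooled.shape, weights.round(3))
```

Because the hidden states and logits are already produced by the main model's forward pass, a pooling step of this kind adds only a weighted sum per request, which is consistent with the abstract's claim of negligible cost.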