AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin

TL;DR
AdaPonderLM introduces token-wise adaptive depth in language models, enabling efficient inference by dynamically halting computation per token, which improves resource allocation without sacrificing performance.
Contribution
It presents a self-supervised recurrent language model with learned token-wise early exiting and KV reuse, enhancing inference efficiency and adaptivity.
Findings
Reduces inference compute by about 10% with maintained perplexity.
Learns gates that allocate more computation to harder tokens.
Outperforms fixed pruning under equal FLOPs conditions.
Abstract
Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time(ACT) and Early Exit(EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train--test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Machine Learning in Materials Science · Natural Language Processing Techniques
