AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth

Shixiang Song; He Li; Zitong Wang; Boyi Zeng; Feichen Song; Yixuan Wang; Zhiqin John Xu; Ziwei He; Zhouhan Lin

arXiv:2603.01914 · cs.CL · March 12, 2026

Open Access

TL;DR

AdaPonderLM adds token-wise adaptive depth to recurrent language models: a learned gate decides when each token stops iterating, so easy tokens use less inference compute while perplexity is maintained.

Contribution

The paper presents a self-supervised recurrent language model that learns token-wise early exiting during pretraining and reuses cached key/value states for halted tokens, making inference both cheaper and adaptive to each token.

Findings

01. Reduces inference compute by about 10% while maintaining perplexity.

02. Learns gates that allocate more computation to harder tokens.

03. Outperforms fixed pruning under equal-FLOPs conditions.
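
To make the compute-saving figure concrete, here is a minimal Python sketch of how a saving could be tallied from per-token depths. The function name `iteration_saving`, the example depths, and the choice to count only recurrent iterations (ignoring embeddings, the output head, and any gate overhead) are assumptions for illustration; the paper's own accounting may differ.

```python
def iteration_saving(depths, max_iters):
    """Fraction of recurrent iterations skipped relative to running
    every token for max_iters iterations (hypothetical accounting)."""
    total = max_iters * len(depths)   # cost of a fixed-depth model
    saved = total - sum(depths)       # iterations skipped by early exit
    return saved / total


# Made-up depths for ten tokens with at most 4 iterations each.
example_depths = [4, 4, 3, 4, 4, 4, 4, 3, 4, 2]
saving = iteration_saving(example_depths, max_iters=4)  # 4 of 40 iterations skipped
```

In this toy case most tokens run the full depth and a few exit early, which is the kind of distribution that yields a saving on the order of 10%.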

Abstract

Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train-test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM…
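
The monotonic halting mask described in the abstract can be illustrated with a small Python sketch. Everything named here is an assumption for illustration rather than a detail from the paper: the functions `monotonic_halting_mask`, `depth_per_token`, and `ponder`, the `continue_probs` input standing in for gate outputs, the 0.5 threshold, and the rule that every token runs the first iteration. The toy `ponder` loop simply carries a halted token's state forward, as a loose stand-in for reusing cached key/value states.

```python
def monotonic_halting_mask(continue_probs, num_iters, threshold=0.5):
    """active[t][i] is True if token i computes iteration t.

    continue_probs[t - 1][i] is a hypothetical gate output deciding whether
    token i proceeds to iteration t. Once a token halts it never resumes.
    """
    num_tokens = len(continue_probs[0])
    active = [[True] * num_tokens]  # assumption: iteration 0 runs for all tokens
    for t in range(1, num_iters):
        prev, gate = active[-1], continue_probs[t - 1]
        active.append(
            [prev[i] and gate[i] >= threshold for i in range(num_tokens)]
        )
    return active


def depth_per_token(active):
    """Number of iterations each token actually computed."""
    return [sum(row[i] for row in active) for i in range(len(active[0]))]


def ponder(states, step_fn, active):
    """Apply step_fn only to active tokens; halted tokens keep their state."""
    for t, row in enumerate(active):
        states = [step_fn(s, t) if on else s for s, on in zip(states, row)]
    return states


# Three tokens, three iterations. Token 1 halts after the first iteration and
# stays halted even though its later gate value (0.9) is above the threshold.
probs = [[0.9, 0.2, 0.8],
         [0.7, 0.9, 0.1]]
mask = monotonic_halting_mask(probs, num_iters=3)
depths = depth_per_token(mask)                       # [3, 1, 2]
final = ponder([0, 0, 0], lambda s, t: s + 1, mask)  # [3, 1, 2]
```

The conjunction with the previous row is what makes the mask monotonic: a token's later gate values are ignored once it has stopped, so its frozen state can be reused by later iterations without recomputation.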

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Natural Language Processing Techniques