PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
He Li, Feichen Song, Boyi Zeng, Shixiang Song, Zhiqin John Xu, Ziwei He, Zhouhan Lin

TL;DR
PonderLM-3 introduces a self-supervised, token-wise adaptive pondering framework that learns to allocate extra inference computation selectively per token, improving efficiency and performance over fixed computation models.
Contribution
It presents a novel differentiable masking approach for token-wise adaptive pondering, enabling learnable, selective computation allocation during inference.
Findings
Achieves lower perplexity at equal FLOPs compared to recursive baselines.
Attains comparable downstream performance with fewer inference FLOPs.
Provides an end-to-end train-inference consistent adaptive computation framework.
Abstract
Test-time scaling has shown that allocating more additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to selectively allocate additional computation under purely self-supervised objectives, built on top of the PonderLM-2 backbone. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
