PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking

He Li; Feichen Song; Boyi Zeng; Shixiang Song; Zhiqin John Xu; Ziwei He; Zhouhan Lin

arXiv:2603.02023·cs.CL·March 11, 2026

Open Access

TL;DR

PonderLM-3 introduces a self-supervised, token-wise adaptive pondering framework that learns to allocate extra inference computation selectively per token, improving efficiency and performance over fixed-computation models.

Contribution

The paper presents a differentiable masking approach for token-wise adaptive pondering, which makes the selective allocation of inference computation learnable.

Findings

01. Achieves lower perplexity at equal FLOPs compared to recursive baselines.

02. Attains comparable downstream performance with fewer inference FLOPs.

03. Provides an end-to-end train-inference consistent adaptive computation framework.

Abstract

Test-time scaling has shown that allocating additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? To address this question, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to selectively allocate additional computation under purely self-supervised objectives, built on top of the PonderLM-2 backbone. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared…
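The pairing of a differentiable attention mask at training time with a hard pruning rule at inference can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how the mask is parameterized, so the per-key `gates` in (0, 1], the log-gate bias on the attention logits, and the `threshold` of 0.5 are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_masked_weights(scores, gates, eps=1e-9):
    """Training-time sketch (assumed form): add log(gate) to each logit.

    This scales each key's unnormalized attention weight by its gate,
    so the result stays differentiable with respect to the gates.
    """
    biased = [s + math.log(g + eps) for s, g in zip(scores, gates)]
    return softmax(biased)


def hard_pruned_weights(scores, gates, threshold=0.5):
    """Inference-time sketch (assumed rule): drop keys whose gate is below threshold."""
    keep = [g >= threshold for g in gates]
    if not any(keep):
        raise ValueError("at least one key must survive pruning")
    kept = iter(softmax([s for s, k in zip(scores, keep) if k]))
    # Pruned keys get exactly zero weight; kept keys share the softmax mass.
    return [next(kept) if k else 0.0 for k in keep]


# Hypothetical attention logits and gates for three keys.
scores = [1.0, 0.5, 2.0]
gates = [1.0, 0.001, 0.999]  # the second key is almost fully masked

soft = soft_masked_weights(scores, gates)
hard = hard_pruned_weights(scores, gates)
max_gap = max(abs(a - b) for a, b in zip(soft, hard))
print(soft, hard, max_gap)
```

When the gates sit near 0 or 1, the soft and hard attention weights nearly coincide, which is one way to read "train-inference consistency"; gates near the threshold would produce a larger gap between the two.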

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications