TL;DR
PonderLM adds a pondering process to language models: within a single token generation step, the model feeds a probability-weighted sum of token embeddings back into itself for additional forward passes. The behavior is learned through self-supervised training, without human annotations, and improves performance across multiple benchmarks.
Contribution
This work pioneers the integration of a pondering mechanism into language models, enhancing their reasoning capabilities without requiring human annotations.
Findings
Pondering improves model performance on 9 downstream benchmarks.
PonderPythia-2.8B outperforms larger models like Pythia-6.9B.
PonderPythia-1B matches larger models trained on more data.
Abstract
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. On 9 downstream…
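The pondering step described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_forward` is a hypothetical stand-in for a real language model, the embedding table is invented, and the sketch simply replaces the input with the pondering embedding, since this summary does not say how the paper combines it with the original input.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pondering_embedding(probs, embedding_table):
    """Probability-weighted sum of all token embeddings (no sampling)."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]

def toy_forward(x, embedding_table):
    """Hypothetical model: score each token by its dot product with x."""
    return [sum(xi * ei for xi, ei in zip(x, row)) for row in embedding_table]

def ponder(x, embedding_table, steps):
    """Run `steps` pondering passes, then one final pass for the prediction."""
    for _ in range(steps):
        probs = softmax(toy_forward(x, embedding_table))
        # Assumption: the weighted embedding replaces the input outright.
        x = pondering_embedding(probs, embedding_table)
    return softmax(toy_forward(x, embedding_table))

# Invented 3-token vocabulary with 2-dimensional embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
start = EMBEDDINGS[0]
final_probs = ponder(start, EMBEDDINGS, steps=3)
print(final_probs)
```

With `steps=3` the sketch runs four forward passes for one token: three pondering passes plus the final pass that yields the distribution a token would be sampled from. Because every step is a weighted sum rather than a sample, the whole loop stays differentiable, which is what allows it to be trained with an ordinary self-supervised objective.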
Peer Reviews
Decision·ICLR 2026 Poster
The idea is simple yet effective, and it does not rely on any external supervision. The proposed mechanism is easy to implement and can be plugged into standard Transformer architectures with minimal changes. The empirical results are solid and sufficiently demonstrate the effectiveness of the proposed approach. The observation that performance improves monotonically with more pondering steps suggests that the method provides a controllable way to trade compute for performance.
Since the method introduces additional iterative passes beyond the standard forward pass, it incurs non-trivial training and inference cost, which may become particularly expensive for larger models and longer sequences.
Strong consistent results, good that it only needs general corpus data, comprehensive set of experiments.
**W1.** Limited novelty: a very simple change and very similar to prior methods. But perhaps this is not a weakness as the results seem good.
**W2.** 4x compute at inference time. With LLMs actually being used now, inference cost is important. I think they should perhaps therefore be compared to 4x larger models which will have the same inference cost. In this case the performance is less strong.
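The compute-matched comparison suggested in W2 can be made concrete with a rough calculation. The sketch below takes the reviewer's 4x figure at face value and assumes the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass; it ignores attention cost over long sequences and memory effects, and the function names are illustrative only.

```python
def forward_flops_per_token(n_params):
    """Rule-of-thumb estimate (an assumption): ~2 FLOPs per parameter per token."""
    return 2 * n_params

def pondering_flops_per_token(n_params, compute_multiplier):
    """Per-token inference cost when each token costs several forward passes."""
    return compute_multiplier * forward_flops_per_token(n_params)

def matched_dense_params(n_params, compute_multiplier):
    """Size of a standard model with the same per-token inference FLOPs."""
    return n_params * compute_multiplier

ponder_params = 2.8e9   # PonderPythia-2.8B
multiplier = 4          # the reviewer's "4x compute at inference time"

matched = matched_dense_params(ponder_params, multiplier)
print(f"compute-matched standard model: {matched / 1e9:.1f}B parameters")
```

Under these assumptions, the compute-matched baseline for PonderPythia-2.8B would be a standard model of roughly 11B parameters, larger than the Pythia-6.9B comparison cited in the findings, which is the gap the reviewer is pointing at.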
1. The paper introduces a differentiable, self-supervised pondering loop that feeds a probability-weighted embedding back into the model within a single token generation step. This elegant mechanism eliminates the discrete bottleneck imposed by vocabulary spaces during internal computation. By conceptualizing pondering as a third scaling axis, orthogonal to both parameter scaling and test-time CoT scaling, the work offers a novel perspective on model scaling dynamics. Moreover, demonstrating tha
1. The paper’s motivation and theoretical foundation appear insufficiently developed. It remains unclear why repeating the forward pass within a single token-generation step should improve performance. The current justification—an analogy to human “slow thinking”—is conceptually interesting but lacks a mechanistic explanation or connection to established findings in neural or cognitive science. Providing a clearer rationale, ideally supported by formal analysis of how the proposed weighted-embed
Taxonomy
Methods·Pythia
