Change of Thought: Adaptive Test-Time Computation
Mrinal Mathur, Mike Doan, Barak Pearlmutter, Sergey Plis

TL;DR
The paper introduces the SELF-Transformer, an encoder layer that adaptively refines its attention weights at test time, improving accuracy on encoder benchmarks by up to 20% without increasing parameter count.
Contribution
It proposes a novel self-iterative attention mechanism that enhances the expressive power of encoder Transformers through input-adaptive refinement at test time.
Findings
Up to 20% accuracy improvements on encoder benchmarks.
Achieves these gains without increasing parameter count.
Demonstrates the effectiveness of input-adaptive attention refinement.
Abstract
Transformers evaluated in a single, fixed-depth pass are provably limited in expressive power to the constant-depth circuit class TC0. Running a Transformer autoregressively removes that ceiling -- first in next-token prediction and, more recently, in chain-of-thought reasoning. Both regimes rely on feedback loops that decode internal states into tokens only to re-encode them in subsequent steps. While this "thinking aloud" mirrors human reasoning, biological brains iterate without externalising intermediate states as language. To boost the expressive power of encoder Transformers without resorting to token-level autoregression, we introduce the SELF-Transformer: an encoder layer that iteratively refines its own attention weights to a fixed point. Instead of producing -- in one pass -- the alignment matrix that remixes the input sequence, the SELF-Transformer iteratively updates that…
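The idea of refining attention weights to a fixed point can be illustrated with a minimal sketch. This is not the authors' update rule, which the truncated abstract does not spell out. It assumes identity query/key/value projections, a single head, and a stopping test on the largest change in any attention weight. The names `fixed_point_attention`, `eps`, and `k_max` are hypothetical; the latter two loosely mirror the convergence tolerance and iteration cap that the reviews mention.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(z):
    # Scaled dot-product scores between all rows of z.
    # Assumption: identity query/key projections, single head.
    scale = 1.0 / math.sqrt(len(z[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(zi, zj)) for zj in z])
        for zi in z
    ]

def remix(weights, x):
    # Mix the rows of the original input x with the attention weights.
    d = len(x[0])
    return [
        [sum(w * xj[c] for w, xj in zip(row, x)) for c in range(d)]
        for row in weights
    ]

def fixed_point_attention(x, eps=1e-6, k_max=50):
    # Hypothetical refinement loop: recompute the weights from the
    # current remixed sequence until they stop changing (largest change
    # below eps) or the iteration cap k_max is reached.
    weights = attention_weights(x)
    for step in range(1, k_max + 1):
        z = remix(weights, x)
        new_weights = attention_weights(z)
        delta = max(
            abs(a - b)
            for ra, rb in zip(new_weights, weights)
            for a, b in zip(ra, rb)
        )
        weights = new_weights
        if delta < eps:
            break
    return weights, remix(weights, x), step

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output, steps = fixed_point_attention(x)
```

Because the number of iterations depends on how quickly the weights settle for a given input, compute in this sketch adapts per input while the parameter count stays unchanged. Convergence is not guaranteed in general, which is why the loop is capped at `k_max`.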
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The paper offers a more granular approach to adaptive computation compared to prior work that typically repeats entire blocks. 2. The use of implicit differentiation is a major technical strength, making the iterative approach practical by avoiding the memory explosion that would occur with standard backpropagation through time. The compute-matched comparisons (Appendix G) are nice.
1. The claim of being parameter-free is a bit misleading. While FPSA adds no new learnable model weights, it introduces several crucial hyperparameters that require tuning: the convergence tolerance $\epsilon$, the maximum number of iterations $K_{\max}$, and the gradient clipping threshold. The learned halting variant (FPSA-LH) further adds a small gating MLP and a ponder cost hyperparameter. The paper lacks a sensitivity analysis for these hyperparameters, which seem critical to the method's perf
- The proposed approach, based on fixed-point iteration, seems quite reasonable. - The paper provides comprehensive analysis and results showing how FPSA works and how it performs across various downstream tasks.
- First, I feel the overall presentation could be much improved. Some useful explanations and experimental results are hidden in the appendices (e.g., fixed-point iteration is not properly explained in the main body, and neither are some qualitative results). It would help to reorganize the content for the reader. - As the authors mention, this mechanism seems closely related to recursive / looped Transformer architectures. Discussion of recent papers is lacking. And I'm curious about c
1. The proposed structure is simple and maintains constant memory. Iterative refinement within the attention sublayer trained via implicit differentiation; avoids storing the inner unroll and heavy checkpointing. Architectural simplicity is preserved. 2. The proposed structure improves the performance with the same parameters. It improves size-matched encoder baselines on GLUE/SQuAD and shows benefits for ViT/VL, supporting generality. 3. The writing of the paper is clear. It clearly conveys
1. The main idea of the proposed iterative structure is to scale the computation. In other words, the performance gain comes with the increased training and inference computation. The training cost is not reported in the current work. The inference computation comparison is missing. Almost all the results in the paper show that the proposed structure performs better than vanilla attention with more computation. However, a fair comparison is to constrain the computation budget of both models, whi
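Several of the reviews above credit implicit differentiation with keeping training memory constant. As a general sketch of why that works (this is the standard implicit-function-theorem argument for fixed-point layers, not a derivation taken from the paper; the symbols $f$, $z^\ast$, $\theta$, and $\mathcal{L}$ are assumptions), suppose the layer output satisfies $z^\ast = f(z^\ast, x; \theta)$ and that $I - J_f$ is invertible:

```latex
\begin{aligned}
z^\ast &= f(z^\ast, x; \theta), \qquad
J_f = \left.\frac{\partial f}{\partial z}\right|_{z = z^\ast} \\[4pt]
\frac{\partial z^\ast}{\partial \theta}
  &= J_f \,\frac{\partial z^\ast}{\partial \theta}
   + \frac{\partial f}{\partial \theta}
\;\;\Longrightarrow\;\;
\frac{\partial z^\ast}{\partial \theta}
  = \left(I - J_f\right)^{-1} \frac{\partial f}{\partial \theta} \\[4pt]
\frac{\partial \mathcal{L}}{\partial \theta}
  &= \frac{\partial \mathcal{L}}{\partial z^\ast}
     \left(I - J_f\right)^{-1}
     \frac{\partial f}{\partial \theta}
\end{aligned}
```

The gradient depends only on quantities evaluated at the fixed point $z^\ast$, so the intermediate iterates need not be stored, regardless of how many refinement steps were taken in the forward pass.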
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing
