Change of Thought: Adaptive Test-Time Computation

Mrinal Mathur; Mike Doan; Barak Pearlmutter; Sergey Plis

arXiv:2507.13569·cs.LG·July 21, 2025

Open Access · 3 Reviews

TL;DR

The paper introduces SELF-Transformer, an encoder layer that adaptively refines attention weights at test time, significantly improving accuracy on benchmarks without increasing model size.

Contribution

It proposes a novel self-iterative attention mechanism that enhances expressive power of encoder Transformers through input-adaptive refinement at test time.

Findings

1. Up to 20% accuracy improvements on encoder benchmarks.
2. Achieves these gains without increasing parameter count.
3. Demonstrates the effectiveness of input-adaptive attention refinement.

Abstract

Transformers evaluated in a single, fixed-depth pass are provably limited in expressive power to the constant-depth circuit class $\mathsf{TC}^0$. Running a Transformer autoregressively removes that ceiling -- first in next-token prediction and, more recently, in chain-of-thought reasoning. Both regimes rely on feedback loops that decode internal states into tokens only to re-encode them in subsequent steps. While this "thinking aloud" mirrors human reasoning, biological brains iterate without externalising intermediate states as language. To boost the expressive power of encoder Transformers without resorting to token-level autoregression, we introduce the SELF-Transformer: an encoder layer that iteratively refines its own attention weights to a fixed point. Instead of producing -- in one pass -- the alignment matrix that remixes the input sequence, the SELF-Transformer iteratively updates that…
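The idea of refining attention weights to a fixed point can be sketched in a few lines. The abstract is cut off before it states the actual update rule, so everything below is an assumption made for illustration: the query and key projections are taken to be the identity, the update recomputes the softmax alignment from the currently remixed sequence, and the names `refine_attention`, `eps`, and `k_max` are hypothetical (the last two mirror the convergence tolerance and iteration cap that the reviews mention). It is a toy sketch in pure Python, not the paper's layer.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain nested-list matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def attention_weights(h, scale):
    """Row-wise softmax of scaled dot products h h^T.

    ASSUMPTION: identity query/key projections, chosen only to keep the toy small.
    """
    n = len(h)
    scores = [[scale * sum(x * y for x, y in zip(h[i], h[j])) for j in range(n)]
              for i in range(n)]
    return [softmax(r) for r in scores]


def refine_attention(x, eps=1e-6, k_max=100):
    """Iterate A <- softmax((A x)(A x)^T / sqrt(d)) until A stops changing.

    ASSUMPTION: this update rule is illustrative, not taken from the paper.
    Returns (weights, steps_taken, converged).
    """
    scale = 1.0 / math.sqrt(len(x[0]))
    a = attention_weights(x, scale)           # ordinary one-pass attention weights
    for step in range(1, k_max + 1):
        h = matmul(a, x)                      # remix the input with the current weights
        a_next = attention_weights(h, scale)  # recompute the alignment from the remix
        delta = max(abs(p - q)
                    for row_new, row_old in zip(a_next, a)
                    for p, q in zip(row_new, row_old))
        a = a_next
        if delta < eps:                       # input-adaptive stopping criterion
            return a, step, True
    return a, k_max, False


if __name__ == "__main__":
    tokens = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
    weights, steps, converged = refine_attention(tokens)
    print(steps, converged)
```

The number of steps taken depends on the input, which is the sense in which the computation is adaptive; inputs with larger norms may need more iterations, or may not converge at all under this toy update.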

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper offers a more granular approach to adaptive computation than prior work, which typically repeats entire blocks.
2. The use of implicit differentiation is a major technical strength: it makes the iterative approach practical by avoiding the memory explosion that would occur with standard backpropagation through time. The compute-matched comparisons (Appendix G) are nice.
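The memory argument behind implicit differentiation can be illustrated with a scalar toy problem. For a fixed point $z^* = f(z^*, w)$, the implicit function theorem gives $dz^*/dw = (\partial f/\partial w) / (1 - \partial f/\partial z)$, evaluated at $z^*$, so the gradient needs only the converged value and none of the intermediate iterates. The map $f(z, w) = \tanh(wz + x)$ and the function names below are assumptions chosen for illustration; this is not the paper's FPSA layer.

```python
import math


def solve_fixed_point(w, x, eps=1e-12, k_max=1000):
    """Iterate z <- tanh(w*z + x) from z = 0 until successive iterates agree."""
    z = 0.0
    for _ in range(k_max):
        z_next = math.tanh(w * z + x)
        if abs(z_next - z) < eps:
            return z_next
        z = z_next
    return z


def implicit_grad_w(w, x):
    """dz*/dw from the implicit function theorem, using only the fixed point."""
    z = solve_fixed_point(w, x)
    s = 1.0 - z * z        # tanh'(w*z + x) at the fixed point, because tanh(w*z + x) = z
    df_dz = w * s          # partial derivative of f with respect to z
    df_dw = z * s          # partial derivative of f with respect to w
    return df_dw / (1.0 - df_dz)


def finite_difference_grad_w(w, x, h=1e-5):
    """Central-difference check that re-solves the fixed point twice."""
    return (solve_fixed_point(w + h, x) - solve_fixed_point(w - h, x)) / (2.0 * h)


if __name__ == "__main__":
    print(implicit_grad_w(0.5, 0.3), finite_difference_grad_w(0.5, 0.3))
```

Backpropagating through the unrolled loop would instead store every iterate, so its memory grows with the number of iterations; the implicit formula keeps it constant. The division assumes $|\partial f/\partial z| < 1$ at the fixed point, which holds here whenever $|w| < 1$.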

Weaknesses

1. The claim of being parameter-free is a bit misleading. While FPSA adds no new learnable model weights, it introduces several crucial hyperparameters that require tuning: the convergence tolerance $\epsilon$, the maximum number of iterations $K_{\max}$, and the gradient clipping threshold. The learned halting variant (FPSA-LH) further adds a small gating MLP and a ponder cost hyperparameter. The paper lacks a sensitivity analysis for these hyperparameters, which seem critical to the method's perf

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The proposed approach, based on fixed-point iteration, seems reasonable and sound.
- The paper provides comprehensive analysis and results showing how FPSA actually works and how it performs across various downstream tasks.

Weaknesses

- First, I feel the overall presentation could be much improved. Some useful explanations and experimental results are hidden in the Appendix (e.g., fixed-point iteration is not properly explained in the main body, and neither are some qualitative results). It would be good to reorganize the content so it is clear to the reader.
- As the authors mention, this mechanism seems closely related to recursive / looped transformer architectures. There is a lack of discussion of recent papers. And I'm curious about c

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The proposed structure is simple and maintains constant memory. Iterative refinement within the attention sublayer is trained via implicit differentiation, which avoids storing the inner unroll and heavy checkpointing. Architectural simplicity is preserved.
2. The proposed structure improves performance with the same number of parameters. It improves on size-matched encoder baselines on GLUE/SQuAD and shows benefits for ViT/VL, supporting generality.
3. The writing of the paper is clear. It clearly conveys

Weaknesses

1. The main idea of the proposed iterative structure is to scale the computation. In other words, the performance gain comes with the increased training and inference computation. The training cost is not reported in the current work. The inference computation comparison is missing. Almost all the results in the paper show that the proposed structure performs better than vanilla attention with more computation. However, a fair comparison is to constrain the computation budget of both models, whi

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing