Focus on Likely Classes for Test-Time Prediction
Johannes Schneider

TL;DR
This paper introduces two test-time fine-tuning methods that focus on likely classes to improve uncertain model predictions, demonstrating accuracy gains across text and image models without relying on hand-engineered augmentations.
Contribution
The paper proposes novel test-time fine-tuning techniques that refine predictions by focusing on likely classes, leveraging shared features among classes without auxiliary tasks.
Findings
Accuracy improvements on diverse models
Effective in both text and image domains
Refinement via gradient steps enhances predictions
Abstract
We ask: Can focusing on likely classes of a single, in-domain sample improve model predictions? Prior work argued "no". We put forward a novel rationale in favor of "yes": Sharedness of features among classes indicates their reliability for a single sample. We aim for an affirmative answer without using hand-engineered augmentations or auxiliary tasks. We propose two novel test-time fine-tuning methods to improve uncertain model predictions. Instead of greedily selecting the most likely class, we introduce an additional step, focus on the likely classes, to refine predictions. By applying a single gradient descent step with a large learning rate, we refine predictions when an initial forward pass indicates high uncertainty. The experimental evaluation demonstrates accuracy gains for one of our methods on average, which emphasizes shared features among likely classes. The…
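The abstract's procedure (an initial forward pass, an uncertainty check, then one gradient step with a large learning rate) can be sketched on a toy linear softmax classifier. This is a minimal illustration under stated assumptions, not the paper's implementation: the loss (negative log of the probability mass on the top-k classes), the names `focus_step`, `top2_margin`, `k`, `threshold`, and `lr`, and their default values are all hypothetical. The gate uses the difference between the top two probabilities, which is how the reviews below describe the uncertainty measure.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(weights, x):
    # Linear classifier: one weight row per class.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def top2_margin(probs):
    # Uncertainty proxy: a small gap between the two largest
    # probabilities means the prediction is uncertain.
    first, second = sorted(probs, reverse=True)[:2]
    return first - second

def focus_step(weights, x, k=2, threshold=0.2, lr=1.0):
    """One gated gradient step for a single sample (assumed loss)."""
    probs = softmax(logits_of(weights, x))
    if top2_margin(probs) >= threshold:
        return weights, False  # confident enough: keep the model as is
    # Indices of the k most likely classes.
    likely = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)[:k]
    mass = sum(probs[j] for j in likely)
    # Gradient of -log(mass) with respect to each logit.
    grad = [p - (p / mass if j in likely else 0.0) for j, p in enumerate(probs)]
    # Chain rule for a linear layer: dL/dW[j][i] = grad[j] * x[i].
    new_weights = [[w - lr * g * xi for w, xi in zip(row, x)]
                   for row, g in zip(weights, grad)]
    return new_weights, True

if __name__ == "__main__":
    W = [[1.0, 0.2], [0.9, 0.3], [-1.0, 0.0]]  # 3 classes, 2 features
    x = [1.0, 0.5]
    new_W, adapted = focus_step(W, x)
    print(adapted, softmax(logits_of(W, x)), softmax(logits_of(new_W, x)))
```

In this linear toy the step only shifts probability mass toward the likely classes and cannot reorder them, because each likely logit grows in proportion to its own probability. Any change in the predicted class would have to come from updating layers whose features are shared across classes, which this sketch deliberately leaves out.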
Peer Reviews
Decision: Submitted to ICLR 2026
**Simple and Practical**: Elegantly simple - single gradient step on logits when uncertainty is high. Uncertainty measure (difference between top two probabilities) requires no calibration. Architecture-agnostic with easy implementation (code in Appendix B.2). Requires only one extra forward-backward pass. **Broad Evaluation**: 70+ model-dataset pairs across vision (ImageNet on ResNet/DenseNet/EfficientNet/MobileNet/ViT) and language (GPT-2, Llama, QWEN, Fox-1, StableLM, Gemma on diverse corpor
**Limited Novelty**: Test-time gradient adaptation is established in TTA/domain adaptation. Main distinction (multiple likely classes vs. single class) is incremental. No comparison with existing TTA methods (Tent, TTT, MEMO) or calibration methods (temperature scaling, Platt scaling). Single-step optimization is a practical trick, not a conceptual advance. **Modest Gains Without Context**: Consistent 1-2% improvements but missing: (a) wall-clock time overhead measurements, (b) comparison with
1. Efficient single-step optimization with large LR approximates multi-step results, minimizing computational overhead. 2. Comprehensive evaluation across diverse models (e.g., ViTs, ResNets, LLMs like GPT-2, Llama) and datasets (ImageNet, OpenWebText, etc.), demonstrating consistent gains for iFo (e.g., up to 2.2% on WideResNet). 3. Ablations on hyperparameters (LR, uncertainty threshold, iterations) and comparisons (e.g., input tuning) provide thorough insights. 4. Practical applicability: No
1. Main concern: The method's primary motivation and approach seem to have appeared in prior TTA work [1] (Selective Label Enhancement Learning for Test-Time Adaptation, ICLR); authors need to further explain and strengthen the novelty and advantages of their method. 2. Images should be optimized for display and layout, preferably using vector graphics. 3. Some typos exist, e.g., line 266 has an extra ".".
Novel yet simple idea – The “focus on likely classes” concept is intuitive and differs from classical entropy minimization or confidence-based TTA. The method is tested on diverse image (CNNs/ViTs) and text (GPT-2, LLaMA-3, Gemma-3, etc.) models, showing broad empirical coverage. Only one gradient step and per-sample adaptation make the method computationally efficient.
Although the paper cites Tent and related works, it does not directly compare with them in experiments (e.g., Tent, TTT++, CoTTA, etc.). As a test-time method, more quantitative comparisons with test-time adaptation approaches would strengthen claims. The reported gains are relatively modest (e.g., +0.1–0.3%), which raises concerns about their practical significance. While the aggregated metrics (mean, standard deviation, and p-values) provide some support, they are not entirely convincing. In
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
