When and Where to Reset Matters for Long-Term Test-Time Adaptation

Taejun Lim; Joong-Won Hwang; Kibok Lee

arXiv:2603.03796·cs.LG·March 5, 2026

Open Access·3 Reviews

TL;DR

This paper introduces an adaptive reset strategy for long-term test-time adaptation that selectively resets parts of the model, uses importance-aware regularization to retain crucial knowledge, and dynamically adjusts adaptation to improve robustness under domain shifts.

Contribution

It proposes a novel adaptive and selective reset method with importance-aware regularization and on-the-fly adjustment, addressing limitations of previous reset strategies in long-term TTA.

Findings

01·Significantly reduces model collapse in long-term TTA

02·Improves adaptation performance under challenging domain shifts

03·Outperforms existing reset strategies on benchmark datasets

Abstract

When continual test-time adaptation (TTA) persists over the long term, errors accumulate in the model and further cause it to predict only a few classes for all inputs, a phenomenon known as model collapse. Recent studies have explored reset strategies that completely erase these accumulated errors. However, their periodic resets lead to suboptimal adaptation, as they occur independently of the actual risk of collapse. Moreover, their full resets cause catastrophic loss of knowledge acquired over time, even though such knowledge could be beneficial in the future. To this end, we propose (1) an Adaptive and Selective Reset (ASR) scheme that dynamically determines when and where to reset, (2) an importance-aware regularizer to recover essential knowledge lost due to reset, and (3) an on-the-fly adaptation adjustment scheme to enhance adaptability under challenging domain shifts. Extensive…
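To make the "when and where to reset" idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a collapse score defined as one minus the normalized entropy of the softmax of batch-averaged logits (the reviews below describe Eq. 1 only as "the entropy of average logits", so the exact form is a guess). The names `prediction_concentration` and `maybe_reset`, the `threshold` value, and the rule that the number of restored layers grows with the score are all hypothetical.

```python
import numpy as np

def prediction_concentration(logits):
    """Hypothetical collapse score in [0, 1] for a (batch, classes) logit array.

    Assumption: 1 - normalized entropy of softmax(mean logits). A score of 0
    means predictions are spread evenly; a score near 1 means one class dominates.
    """
    mean_logits = logits.mean(axis=0)
    z = mean_logits - mean_logits.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()              # softmax of the averaged logits
    entropy = -(p * np.log(p + 1e-12)).sum()
    return 1.0 - entropy / np.log(p.size)

def maybe_reset(params, source_params, logits, threshold=0.5):
    """Assumed rule: reset only when the score exceeds `threshold` ("when"),
    and restore a number of trailing layers proportional to the score ("where").

    `params` and `source_params` are dicts of layer name -> array, in layer order.
    Returns the (possibly updated) parameters and the list of reset layer names.
    """
    score = prediction_concentration(logits)
    if score <= threshold:
        return params, []                        # low collapse risk: keep adapting
    names = list(params)
    n_reset = max(1, int(round(score * len(names))))
    reset_names = names[-n_reset:]
    new_params = dict(params)
    for name in reset_names:
        new_params[name] = source_params[name].copy()  # restore source weights
    return new_params, reset_names
```

In this sketch a batch with evenly spread predictions leaves the model untouched, while a batch dominated by one class restores some or all layers to their source weights. The abstract's importance-aware regularizer and on-the-fly adjustment scheme are not modeled here.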

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

- empirical performance of the proposed algorithm outperforms the considered methods
- results are clearly presented, and the paper is easy to follow
- considered experiments are well presented, relevant ablations performed

Weaknesses

The method is overall incremental, and the components of the method are heuristic fixes to the collapse problem. The scientific depth of the study is limited; I see limited value beyond what is already known in the field. The method fundamentally does not overcome the issue that reset is required to prevent collapse in longer-term test-time adaptation. While there are certainly gains over the state of the art, they seem marginal and the amount of engineering to get these 1%-point improvements…

Reviewer 02·Rating 6·Confidence 4

Strengths

The studied problem of when and where to reset is a very practical problem, and it is a key step to enhance the stability of TTA under long-term and large-scale real-world application settings. The proposed methods (from fixed or heuristic resets to a data-driven and risk-aware reset strategy) are simple yet effective. The combination of adaptive reset timing, layer-wise selective reset, and Fisher-based knowledge recovery makes the overall method cohesive and effective.

Weaknesses

The proposed method involves many hyperparameters, making it potentially difficult to tune in real-world online testing scenarios. Could the authors clarify how these hyperparameters are determined and whether ASR is sensitive to them?

Reviewer 03·Rating 4·Confidence 4

Strengths

The primary strength is the novel ASR mechanism, which offers a more motivated and flexible alternative to naive periodic resets by dynamically linking the reset trigger and scope to a quantifiable measure of model collapse (prediction concentration). This adaptive approach is intuitive and addresses a clear limitation of prior work. The method is validated by extensive experiments on long-term TTA benchmarks, demonstrating significant performance gains, particularly on the challenging CCC-Hard

Weaknesses

1. The definition of "prediction concentration" (Eq. 1), which is central to the reset trigger, is based on the entropy of average logits. This metric seems sensitive to factors not fully explored: it's unclear if it's robust to logits of different magnitudes, and its dependency on batch composition (batch size, class distribution) is a concern. Furthermore, the supporting correlation in Figure 3 lacks context, as the dataset and settings used to generate it are not specified. Moreover, the pap

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Topic Modeling