COME: Test-time adaption by Conservatively Minimizing Entropy

Qingyang Zhang; Yatao Bian; Xinke Kong; Peilin Zhao; Changqing Zhang

arXiv:2410.10894·stat.ML·October 16, 2024

Open Access 3 Reviews

TL;DR

COME introduces a conservative entropy minimization approach for test-time adaptation, modeling uncertainty with a Dirichlet prior to improve model stability and accuracy on open-world data.

Contribution

It proposes a novel method that replaces traditional entropy minimization with a Dirichlet-based regularization to address overconfidence in test-time adaptation.

Findings

01. Achieves state-of-the-art accuracy improvements up to 34.5%.

02. Reduces false positive rate by up to 15.1%.

03. Enhances stability and uncertainty estimation in various TTA settings.

Abstract

Machine learning models must continuously self-adjust to novel data distributions in the open world. As the predominant principle, entropy minimization (EM) has proven to be a simple yet effective cornerstone of existing test-time adaption (TTA) methods. Unfortunately, its fatal limitation (i.e., overconfidence) tends to result in model collapse. To address this issue, we propose to Conservatively Minimize the Entropy (COME), a simple drop-in replacement for traditional EM that elegantly addresses this limitation. In essence, COME explicitly models uncertainty by characterizing a Dirichlet prior distribution over model predictions during TTA. By doing so, COME naturally regularizes the model to favor conservative confidence on unreliable samples. Theoretically, we provide a preliminary analysis to reveal the ability of COME in enhancing the optimization stability by…
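To make the overconfidence issue concrete, the following is a minimal sketch of the standard softmax-entropy objective that EM-based TTA methods minimize. It is not the COME loss; the function names and the example logits are illustrative assumptions. It shows that simply scaling the logits lowers the entropy without changing the predicted class, so the objective can be reduced by inflating confidence alone.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for a 3-class problem (not taken from the paper).
logits = [2.0, 1.0, 0.5]

# Scaling the logits sharpens the softmax: the entropy drops while the
# argmax stays the same, i.e. confidence grows without new information.
for scale in (1.0, 2.0, 5.0):
    probs = softmax([scale * z for z in logits])
    print(f"scale={scale}: max prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

The printed entropies decrease as the scale grows, which is the behavior the abstract refers to as overconfidence.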

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

This paper accurately spots the paradox of EM's learning objective: minimization of entropy leads to over-confidence. And the paper proposes a simple yet effective solution to minimize entropy with respect to a probability distribution that faithfully estimates the uncertainty without over-confidence. It is a very reasonable idea to differentiate between the statistics used for prediction and for uncertainty estimation, which has long been considered the same in the TTA literature. Therefore, th

Weaknesses

These are not necessarily weaknesses but rather some questions that I would like to confirm with the authors.
1. How does the algorithm ensure that $b_k$ is non-negative for the computation of entropy, since $b_k$ is implemented as $(e^{f_k(x)}-1)/\sum_{k'} e^{f_{k'}(x)}$, which could be negative?
2. Why does the algorithm keep $u$ close to $u_0$? Does it imply that the uncertainty estimation for the pretrained model is trusted? What if the pretrained model is over-confident? What about the alter
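The reviewer's first question can be checked numerically. The sketch below evaluates the belief mass exactly as quoted in the review, $b_k = (e^{f_k(x)}-1)/\sum_{k'} e^{f_{k'}(x)}$. The uncertainty mass $u = K/\sum_{k'} e^{f_{k'}(x)}$ is an assumption here: it follows from requiring the beliefs and the uncertainty to sum to one, which this excerpt does not state. The function name and the logits are hypothetical, and this is not presented as the authors' implementation.

```python
import math

def belief_and_uncertainty(logits):
    """Belief masses per the formula quoted in the review.

    b_k = (exp(f_k) - 1) / sum_k' exp(f_k')
    u   = K / sum_k' exp(f_k')   (assumed, so that sum(b) + u == 1)
    """
    exps = [math.exp(z) for z in logits]
    strength = sum(exps)
    belief = [(e - 1.0) / strength for e in exps]
    uncertainty = len(logits) / strength
    return belief, uncertainty

# Hypothetical logits; the last one is negative, so exp(f_k) < 1.
belief, u = belief_and_uncertainty([3.0, 0.5, -1.0])
print("belief:", [round(b, 4) for b in belief])
print("uncertainty:", round(u, 4))
print("has negative belief:", any(b < 0 for b in belief))
```

Under this reading, any class with a negative logit receives a negative belief mass, which is the case the reviewer asks about; the excerpt does not show how the authors handle it.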

Reviewer 02 · Rating 5 · Confidence 5

Strengths

- This paper is well-motivated, and the story makes sense.
- Extensive experiments have been done to support the proposed method.

Weaknesses

My major concerns include:
- For the proposed method: Why is the Dirichlet distribution used? How is the Dirichlet distribution related to the final algorithm in Algorithm 1? In addition, what is the role of delta in Algorithm 1? It seems that the authors tell a long story about their algorithm, but the algorithm itself is rather simple.
- For the theoretical analysis: Could the authors provide a more detailed (theoretical) comparison between the proposed method and traditional EM? What is the benef

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The proposed algorithm introduces a rejection mechanism for unreliable samples in the TTA process, preventing the model from learning from potentially noisy labeled data. It is simple to integrate into existing TTA frameworks, and the experimental results indicate satisfactory performance.

Weaknesses

The proposed method and its theoretical analysis rely heavily on existing techniques, which limits its technical novelty. The core concept shares some similarity with research on learning with rejection. It is recommended to discuss how the proposed loss function compares with the loss functions used in learning with rejection, as outlined in [1]. There is a lack of experiments involving real-world applications with distribution shifts, as exemplified in [2]. Testing the proposed algorithm on

Taxonomy

Topics: Advanced Vision and Imaging · Advanced MRI Techniques and Applications · Sparse and Compressive Sensing Techniques