PEER pressure: Model-to-Model Regularization for Single Source Domain Generalization

Dong Kyu Cho; Inwoo Hwang; Sanghack Lee

arXiv:2505.12745·cs.LG·May 20, 2025

Open Access · 4 Reviews

TL;DR

PEER introduces a model-to-model regularization technique that stabilizes training and improves single source domain generalization by using a proxy model to accumulate knowledge and reduce feature distortion.

Contribution

The paper proposes PEER, a novel regularization method employing a proxy model and mutual information maximization to enhance domain generalization and reduce performance fluctuation.

Findings

01

PEER reduces out-of-distribution performance fluctuation.

02

PEER achieves state-of-the-art results with simple augmentation.

03

PEER outperforms prior methods on multiple datasets.

Abstract

Data augmentation is a popular tool for single source domain generalization, which expands the source domain by generating simulated ones, improving generalization on unseen target domains. In this work, we show that the performance of such augmentation-based methods in the target domains universally fluctuates during training, posing challenges in model selection under realistic scenarios. We argue that the fluctuation stems from the inability of the model to accumulate the knowledge learned from diverse augmentations, exacerbating feature distortion during training. Based on this observation, we propose a novel generalization method, coined Parameter-Space Ensemble with Entropy Regularization (PEER), that uses a proxy model to learn the augmented data on behalf of the main model. The main model is updated by averaging its parameters with the proxy model, progressively accumulating…
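The parameter-averaging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_params`, the dictionary-of-lists parameter layout, and the mixing weight `alpha` (0.5 by default) are assumptions made here for illustration, and the abstract does not state how often or with what weight the averaging is performed.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def average_params(main: Params, proxy: Params, alpha: float = 0.5) -> Params:
    """Interpolate main-model weights toward proxy-model weights.

    alpha is an assumed mixing weight: 0.0 keeps the main model unchanged,
    1.0 copies the proxy model, 0.5 is a plain average of the two.
    """
    if main.keys() != proxy.keys():
        raise ValueError("main and proxy must share the same parameter names")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    merged: Params = {}
    for name in main:
        if len(main[name]) != len(proxy[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        merged[name] = [
            (1.0 - alpha) * m + alpha * p
            for m, p in zip(main[name], proxy[name])
        ]
    return merged


# Toy usage: the proxy has trained on augmented data, the main model absorbs it.
main = {"w": [0.0, 2.0], "b": [1.0]}
proxy = {"w": [2.0, 4.0], "b": [3.0]}
merged = average_params(main, proxy)
print(merged)  # {'w': [1.0, 3.0], 'b': [2.0]}
```

In this sketch the main model never sees the augmented data directly; it only changes through the averaging step, which is one way to read the abstract's claim that knowledge is accumulated on behalf of the main model.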

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The overall structure is logical, with a clear flow from the identification of the problem to the proposed solution and subsequent empirical validation. The experimental setup and baselines are well described.

Weaknesses

1. The core idea of parameter averaging is not new, and the contribution may not be seen as significantly advancing beyond existing ensemble and regularization techniques.
2. The authors argue that data augmentation leads to feature distortion. However, it is unclear what feature distortion means, and its existence is not substantiated by theoretical or experimental analysis.
3. The authors also argue that better generalization correlates with larger fluctuation during training. However, the emp

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The studied problem is practical and interesting. The overall findings are reasonable, and the proposed method appears technically sound. Compared with RandAug, the method achieves significant improvements.

Weaknesses

The paper still lacks key discussions and comparisons to clarify its originality and contributions. Please refer to my questions for detailed comments.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* The paper is well-written and well-motivated.
* They investigate different components of their method thoroughly and in detail with different ablations.
* Their method is simple but effective in most cases.

Weaknesses

* The novelty of their method is limited and incremental. They combined the Barlow Twins [1] loss and teacher-student architecture methods (i.e., DINO [2]) in a supervised setting.
* The connection between the variance of target domain accuracy and generalizability is missing. The primary motivation of the paper is to decrease the augmentation distortion to improve the model generalization (Table 1, Figure 3, and Table 5), which they measure with OOD fluctuation. However, in Table 5, a pretrai

Reviewer 04 · Rating 3 · Confidence 5

Strengths

The authors show that the more dissimilar the target domain is from the source domain, the larger the fluctuation. The fluctuation also becomes smaller when data augmentation reduces the dissimilarity.

Weaknesses

- The novelty of the parameter-averaging methodology is limited. Is the proposed methodology specialized for domain generalization? How does it differ from other ensemble methods?
- The authors argue that the parameter-averaged task model guides the proxy model's learning process, reducing the OOD fluctuation. However, the motivation behind the argument is not well explained. Is there any theoretical guarantee for this argument?
- For the OOD fluctuation, the authors investigate two

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Face recognition and analysis

Methods: Entropy Regularization