Frustratingly Easy Model Generalization by Dummy Risk Minimization
Juncheng Wang, Jindong Wang, Xixu Hu, Shujun Wang, Xing Xie

TL;DR
This paper introduces Dummy Risk Minimization (DuRM), a simple yet effective method that enhances the generalization of empirical risk minimization by enlarging the dimension of the output logits, validated through theoretical and extensive empirical analysis across diverse tasks.
Contribution
The paper proposes DuRM, a straightforward technique that improves ERM generalization by increasing output dimension, supported by theoretical insights and broad empirical validation.
Findings
DuRM increases gradient variance, aiding flat minima discovery.
DuRM consistently improves performance across various tasks and datasets.
DuRM is compatible with existing generalization methods.
Abstract
Empirical risk minimization (ERM) is a fundamental machine learning paradigm. However, its generalization ability is limited in various tasks. In this paper, we devise Dummy Risk Minimization (DuRM), a frustratingly easy and general technique to improve the generalization of ERM. DuRM is extremely simple to implement: just enlarging the dimension of the output logits and then optimizing using standard gradient descent. Moreover, we validate the efficacy of DuRM on both theoretical and empirical analysis. Theoretically, we show that DuRM derives greater variance of the gradient, which facilitates model generalization by observing better flat local minima. Empirically, we conduct evaluations of DuRM across different datasets, modalities, and network architectures on diverse tasks, including conventional classification, semantic segmentation, out-of-distribution generalization, adversarial…
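The core idea in the abstract (append extra "dummy" output logits and train with the usual cross-entropy) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the example logits, the number of dummy classes, and the choice to predict over the real classes only are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def durm_loss_and_grad(logits, label, num_classes):
    """Cross-entropy over ALL logits (real + dummy).

    Assumption: labels only ever index real classes, so the dummy
    logits act purely as extra terms in the softmax denominator.
    """
    assert 0 <= label < num_classes <= len(logits)
    probs = softmax(logits)
    loss = -math.log(probs[label])
    # Gradient of cross-entropy w.r.t. the logits: softmax minus one-hot.
    grad = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    return loss, grad

def predict(logits, num_classes):
    """Assumed inference rule: argmax restricted to the real classes."""
    real = logits[:num_classes]
    return max(range(num_classes), key=lambda i: real[i])

NUM_CLASSES = 3   # real classes (hypothetical)
NUM_DUMMY = 2     # extra dummy classes (hypothetical hyperparameter)

# Output of a hypothetical model head with NUM_CLASSES + NUM_DUMMY units.
logits = [2.0, 0.5, -1.0, 0.3, 0.1]
loss, grad = durm_loss_and_grad(logits, label=0, num_classes=NUM_CLASSES)

print("loss:", loss)
print("grad w.r.t. logits:", grad)
print("prediction:", predict(logits, NUM_CLASSES))
```

In this sketch the dummy logits receive a positive gradient (they are pushed down) on every example, and the loss is larger than the plain ERM loss on the first `NUM_CLASSES` logits because the dummy terms enlarge the softmax denominator. Whether and how this changes gradient variance is the subject of the paper's theoretical analysis and is not demonstrated here.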
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
- The paper presents lots of empirical experiments & evidence to support the proposed method.
- Authors considered diverse tasks, datasets, architectures, and settings and demonstrated DuRM outperforms in most cases.
- DuRM is demonstrated to improve over other regularization methods, and the approach is straightforward to incorporate into existing training pipelines. This makes it practically appealing.
Overall, the method is simple, and the authors presented a holistic analysis of their ap
- The authors tried to explain the behavior of DuRM with some theory, but the assumptions are weak and unrealistic, and the arguments are unclear.
- Eq 4: $p_c$ is assumed to follow a mixture of Gaussians. However, $0 \leq p_c \leq 1$, and a Gaussian assumption assigns non-zero probability to confidences that are negative or greater than 1.
- Similarly, for Thm 2, $g$ is assumed to be Gaussian instead of a mixture of Gaussians. The justification in Footnote 1 is unclear. What is implied b
1. The proposed method performs consistently on a variety of tasks, such as image classification, domain generalisation, and semantic segmentation with different architectures.
2. The paper is well-written and the main idea is easy to follow. But still, some details are not clear and sometimes confusing. I will discuss them in the later sections.
3. The idea provides a novel view of the variance of the gradient, which is opposite to previous work on variance. For example: Johnson R, Zhang T. Accelera
1. When taking the mean and std of both ERM and DuRM, the performance distributions overlap a lot, so the improvement is not significant.
2. The intuition behind the theoretical results is not well presented.
3. The effect of hyperparameters is not well discussed in the submission. It is unclear how they affect the results, as the improvement is not significant. DomainBed [1] eliminates the effects of hyperparameter settings by humans.
[1] Gulrajani I, Lopez-Paz D. In search of lost domain generalization. arXiv preprin
The proposed DuRM is indeed very simple. If it really works, it could be interesting.
My biggest concern is that many claims in the submission are not very sound, and the proposed method is not sufficiently justified either theoretically or empirically. After reading the manuscript, I am not convinced at all that DuRM could really improve over ERM, except that it provides more hyperparameters so increases the degree of freedom for hyperparameter tuning. Thus, I recommend rejecting this submission.
Here are my detailed comments:
### 1. Regarding the usage of the term "generalizat
The proposed paradigm is interesting and might bring new insight to the ML community.
A significant weakness of this paper is that its theoretical analyses are limited. Specifically, the theoretical results are highly dependent on the strong assumption (i.e., Gaussian distribution), which is insufficient for a model. It would be better to see a result related to the underlying model parameters or training process. Another concern is whether the attacks used to evaluate adversarial robustness are consistent. In Tab. 5, are the models trained by ERM/DuRM attacked by their corresponding training loss? I
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Topic Modeling
