Reducing Class-Wise Performance Disparity via Margin Regularization
Beier Zhu, Kesen Zhao, Jiequan Cui, Qianru Sun, Yuan Zhou, Xun Yang, Hanwang Zhang

TL;DR
This paper introduces MR$^2$, a theoretically grounded regularization method that dynamically adjusts margins in neural networks to reduce class-wise accuracy disparities, especially improving performance on hard classes without sacrificing overall accuracy.
Contribution
The paper proposes a novel margin regularization technique, MR$^2$, with a theoretical analysis and practical implementation that effectively reduces class-wise performance disparity in neural networks.
Findings
MR$^2$ improves accuracy on hard classes across datasets.
The method reduces performance disparity without sacrificing easy class accuracy.
Experiments on ImageNet and other datasets validate the effectiveness of MR$^2$.
Abstract
Deep neural networks often exhibit substantial disparities in class-wise accuracy, even when trained on class-balanced data, posing concerns for reliable deployment. While prior efforts have explored empirical remedies, a theoretical understanding of such performance disparities in classification remains limited. In this work, we present Margin Regularization for Performance Disparity Reduction (MR$^2$), a theoretically principled regularization for classification that dynamically adjusts margins in both the logit and representation spaces. Our analysis establishes a margin-based, class-sensitive generalization bound that reveals how per-class feature variability contributes to error, motivating the use of larger margins for hard classes. Guided by this insight, MR$^2$ optimizes per-class logit margins proportional to feature spread and penalizes excessive representation margins to…
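The idea of per-class logit margins proportional to feature spread can be illustrated with a minimal sketch. This is not the authors' implementation: the spread measure (mean distance to the class centroid), the normalization, the `base_margin` parameter, and all function names are assumptions made for illustration, and the representation-margin penalty is omitted because the abstract above is truncated before describing it. The sketch also assumes every class has at least one sample.

```python
import math

def class_spreads(features, labels, num_classes):
    """Assumed spread measure: mean Euclidean distance of a class's
    features to that class's centroid."""
    dim = len(features[0])
    spreads = []
    for c in range(num_classes):
        members = [f for f, y in zip(features, labels) if y == c]
        centroid = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        dists = [math.dist(f, centroid) for f in members]
        spreads.append(sum(dists) / len(dists))
    return spreads

def margins_from_spreads(spreads, base_margin=0.5):
    """Margins proportional to spread, rescaled so their mean is base_margin
    (a hypothetical hyperparameter)."""
    mean_spread = sum(spreads) / len(spreads)
    return [base_margin * s / mean_spread for s in spreads]

def margin_cross_entropy(logits, label, margins):
    """Cross-entropy after subtracting the true class's margin from its logit,
    so classes with larger margins must be separated more strongly."""
    adjusted = list(logits)
    adjusted[label] -= margins[label]
    top = max(adjusted)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(z - top) for z in adjusted))
    return log_sum - adjusted[label]

# Toy data: class 0 is tightly clustered, class 1 is widely spread.
features = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [3.0, 1.0]]
labels = [0, 0, 1, 1]

spreads = class_spreads(features, labels, num_classes=2)
margins = margins_from_spreads(spreads)
loss = margin_cross_entropy([2.0, 1.0], label=1, margins=margins)
```

In this toy setup the widely spread class receives the larger margin, which raises its loss for the same logits and so pushes training to separate it further, in line with the motivation of larger margins for hard classes.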
Peer Reviews
Decision: ICLR 2026 Poster
The paper is clearly written, and the ideas it explores are relevant to our broader understanding of optimization. The theoretical outline is well presented and easy to follow, and the experimental setup is generally sound and consistent with prior work. The proposed approach makes a meaningful contribution by improving performance on hard classes without hurting the easier ones, leading to a more balanced overall accuracy across classes.
**W1)**: I believe this work, given its focus on margin geometry and embedding compactness, overlooks a closely related and highly relevant area known as Neural Collapse (Papyan et al., 2020). This phenomenon shows that in over-parameterized networks—such as those considered in this paper—class embeddings tend to collapse to a single prototype per class with maximal inter-class separation as training progresses. Subsequent works have analyzed Neural Collapse under class imbalance (Behnia et al.,
1. The paper introduces a theoretically motivated solution to the problem.
2. The paper is well structured with relevant experiments.
1. Experiments are done on an older setup: The experiments use older SOTA setups. Newer setups, such as Sharpness-Aware Minimization (SAM) [R1] and WideResNets, have not been considered for comparison. Hence, the performance reported for datasets like CIFAR-10 and ImageNet is much lower than the current SOTA. Further, margin-based algorithms like LDAM, which are compared with MR$^2$, perform much better when compared to SAM [R2].
2. Missing Comparison: There are some contrastive learni
* By providing background on the class disparity problem, an increasingly critical issue in modern classification settings, and analyzing its underlying causes, this study effectively establishes the motivation for addressing this problem.
* The study also supports the validity of the proposed approach with solid theoretical analysis, comprising precisely stated theorems and corresponding proofs.
* Furthermore, extensive experiments conducted on a wide range of datasets, including fine-grained ben
**W1.** As illustrated in Eq. 13 (lines 286–291), there exists a trade-off between the first and second terms depending on the value of $\gamma$. Although this trade-off is bounded through Corollary 1, further tuning of the coefficient $\bar{c}$ is still required. Combined with the other hyperparameter $\lambda$, this increases the overall burden of hyperparameter tuning.
**W2.** The proposed approach indirectly verifies its effectiveness in addressing the clas
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
