A margin-based replacement for cross-entropy loss
Michael W. Spratling, Heiko H. Schütt

TL;DR
The paper introduces HEM loss, a margin-based alternative to cross-entropy, which improves robustness and generalization across various classification tasks without sacrificing much accuracy.
Contribution
The authors propose high error margin (HEM) loss, a versatile margin-based loss function that outperforms or matches specialized losses across multiple classification challenges.
Findings
HEM loss improves robustness in adversarial settings.
HEM performs well in imbalanced data and continual learning.
HEM is a general-purpose replacement for cross-entropy loss.
Abstract
Cross-entropy (CE) loss is the de-facto standard for training deep neural networks to perform classification. However, CE-trained deep neural networks struggle with robustness and generalisation issues. To alleviate these issues, we propose high error margin (HEM) loss, a variant of multi-class margin loss that overcomes the training issues of other margin-based losses. We evaluate HEM extensively on a range of architectures and datasets. We find that HEM loss is more effective than cross-entropy loss across a wide range of tasks: unknown class rejection, adversarial robustness, learning with imbalanced data, continual learning, and semantic segmentation (a pixel-level classification task). Despite all training hyper-parameters being chosen for CE loss, HEM is inferior to CE only in terms of clean accuracy and this difference is insignificant. We also compare HEM to specialised losses…
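For orientation, the difference between CE and a plain multi-class margin loss can be sketched in a few lines. This is the generic hinge-style margin loss that the abstract says HEM is a variant of, not the HEM loss itself, whose exact form is not given on this page. The logits, the margin of 1.0, and the function names are illustrative assumptions.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def multi_class_margin(logits, target, margin=1.0):
    """Generic multi-class hinge loss, divided by the number of classes.

    Assumption: this is the standard margin loss, NOT the paper's HEM loss.
    """
    penalties = [
        max(0.0, margin - logits[target] + z)
        for i, z in enumerate(logits)
        if i != target
    ]
    return sum(penalties) / len(logits)

# Hypothetical logits for a sample whose true class is 0.
logits = [4.0, 1.0, -2.0]
print(cross_entropy(logits, 0))       # small but strictly positive
print(multi_class_margin(logits, 0))  # 0.0, since every margin is satisfied
```

The sample is classified correctly with room to spare, yet CE still returns a positive loss and so keeps pushing the logits apart, while the margin loss is exactly zero once the target logit exceeds every other logit by the margin.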
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper is well written.
- The proposed approach is technically sound.
- The empirical experiments conducted span a wide range of datasets.
- The proposed method seems to work on a wide range of problems.
- The proposed method seems to lack some theoretical justification. Some theoretical analysis of the proposed loss function, and of why the proposed HEM loss is better than regular margin loss, would further strengthen the paper.
- The claim to replace CCE loss seems somewhat aggressive to me. From Figure 2, the proposed loss function still underperforms CCE loss in the clean-data scenarios pretty significantly.
- Following up on the previous point, in order to claim HEM as a "replacement" for CCE l
The paper proposes the High Error Margin (HEM) loss, a margin-based alternative to cross-entropy (CE) for classification tasks. The motivation is clear: CE has well-documented limitations, including non-zero penalties for correctly classified samples and a tendency toward overconfident predictions on unseen data, leading to mis-calibration, poor robustness to out-of-distribution detection, and catastrophic forgetting. HEM addresses these issues by adaptively averaging high-error logits (those
The claim of general superiority is somewhat too broad. Figures 1 and 2 report a large number of experiments, but the main text provides insufficient explanation of these results and their configurations, with many key details relegated to the appendices. Given the wide range of applications tested, comparing only three alternative losses (apart from CE and MM losses) may be insufficient. For imbalanced data, in the literature CE is often used in combination with other techniques—such as rando
- The authors address an interesting topic.
- I like that, in addition to the standard classification, other aspects such as unbalanced data, adversarial robustness, continual learning, and semantic segmentation are also considered.
- The small toy example (Table 1) is useful for illustrating the differences between, or problems with, the losses.
- The paper is easy to read and understand. In particular, I think Section 2 is appropriate as motivation. I find Appendix A useful for explaining the vario
- In Figure 1, I would add information about how many different datasets and networks were averaged.
- The main paper states that 18 networks and 18 datasets are used. I find this somewhat misleading, as 5 datasets and 3 networks are initially used for the classification experiments (and another 4 for semantic segmentation), and the other datasets only represent extensions for the respective task.
- For the attack experiments, the MSP or MLS is used for the thresholding. Entropy would also be in
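The thresholding scores named in the last comment can be sketched as follows, reading MSP as maximum softmax probability and MLS as maximum logit score, which is their usual meaning in the rejection literature. Whether the paper computes them exactly this way is an assumption, and the function names and the threshold are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability: higher means more confident."""
    return max(softmax(logits))

def mls_score(logits):
    """Maximum logit score: higher means more confident."""
    return max(logits)

def neg_entropy_score(logits):
    """Negative entropy of the softmax output, so higher still means more confident."""
    return sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def reject_as_unknown(logits, threshold, score=msp_score):
    """Flag a sample as unknown when its confidence score is below the threshold."""
    return score(logits) < threshold

# Hypothetical logits: an ambiguous sample and a confident one.
print(reject_as_unknown([0.0, 0.0, 0.0], threshold=0.5))   # True
print(reject_as_unknown([10.0, 0.0, 0.0], threshold=0.5))  # False
```

Entropy uses the whole predicted distribution rather than only its largest entry, which is one reason it is a natural third score to compare against MSP and MLS.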
Taxonomy
Topics: Optical Network Technologies
