A margin-based replacement for cross-entropy loss

Michael W. Spratling; Heiko H. Schütt

arXiv:2501.12191·cs.LG·January 22, 2025

Open Access 3 Reviews

TL;DR

The paper introduces HEM loss, a margin-based alternative to cross-entropy, which improves robustness and generalization across various classification tasks without sacrificing much accuracy.

Contribution

The authors propose high error margin (HEM) loss, a versatile margin-based loss function that outperforms or matches specialized losses across multiple classification challenges.

Findings

01. HEM loss improves robustness in adversarial settings.

02. HEM performs well in imbalanced data and continual learning.

03. HEM is a general-purpose replacement for cross-entropy loss.

Abstract

Cross-entropy (CE) loss is the de-facto standard for training deep neural networks to perform classification. However, CE-trained deep neural networks struggle with robustness and generalisation issues. To alleviate these issues, we propose high error margin (HEM) loss, a variant of multi-class margin loss that overcomes the training issues of other margin-based losses. We evaluate HEM extensively on a range of architectures and datasets. We find that HEM loss is more effective than cross-entropy loss across a wide range of tasks: unknown class rejection, adversarial robustness, learning with imbalanced data, continual learning, and semantic segmentation (a pixel-level classification task). Despite all training hyper-parameters being chosen for CE loss, HEM is inferior to CE only in terms of clean accuracy and this difference is insignificant. We also compare HEM to specialised losses…
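The abstract describes HEM as a variant of multi-class margin loss. The sketch below is not the HEM loss itself, whose definition is not given on this page. It only contrasts plain softmax cross-entropy with a standard multi-class hinge (margin) loss, to illustrate why a margin-based loss stops penalising samples that are already classified with enough margin. The function names, the margin of 1.0, the averaging over non-target classes, and the example logits are all assumptions made for illustration.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample (log-sum-exp form for stability)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def multiclass_margin(logits, target, margin=1.0):
    """Standard multi-class hinge loss for one sample.

    Each non-target class contributes max(0, margin - (target logit - its logit));
    the contributions are averaged over the non-target classes (an assumed
    normalisation; libraries differ on this detail).
    """
    errors = [max(0.0, margin - logits[target] + z)
              for i, z in enumerate(logits) if i != target]
    return sum(errors) / len(errors)

# Hypothetical logits: class 0 is the target and wins by more than the margin.
confident = [4.0, 1.0, 0.5]
# Hypothetical logits: class 0 still wins, but by less than the margin.
narrow = [1.2, 1.0, 0.5]

for name, logits in [("confident", confident), ("narrow", narrow)]:
    print(name, cross_entropy(logits, 0), multiclass_margin(logits, 0))
```

For the confident sample the margin loss is exactly zero while cross-entropy stays positive, so cross-entropy keeps pushing the logits apart. For the narrow sample both losses are positive.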

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper is well written.
- The proposed approach is technically sound.
- The empirical experiments conducted span a wide range of datasets.
- The proposed method seems to work on a wide range of problems.

Weaknesses

- The proposed method seems to lack some theoretical justification. Some theoretical analysis on the proposed loss function, and on why the proposed HEM loss is better than regular margin loss can further strengthen the paper.
- The claim to replace CCE loss is somewhat aggressive to me. From Figure 2, the proposed loss function still underperforms CCE loss in the clean-data scenarios pretty significantly.
- Following up on the previous point, in order to claim HEM as a "replacement" for CCE l

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper proposes the High Error Margin (HEM) loss, a margin-based alternative to cross-entropy (CE) for classification tasks. The motivation is clear: CE has well-documented limitations, including non-zero penalties for correctly classified samples and a tendency toward overconfident predictions on unseen data, leading to mis-calibration, poor robustness to out-of-distribution detection, and catastrophic forgetting. HEM addresses these issues by adaptively averaging high-error logits (those

Weaknesses

The claim of general superiority is somewhat too broad. Figures 1 and 2 report a large number of experiments, but the main text provides insufficient explanation of these results and their configurations, with many key details relegated to the appendices. Given the wide range of applications tested, comparing only three alternative losses (apart from CE and MM losses) may be insufficient. For imbalanced data, in the literature CE is often used in combination with other techniques—such as rando

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The authors address an interesting topic.
- I like that, in addition to the standard classification, other aspects such as unbalanced data, adversarial robustness, continual learning, and semantic segmentation are also considered.
- The small toy example (Table 1) is useful for illustrating the differences between or problems with losses.
- The paper is easy to read and understand. In particular, I think section 2 is appropriate as motivation. I find Appendix A useful for explaining the vario

Weaknesses

- In Figure 1, I would add information about how many different datasets and networks were averaged.
- The main paper states that 18 networks and 18 datasets are used. I find this somewhat misleading, as 5 datasets and 3 networks are initially used for the classification experiments (and another 4 for semantic segmentation), and the other datasets only represent extensions for the respective task.
- For the attack experiments, the MSP or MLS is used for the thresholding. Entropy would also be in
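The thresholding scores mentioned in the last point can be sketched as follows. This is a minimal illustration, assuming MSP stands for maximum softmax probability and MLS for maximum logit score, as is common in the unknown-class rejection literature. The entropy score, the function names, the threshold, and the example logits are assumptions and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp(logits):
    """Maximum softmax probability: higher means more confident."""
    return max(softmax(logits))

def mls(logits):
    """Maximum logit score: the largest raw logit, no normalisation."""
    return max(logits)

def neg_entropy(logits):
    """Negative entropy of the softmax output, so that higher still means
    more confident, matching the direction of MSP and MLS."""
    return sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def accept(score, threshold):
    """Accept the prediction as a known class if the score reaches the threshold."""
    return score >= threshold

# Hypothetical logits for a peaked and a flat prediction.
peaked = [5.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0]

for name, logits in [("peaked", peaked), ("flat", flat)]:
    print(name, msp(logits), mls(logits), neg_entropy(logits))
```

All three scores rank the peaked prediction above the flat one. They differ in that MLS depends on the absolute scale of the logits, whereas MSP and entropy depend only on the differences between them.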

Taxonomy

Topics · Optical Network Technologies