Why is SAM Robust to Label Noise?
Christina Baek, Zico Kolter, Aditi Raghunathan

TL;DR
This paper investigates why Sharpness-Aware Minimization (SAM) is particularly effective under label noise, revealing that its robustness mainly stems from its influence on the network Jacobian rather than explicit logit weighting.
Contribution
The paper provides a theoretical analysis of SAM's robustness to label noise, highlighting the Jacobian effect as the key factor, and proposes cheaper alternatives that replicate its benefits.
Findings
SAM's robustness is primarily due to its effect on the network Jacobian.
Explicit logit weighting in SAM does not significantly impact performance.
Cheaper methods mimicking SAM's Jacobian effect achieve similar robustness.
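The split between a "logit" effect and a "Jacobian" effect can be sketched with the chain rule. The notation below is an assumption made for illustration (cross-entropy loss, logits f, softmax σ, one-hot label e_y, perturbation radius ρ) and is not copied from the paper:

```latex
% Assumed notation: f(w, x) are the logits, J(w, x) = \partial f / \partial w,
% \sigma is the softmax, e_y the one-hot label, \rho the SAM perturbation radius.
\begin{align*}
\nabla_w \ell(w; x, y) &= J(w, x)^\top \big(\sigma(f(w, x)) - e_y\big), \\
\epsilon &= \rho \, \frac{\nabla_w \ell(w; x, y)}{\lVert \nabla_w \ell(w; x, y) \rVert_2}, \\
\nabla_w \ell(w + \epsilon; x, y) &=
  \underbrace{J(w + \epsilon, x)^\top}_{\text{Jacobian term}}
  \underbrace{\big(\sigma(f(w + \epsilon, x)) - e_y\big)}_{\text{logit term}}.
\end{align*}
```

Under this reading, the logit effect is the change SAM induces in the second factor and the Jacobian effect is the change it induces in the first.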
Abstract
Sharpness-Aware Minimization (SAM) is best known for achieving state-of-the-art performance on natural image and language tasks. However, its most pronounced improvements (of tens of percent) are instead in the presence of label noise. Understanding SAM's label noise robustness requires a departure from characterizing the robustness of minima lying in "flatter" regions of the loss landscape. In particular, the peak performance under label noise occurs with early stopping, far before the loss converges. We decompose SAM's robustness into two effects: one induced by changes to the logit term and the other induced by changes to the network Jacobian. The first can be observed in linear logistic regression where SAM provably up-weights the gradient contribution from clean examples. Although this explicit up-weighting is also observable in neural networks, when we intervene and modify SAM to…
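The logistic-regression claim in the abstract can be illustrated with a minimal NumPy sketch. It assumes labels in {-1, +1}, a per-example perturbation (often called 1-SAM), and the heuristic that clean examples are the ones with lower loss early in training. The function names (`gd_weights`, `one_sam_weights`, `sam_step`) are hypothetical and not taken from the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_weights(w, X, y):
    """Per-example gradient scale sigmoid(-y * <w, x>) under plain gradient descent."""
    margins = y * (X @ w)
    return sigmoid(-margins)

def one_sam_weights(w, X, y, rho):
    """Per-example gradient scale after a per-example SAM perturbation of radius rho."""
    margins = y * (X @ w)
    # For a linear model the normalized per-example ascent direction is -y * x / ||x||,
    # so the perturbed margin is the original margin minus rho * ||x||.
    return sigmoid(-(margins - rho * np.linalg.norm(X, axis=1)))

def sam_step(w, X, y, lr, rho):
    """One descent step using gradients evaluated at the per-example perturbed weights."""
    scale = one_sam_weights(w, X, y, rho)
    per_example_grads = -(scale * y)[:, None] * X
    return w - lr * per_example_grads.mean(axis=0)

if __name__ == "__main__":
    # Toy data (assumption): the first point is fit correctly (low loss, treated as clean),
    # the second has the opposite label (high loss, treated as noisy).
    w = np.array([2.0, 0.0])
    X = np.array([[1.0, 0.0], [1.0, 0.0]])
    y = np.array([1.0, -1.0])

    gd = gd_weights(w, X, y)
    sam = one_sam_weights(w, X, y, rho=1.0)
    print("clean/noisy weight ratio, GD :", gd[0] / gd[1])
    print("clean/noisy weight ratio, SAM:", sam[0] / sam[1])
```

In this toy setup the gradient scale of the high-loss point is already close to saturation, so the perturbation mostly raises the scale of the low-loss point; that is the sense in which the contribution of low-loss (presumed clean) examples is up-weighted relative to plain gradient descent.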
Peer Reviews
Decision: ICLR 2024 poster
Understanding the effect of SAM is of paramount interest due to the popularity of this technique. The baselines and the experiments are designed to directly answer the questions. The theory, although rather simple, is neither known nor trivial. The related work is adequately covered.
The following concerns are the reasons for the low score and I can raise the score if all three are addressed. **1. Little evidence on the role of early stopping.** Most of the narrative highlights that SAM is especially effective when combined with early stopping. The importance of early stopping in the analysis is emphasized throughout the paper. However, when I look at the ResNet experiments in Fig 1 and 3, early stopping seems to have little to no effect, and the difference in performance i
- Provide refreshing insights on robustness of SAM to input labels through the lens of implicit regularization
- Overall the paper is well written and is easy to follow
- No analysis/empirical demonstrations on tasks other than classification are provided (e.g., regression tasks)
- Missing discussions/analysis on how the robustness benefits depend on parameters such as number of parameters, number of training samples, learning rate, etc. (see also Questions below)
- Missing some references in Related Work, e.g.: https://arxiv.org/abs/1609.04836, https://arxiv.org/abs/1705.10694
* **Originality.** Although the robustness of SAM towards label noise has been discussed before, this paper shows that, surprisingly, the logit effect is in fact not important for this robustness.
* **Clarity.** The paper is well-written and easy to read.
* **Significance.** The paper examines an interesting and important question in understanding SAM.
* Equation 4.5 includes a stop gradient operator in a minimization target, which, to the reviewer's knowledge, is a non-standard way of writing it. The reviewer would recommend rephrasing it as an update rule.
Taxonomy
Topics: Fault Detection and Control Systems · Infrastructure Maintenance and Monitoring · Advanced Chemical Sensor Technologies
Methods: Sharpness-Aware Minimization · Logistic Regression
