Switch EMA: A Free Lunch for Better Flatness and Sharpness
Siyuan Li, Zicheng Liu, Juanxi Tian, Ge Wang, Zedong Wang, Weiyang Jin, Di Wu, Cheng Tan, Tao Lin, Yang Liu, Baigui Sun, and Stan Z. Li

TL;DR
Switch EMA (SEMA) is a simple modification to exponential moving average (EMA) that helps neural networks reach optima with a better trade-off between flatness and sharpness, improving generalization at no extra cost across diverse tasks and models.
Contribution
The paper introduces SEMA, a one-line change to EMA training: after each epoch, the EMA parameters are copied into the original model, which then continues training from them. The authors report that this improves flatness and generalization in DNN training.
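The switch step can be illustrated with a toy sketch. This is not the authors' implementation: the function name `train_sema`, the 1-D quadratic loss, the noise model, and the values of `lr` and `momentum` are all assumptions chosen only to keep the example self-contained. It shows a standard EMA update after every optimizer step, plus the end-of-epoch switch described above.

```python
import random


def train_sema(steps_per_epoch=50, epochs=5, lr=0.1, momentum=0.99,
               switch=True, seed=0):
    """Toy 1-D run (hypothetical): noisy SGD on f(w) = 0.5 * w**2 with an EMA copy."""
    rng = random.Random(seed)
    w = 5.0        # online ("original") parameter, updated by SGD
    w_ema = w      # EMA copy of the parameter
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            grad = w + rng.gauss(0.0, 1.0)   # gradient of 0.5 * w**2 plus noise
            w -= lr * grad                   # plain SGD step on the online model
            # standard EMA update after each optimizer step
            w_ema = momentum * w_ema + (1.0 - momentum) * w
        if switch:
            # the switch: copy the EMA parameter into the online model
            w = w_ema
    return w, w_ema


if __name__ == "__main__":
    print("with switch   :", train_sema(switch=True))
    print("without switch:", train_sema(switch=False))
```

With `switch=False` the loop reduces to ordinary EMA, where the averaged copy never influences training. In a real framework the same idea would apply to every parameter tensor; how buffers and optimizer state are handled at the switch is not specified here and would need to follow the paper's code.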
Findings
SEMA improves performance across vision and language tasks.
SEMA accelerates convergence.
SEMA enhances the flatness-sharpness trade-off.
Abstract
Exponential Moving Average (EMA) is a widely used weight averaging (WA) regularization to learn flat optima for better generalizations without extra cost in deep neural network (DNN) optimization. Despite achieving better flatness, existing WA methods might fall into worse final performances or require extra test-time computations. This work unveils the full potential of EMA with a single line of modification, i.e., switching the EMA parameters to the original model after each epoch, dubbed as Switch EMA (SEMA). From both theoretical and empirical aspects, we demonstrate that SEMA can help DNNs to reach generalization optima that better trade-off between flatness and sharpness. To verify the effectiveness of SEMA, we conduct comparison experiments with discriminative, generative, and regression tasks on vision and language datasets, including image classification, self-supervised…
Peer Reviews
Decision·Submitted to ICLR 2025
The authors conduct comprehensive experiments in diverse areas such as image classification, self-supervised learning, object detection, video prediction, and language modeling, highlighting the advantages of the proposed method.
Although the proposed method demonstrates superior performance, I believe this paper is not yet ready for publication for the following several reasons.
Sharpness vs. Flatness? The concepts of sharpness and flatness are unclear and deviate from conventional terminology. The authors argue that sharpness measures the depth of local minima while flatness assesses their width. However, I believe that when evaluating local minima in terms of depth and width, it is essential to consider the re
- The paper proposes a practical algorithm that is widely applicable and stably improves the optimization of neural networks trained with stochastic gradient descent.
- The theoretical properties are well analyzed to explain why a simple EMA improvement is effective.
- The effectiveness of the proposed algorithm is evaluated through extensive experiments.
- Although the experiments provided in the paper are basically extensive, further verification of the smoother decision boundary by SEMA discussed in Sec. 3.2 would be beneficial to the reader. The paper mentions the performance against fine-grained discrimination in L268, but in fact, the main focus of the experiments is on coarse-grained evaluation. For example, the performance on FGVCAircraft [a], CUB [b], Stanford Cars [c], etc., which are often used to evaluate fine-grained classification,
1. The paper is very well written and it was easy for me to follow.
2. The paper focuses on the key issues of optimizers and creatively improves existing methods, presenting a novel approach.
3. The paper demonstrates the effectiveness of the method through extensive experiments and theoretical analysis.
1. The legend in the upper right corner of each curve plot in Figure 2 is quite small, requiring significant magnification to be readable.
2. There is no clear explanation of the data distribution visualization formats used in Figure 4 and Figure A.1.
Taxonomy
Topics: Soil Mechanics and Vehicle Dynamics · Postharvest Quality and Shelf Life Management · Plant Physiology and Cultivation Studies
