SeWA: Selective Weight Average via Probabilistic Masking

Peng Wang; Shengchao Hu; Zerui Tao; Guoxia Wang; Dianhai Yu; Li Shen; Quan Zheng; Dacheng Tao

arXiv:2502.10119·cs.LG·February 17, 2025

Open Access · 3 Reviews

TL;DR

SeWA introduces an adaptive, probabilistic checkpoint selection method for weight averaging that improves model generalization and convergence with minimal hyperparameter tuning, validated across multiple domains.

Contribution

The paper presents SeWA, a novel probabilistic framework for checkpoint selection in weight averaging, reducing manual tuning and improving performance.

Findings

01. SeWA achieves better generalization with fewer checkpoints.

02. Theoretical bounds show sharper stability guarantees than SGD.

03. Experimental results confirm effectiveness across domains.

Abstract

Weight averaging has become a standard technique for enhancing model performance. However, methods such as Stochastic Weight Averaging (SWA) and Latest Weight Averaging (LAWA) often require manually designed procedures to sample from the training trajectory, and the results depend heavily on hyperparameter tuning. To minimize human effort, this paper proposes a simple yet efficient algorithm called Selective Weight Averaging (SeWA), which adaptively selects checkpoints during the final stages of training for averaging. Based on SeWA, we show that only a few points are needed to achieve better generalization and faster convergence. Theoretically, solving the discrete subset selection problem is inherently challenging. To address this, we transform it into a continuous probabilistic optimization framework and employ the Gumbel-Softmax estimator to learn the non-differentiable mask for…
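To make the idea in the abstract concrete, here is a minimal sketch of averaging checkpoints under a mask, together with a binary Gumbel-Softmax relaxation of that mask. It is an illustration under stated assumptions, not the authors' implementation: checkpoints are treated as flat lists of floats, the function names (`sample_gumbel`, `stable_sigmoid`, `relaxed_mask`, `masked_average`), the keep probabilities, and the temperature `tau` are all hypothetical, and the step that actually learns the keep probabilities is omitted.

```python
import math
import random


def sample_gumbel(rng):
    """Draw Gumbel(0, 1) noise by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def stable_sigmoid(z):
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relaxed_mask(keep_probs, tau, rng):
    """Binary Gumbel-Softmax relaxation: one soft keep-weight per checkpoint.

    keep_probs and tau are illustrative inputs (assumptions), not values
    taken from the paper.
    """
    mask = []
    for p in keep_probs:
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        keep = math.log(p) + sample_gumbel(rng)
        drop = math.log(1.0 - p) + sample_gumbel(rng)
        # A two-class softmax over the perturbed log-probabilities
        # reduces to a sigmoid of their difference, scaled by 1 / tau.
        mask.append(stable_sigmoid((keep - drop) / tau))
    return mask


def masked_average(checkpoints, mask):
    """Average flat weight vectors, weighting each checkpoint by its mask entry."""
    total = sum(mask)
    if total <= 0:
        raise ValueError("mask selects no checkpoints")
    dim = len(checkpoints[0])
    return [
        sum(m * ckpt[j] for m, ckpt in zip(mask, checkpoints)) / total
        for j in range(dim)
    ]


# Three toy "checkpoints", each a flat weight vector.
checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Hard (binary) mask: average only the first and last checkpoint.
hard = masked_average(checkpoints, [1, 0, 1])  # [3.0, 4.0]

# Soft mask sampled from assumed keep probabilities.
soft_mask = relaxed_mask([0.9, 0.1, 0.9], tau=0.5, rng=random.Random(0))
soft = masked_average(checkpoints, soft_mask)
```

With a hard mask the function reduces to an ordinary average over the selected checkpoints. The soft mask keeps every entry between 0 and 1, so the average varies smoothly with the keep probabilities, which is what allows a gradient-based method to adjust them. Lowering `tau` pushes the soft entries closer to 0 or 1.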

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This paper is well motivated and has strong theoretical backing. The paper provides both flatness convergence arguments and stability-based generalization bounds, comparing favorably with SGD, SWA, and LAWA under convex and non-convex assumptions.
2. The experiments span three domains (RL, vision, and text), showing consistent improvements across varied architectures and data distributions.

Weaknesses

1. While the paper emphasizes SeWA’s simplicity, its sample-based optimization introduces additional forward passes and probability updates that are not accounted for in the baseline comparisons. A fair comparison should include this additional computational cost.
2. Except for the RL experiments, most performance curves in the figures show nearly overlapping trajectories. The visual and quantitative margins are subtle when the results are presented only as curve plots.
3. The sample-based optimization us

Reviewer 02 · Rating 4 · Confidence 4

Strengths

Continuing the line of work on weight averaging is a good idea; this class of methods improves the generalization ability of DNNs.

Weaknesses

1. The results are insufficient: except for Table 2, the results (e.g., Figures 3 and 4 for image classification and text classification) do not show distinct improvements, so the claimed competitiveness lacks convincing support; otherwise, the performance improvements look incremental in general setups. Besides, the evaluated network architectures and datasets could be extended for comprehensiveness.
2. The paper claims particular effectiveness on RL, which may have more unstable training. Would it be possibl

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* The paper provides tighter theoretical results for convex and non-convex settings (the proofs were not checked).
* The paper uses standard smoothness and Lipschitz assumptions.
* The experiments cover diverse domains: image, text, and locomotion trajectories.

Weaknesses

* In practice, if you select the final model on validation at multiple checkpoints, you would need to run SeWA at each validation point (or at least repeatedly near the end), which can be expensive compared to the other approaches that do not require this procedure and can be used directly. The paper would benefit from reporting wall-clock time per SeWA run, relative to one training epoch, for the chosen $K$, $k$, $M$, and max_iterations in each experiment. Also, the paper should specify the num

Taxonomy

Topics: Emotion and Mood Recognition · Context-Aware Activity Recognition Systems · Human Pose and Action Recognition

Methods: Stochastic Gradient Descent · Stochastic Weight Averaging