Trainable Weight Averaging: Accelerating Training and Improving Generalization
Tao Li, Zhehao Huang, Yingwen Wu, Zhengbao He, Qinghua Tao, Xiaolin Huang, Chih-Jen Lin

TL;DR
Trainable Weight Averaging (TWA) is a new method that learns optimal weight combinations to accelerate training and improve the generalization of deep neural networks, outperforming existing averaging techniques.
Contribution
We introduce TWA, a flexible, trainable weight averaging method that learns optimal weights within a subspace, and develop a distributed framework for large-scale applications.
Findings
TWA outperforms SWA in generalization and flexibility.
Applying TWA during early training reduces training time by over 40% on CIFAR and 30% on ImageNet.
TWA enhances generalization during fine-tuning through weighted checkpoint averaging.
Abstract
Weight averaging is a widely used technique for accelerating training and improving the generalization of deep neural networks (DNNs). Existing approaches such as stochastic weight averaging (SWA) rely on pre-set weighting schemes, which can be suboptimal when the candidate weights are diverse. We introduce Trainable Weight Averaging (TWA), a novel optimization method that operates within a reduced subspace spanned by candidate weights and learns the weighting coefficients by optimization. TWA offers greater flexibility and can be applied to different training scenarios. For large-scale applications, we develop a distributed training framework that combines parallel computation with low-bit compression for the projection matrix, effectively managing memory and computational demands. TWA can be implemented using either training data (TWA-t) or validation data (TWA-v), with the latter…
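To make the core idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear-regression loss and synthetic, hypothetical checkpoints, and it learns one coefficient per checkpoint by plain gradient descent, starting from the equal-weight average that a pre-set scheme would use. The function names (`mse_loss`, `mse_grad`, `learn_averaging_coefficients`) and all data are illustrative assumptions. It does not attempt the projection-matrix handling, distributed framework, or low-bit compression mentioned in the abstract.

```python
import numpy as np


def mse_loss(w, X, y):
    """Mean squared error of a linear model with weights w."""
    r = X @ w - y
    return float(np.mean(r ** 2))


def mse_grad(w, X, y):
    """Gradient of the mean squared error with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def learn_averaging_coefficients(checkpoints, X, y, lr, steps=200):
    """Learn one coefficient per checkpoint by gradient descent.

    The averaged model is w = sum_i alpha_i * w_i, so the search is
    restricted to the subspace spanned by the candidate weights.
    """
    W = np.stack(checkpoints)            # shape (k, d)
    k = W.shape[0]
    alpha = np.full(k, 1.0 / k)          # start from the equal-weight average
    for _ in range(steps):
        w = alpha @ W                    # combined weights, shape (d,)
        g_w = mse_grad(w, X, y)          # gradient in full weight space
        g_alpha = W @ g_w                # chain rule: dL/dalpha_i = <w_i, dL/dw>
        alpha = alpha - lr * g_alpha
    return alpha


# Synthetic problem (assumption: stands in for a real training set).
rng = np.random.default_rng(0)
d, n, k = 10, 200, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

# Hypothetical checkpoints of varying quality along a training run.
checkpoints = [s * w_true + 0.3 * rng.normal(size=d)
               for s in np.linspace(0.2, 1.0, k)]
W = np.stack(checkpoints)

# Step size 1/L, where L is the largest Hessian eigenvalue of this
# quadratic loss in alpha; this guarantees the loss never increases.
A = X @ W.T                              # shape (n, k)
lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 / n)

alpha = learn_averaging_coefficients(checkpoints, X, y, lr=lr)
uniform_loss = mse_loss(W.mean(axis=0), X, y)
learned_loss = mse_loss(alpha @ W, X, y)
print("equal-weight loss:", uniform_loss)
print("learned-weight loss:", learned_loss)
```

Because the optimization starts at the equal-weight average and uses a safe step size, the learned coefficients can only match or lower that average's loss on the data used to fit them. Fitting the coefficients on training data versus held-out data would correspond loosely to the TWA-t and TWA-v variants named in the abstract.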
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Stochastic Weight Averaging · Stochastic Gradient Descent
