Trainable Weight Averaging: Accelerating Training and Improving Generalization

Tao Li; Zhehao Huang; Yingwen Wu; Zhengbao He; Qinghua Tao; Xiaolin Huang; Chih-Jen Lin

arXiv:2205.13104 · cs.LG · February 11, 2025

TL;DR

Trainable Weight Averaging (TWA) is a new method that learns optimal weight combinations to accelerate training and improve the generalization of deep neural networks, outperforming existing averaging techniques.

Contribution

We introduce TWA, a flexible, trainable weight averaging method that learns optimal weights within a subspace, and develop a distributed framework for large-scale applications.

Findings

01  TWA outperforms SWA in generalization and flexibility.

02  Applying TWA during early training reduces training time by over 40% on CIFAR and 30% on ImageNet.

03  TWA enhances generalization during fine-tuning through weighted checkpoint averaging.

Abstract

Weight averaging is a widely used technique for accelerating training and improving the generalization of deep neural networks (DNNs). While existing approaches like stochastic weight averaging (SWA) rely on pre-set weighting schemes, they can be suboptimal when handling diverse weights. We introduce Trainable Weight Averaging (TWA), a novel optimization method that operates within a reduced subspace spanned by candidate weights and learns optimal weighting coefficients through optimization. TWA offers greater flexibility and can be applied to different training scenarios. For large-scale applications, we develop a distributed training framework that combines parallel computation with low-bit compression for the projection matrix, effectively managing memory and computational demands. TWA can be implemented using either training data (TWA-t) or validation data (TWA-v), with the latter…
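The core idea in the abstract, learning the weighting coefficients over candidate weights instead of fixing them in advance, can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes a linear least-squares model, simulated "checkpoints" (the true weights plus noise of varying scale), full-batch gradient descent, and direct optimization of the coefficients without the projection matrix, distributed framework, or low-bit compression the paper describes. All names (`combine`, `mse_loss`, `learn_coefficients`) are hypothetical.

```python
import numpy as np


def combine(candidates, alpha):
    """Weighted combination of candidate weight vectors (rows of `candidates`)."""
    return alpha @ candidates


def mse_loss(w, X, y):
    """Mean squared error of the linear model X @ w against targets y."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))


def learn_coefficients(candidates, X, y, steps=500):
    """Learn averaging coefficients by gradient descent on the data loss.

    Assumption for this sketch: the model is linear, so the loss is a convex
    quadratic in the coefficients and a fixed step size of 1/L (L = largest
    Hessian eigenvalue) gives a monotonically non-increasing loss.
    """
    k = candidates.shape[0]
    n = len(y)
    alpha = np.full(k, 1.0 / k)  # start from the uniform (SWA-style) average
    P = X @ candidates.T         # P[:, i] holds the predictions of candidate i
    lipschitz = 2.0 * np.linalg.norm(P, 2) ** 2 / n
    lr = 1.0 / lipschitz
    for _ in range(steps):
        residual = P @ alpha - y
        grad = 2.0 * P.T @ residual / n
        alpha = alpha - lr * grad
    return alpha


# Toy data: a linear regression task with simulated, diverse checkpoints.
rng = np.random.default_rng(0)
d, n, k = 10, 200, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

# Earlier "checkpoints" are noisier than later ones (an assumption of this toy).
noise_scales = np.linspace(2.0, 0.2, k)
candidates = np.stack([w_true + s * rng.normal(size=d) for s in noise_scales])

alpha_uniform = np.full(k, 1.0 / k)
alpha_learned = learn_coefficients(candidates, X, y)

uniform_loss = mse_loss(combine(candidates, alpha_uniform), X, y)
learned_loss = mse_loss(combine(candidates, alpha_learned), X, y)
print("uniform-average loss:", uniform_loss)
print("learned-average loss:", learned_loss)
print("learned coefficients:", alpha_learned)
```

Because the coefficients start at the uniform average and each step cannot increase the loss in this convex toy setting, the learned combination is never worse than the uniform one on the data used for fitting. Fitting on training data versus held-out data would loosely mirror the TWA-t and TWA-v variants named in the abstract.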

Code & Models

Repositories

nblt/twa (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Stochastic Weight Averaging · Stochastic Gradient Descent