Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model

Zirui Liu; Guanchu Wang; Shaochen Zhong; Zhaozhuo Xu; Daochen Zha; Ruixiang Tang; Zhimeng Jiang; Kaixiong Zhou; Vipin Chaudhary; Shuai Xu; Xia Hu

arXiv:2305.15265·cs.LG·November 8, 2024·2 cites

TL;DR

This paper introduces WTA-CRS, an unbiased estimator for matrix multiplication that reduces memory usage during transformer fine-tuning, enabling larger batch sizes and faster training with minimal accuracy loss.

Contribution

The paper proposes WTA-CRS, a novel unbiased estimator for matrix multiplication that reduces variance and memory consumption in transformer training.

Findings

01. Achieves up to 2.7× memory reduction with minimal accuracy loss.

02. Enables up to 6.4× larger batch sizes during training.

03. Improves downstream task performance with larger models and faster training.

Abstract

With the rapid growth in model size, fine-tuning large pre-trained language models has become increasingly difficult due to their extensive memory usage. Previous works usually focus on reducing the number of trainable parameters in the network. While the model parameters do contribute to memory usage, the primary memory bottleneck during training arises from storing feature maps, also known as activations, as they are crucial for gradient calculation. Notably, neural networks are usually trained using stochastic gradient descent. We argue that in stochastic optimization, models can handle noisy gradients as long as the gradient estimator is unbiased with reasonable variance. Following this motivation, we propose a new family of unbiased estimators called WTA-CRS, for matrix multiplication with reduced variance, which only requires storing the sub-sampled activations for calculating the…
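
To make the idea of an unbiased, sub-sampled matrix product concrete, here is a minimal sketch of plain column-row sampling (CRS), the classical estimator that the WTA-CRS name refers to. It is not the WTA-CRS estimator itself, whose details are not given on this page. The function name `crs_matmul`, the budget `k`, and the norm-proportional sampling probabilities are assumptions chosen for illustration.

```python
import numpy as np

def crs_matmul(X, Y, k, rng):
    """Estimate X @ Y from k sampled column-row pairs (plain CRS, not WTA-CRS)."""
    # Assumed sampling rule: p_i proportional to ||X[:, i]|| * ||Y[i, :]||.
    norms = np.linalg.norm(X, axis=0) * np.linalg.norm(Y, axis=1)
    p = norms / norms.sum()
    # Draw k column-row indices with replacement.
    idx = rng.choice(X.shape[1], size=k, replace=True, p=p)
    # Scaling each kept pair by 1 / (k * p_i) makes the expectation equal X @ Y.
    scale = 1.0 / (k * p[idx])
    return (X[:, idx] * scale) @ Y[idx, :]

# Illustrative usage on random data (shapes are arbitrary).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64))
Y = rng.standard_normal((64, 3))
approx = crs_matmul(X, Y, k=16, rng=rng)  # noisy estimate of X @ Y
```

Only the `k` sampled columns of `X` (and matching rows of `Y`) enter the product, which is why an estimator of this kind can get by with storing a sub-sampled activation instead of the full one.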

Code & Models

Repositories

zirui-ray-liu/wtacrs · jax · Official

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Focus