ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers

Yiming Wang; Jinyu Li

arXiv:2310.02489·cs.CL·January 9, 2024

TL;DR

ResidualTransformer introduces a weight-sharing and low-rank reparameterization method for Transformer layers, significantly reducing model size on speech tasks with minimal performance loss.

Contribution

The paper proposes a novel residual low-rank learning approach with weight-sharing for Transformer layers, inspired by ResNet and LoRA, to compress models efficiently.

Findings

01. Transformer encoder size reduced by ~3X

02. Achieved minimal performance degradation

03. Effective on large-scale speech tasks

Abstract

The memory constraint of always-on devices is one of the major concerns when deploying speech processing models on these devices. While larger models trained with a sufficiently large amount of data generally perform better, making them fit in device memory is a demanding challenge. In this paper, we aim to reduce model size by reparameterizing model weights across Transformer encoder layers and assuming a special weight composition and structure. More specifically, inspired by ResNet and the more recent LoRA work, we propose an approach named ResidualTransformer, where each weight matrix in a Transformer layer comprises 1) a full-rank component shared with its adjacent layers, and 2) a low-rank component unique to itself. The low-rank matrices account for only a small increase in model size. In addition, we add diagonal weight matrices to improve modeling capacity of the low-rank…
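The weight composition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each layer's effective weight is the sum of a full-rank matrix shared by a group of adjacent layers, a per-layer low-rank product, and a per-layer diagonal term. The function names, the `group_size` parameter, the additive placement of the diagonal, and the LoRA-style zero initialization are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def make_layer_weights(num_layers, d_out, d_in, rank, group_size, seed=0):
    """Build shared full-rank matrices plus per-layer low-rank and diagonal parts.

    All names and initialization choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    num_groups = -(-num_layers // group_size)  # ceiling division
    # One full-rank matrix per group of adjacent layers (the shared component).
    shared = [0.02 * rng.standard_normal((d_out, d_in)) for _ in range(num_groups)]
    layers = []
    for idx in range(num_layers):
        layers.append({
            "group": idx // group_size,                      # which shared matrix to use
            "A": 0.02 * rng.standard_normal((d_out, rank)),  # low-rank factor
            "B": np.zeros((rank, d_in)),                     # zero init (assumption, LoRA-style)
            "diag": np.zeros(min(d_out, d_in)),              # diagonal term (assumption: additive)
        })
    return shared, layers


def effective_weight(shared, layer):
    """Compose W = W_shared + A @ B + diag for one layer."""
    weight = shared[layer["group"]].copy()
    weight += layer["A"] @ layer["B"]
    n = layer["diag"].shape[0]
    weight[np.arange(n), np.arange(n)] += layer["diag"]
    return weight


def count_params(shared, layers):
    """Count stored parameters: shared matrices once, per-layer parts for every layer."""
    total = sum(w.size for w in shared)
    total += sum(l["A"].size + l["B"].size + l["diag"].size for l in layers)
    return total


if __name__ == "__main__":
    # Hypothetical sizes, not the configuration used in the paper.
    shared, layers = make_layer_weights(
        num_layers=18, d_out=512, d_in=512, rank=16, group_size=3
    )
    baseline = 18 * 512 * 512  # one independent full-rank matrix per layer
    print("stored parameters:", count_params(shared, layers))
    print("unshared baseline:", baseline)
```

In this sketch the shared matrix is stored once per group, so the per-layer cost is only the two low-rank factors and the diagonal. How much the total shrinks depends on the assumed rank and group size, so the printed counts should not be read as a reproduction of the paper's reported reduction.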

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Average Pooling · Kaiming Initialization · 1x1 Convolution · Batch Normalization · Convolution · Dense Connections · Linear Layer · Label Smoothing