Weight decay induces low-rank attention layers
Seijin Kobayashi, Yassir Akram, Johannes Von Oswald

TL;DR
This paper investigates how weight decay shapes the rank of the parameter matrices in attention layers. It gives theoretical arguments and empirical evidence that the regularizer pushes these matrices toward low rank, which in turn can hurt model performance.
Contribution
The paper extends the theory linking weight decay to low-rank regularization for parameter matrices that are multiplied together, as in attention layers, and demonstrates the effect empirically when training neural networks.
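The link between weight decay and low rank can be seen on a toy problem. The sketch below is not the paper's experiment: it runs plain gradient descent on a factorized least-squares loss with an L2 penalty on the factors and compares the singular values of the learned product with singular-value soft-thresholding of the target. The target `M`, the coefficient `lam`, the learning rate, and the step count are all hypothetical choices. With the unhalved squared Frobenius norms used here, the matching soft-threshold is `2 * lam`; the paper's normalization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, lr, steps = 4, 0.25, 0.05, 20000  # hypothetical toy settings

# Hypothetical full-rank target with known singular values.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma = np.array([3.0, 2.0, 0.3, 0.1])
M = U @ np.diag(sigma) @ V.T

# Small random initialization of the two factors.
A = 0.1 * rng.standard_normal((d, d))
B = 0.1 * rng.standard_normal((d, d))

# Minimize 0.5 * ||A B^T - M||_F^2 + lam * (||A||_F^2 + ||B||_F^2).
for _ in range(steps):
    R = A @ B.T - M                 # residual of the product
    grad_A = R @ B + 2 * lam * A    # data term + L2 penalty on A
    grad_B = R.T @ A + 2 * lam * B  # data term + L2 penalty on B
    A, B = A - lr * grad_A, B - lr * grad_B

learned = np.linalg.svd(A @ B.T, compute_uv=False)
# Nuclear-norm solution: soft-threshold the target's singular values.
predicted = np.maximum(sigma - 2 * lam, 0.0)

print("learned singular values:  ", np.round(learned, 4))
print("soft-threshold prediction:", predicted)
```

The two smallest singular values of the target fall below the threshold, so the regularized product ends up with rank 2 even though both factors are full-size square matrices.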
Findings
Weight decay induces low-rank structure in attention matrices.
Low-rank attention matrices can impair language model performance.
Decoupling weight decay in attention layers can improve training outcomes.
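One way to read "decoupling" in the last finding is to give attention parameters their own weight decay coefficient. The sketch below is an assumption about how that could be set up, not the paper's code: it sorts parameter names into two groups whose layout mirrors the `params` / `weight_decay` dictionaries accepted by PyTorch optimizers. The function name, the name markers, and the coefficient values are hypothetical.

```python
def split_weight_decay_groups(
    param_names,
    attn_markers=("q_proj", "k_proj", "v_proj", "out_proj"),  # hypothetical naming scheme
    attn_weight_decay=0.0,      # hypothetical value, not taken from the paper
    default_weight_decay=0.1,   # hypothetical value, not taken from the paper
):
    """Sort parameter names into attention and non-attention groups."""
    attn, rest = [], []
    for name in param_names:
        is_attn = any(marker in name for marker in attn_markers)
        (attn if is_attn else rest).append(name)
    return [
        {"params": attn, "weight_decay": attn_weight_decay},
        {"params": rest, "weight_decay": default_weight_decay},
    ]


names = [
    "layers.0.attn.q_proj.weight",
    "layers.0.attn.k_proj.weight",
    "layers.0.attn.v_proj.weight",
    "layers.0.attn.out_proj.weight",
    "layers.0.mlp.fc1.weight",
    "layers.0.mlp.fc2.weight",
]
groups = split_weight_decay_groups(names)
for group in groups:
    print(group["weight_decay"], group["params"])
```

In a real training script the lists would hold the parameter tensors rather than their names, and the two coefficients would be tuned separately.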
Abstract
The effect of regularizers such as weight decay when training deep neural networks is not well understood. We study the influence of weight decay as well as $L_2$-regularization when training neural network models in which parameter matrices interact multiplicatively. This combination is of particular interest as this parametrization is common in attention layers, the workhorse of transformers. Here, key-query, as well as value-projection parameter matrices, are multiplied directly with each other: $W_K^\top W_Q$ and $P W_V$. We extend previous results and show on one hand that any local minimum of an $L_2$-regularized loss of the form $L(AB^\top) + \lambda(\|A\|^2 + \|B\|^2)$ coincides with a minimum of the nuclear norm-regularized loss $L(AB^\top) + \lambda\|AB^\top\|_*$, and on the other hand that the 2 losses become identical exponentially quickly during training. We thus complement…
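The connection between an L2 penalty on two factors and the nuclear norm of their product rests on a standard identity: over all factorizations of a matrix W as A Bᵀ, the smallest value of ½(‖A‖² + ‖B‖²) in Frobenius norm is the nuclear norm of W, attained by a balanced factorization built from the SVD. The sketch below checks this numerically on a random matrix. The matrix, the rescaling constant `c`, and the variable names are hypothetical, and the factor of ½ is this sketch's convention, which may differ from the paper's normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 5x4 matrix of rank at most 3.
W = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
nuclear_norm = s.sum()  # sum of singular values

# Balanced factorization: split each singular value evenly between the factors.
A = U * np.sqrt(s)       # scales the columns of U
B = Vt.T * np.sqrt(s)    # scales the columns of V
assert np.allclose(A @ B.T, W)

balanced_penalty = 0.5 * (np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2)

# Rescaling the factors leaves the product unchanged but raises the penalty.
c = 3.0
A_scaled, B_scaled = c * A, B / c
assert np.allclose(A_scaled @ B_scaled.T, W)
unbalanced_penalty = 0.5 * (
    np.linalg.norm(A_scaled) ** 2 + np.linalg.norm(B_scaled) ** 2
)

print("nuclear norm:      ", nuclear_norm)
print("balanced penalty:  ", balanced_penalty)
print("unbalanced penalty:", unbalanced_penalty)
```

The balanced penalty matches the nuclear norm, while the rescaled factors represent the same product at a strictly higher L2 cost, which is why minimizing the L2 penalty on the factors acts like a nuclear-norm penalty on the product.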
Taxonomy
Topics: Functional Brain Connectivity Studies · Neural dynamics and brain function · Visual Attention and Saliency Detection
Methods: Softmax · Attention Is All You Need · Weight Decay
