Small Singular Values Matter: A Random Matrix Analysis of Transformer Models
Max Staats, Matthias Thamm, Bernd Rosenow

TL;DR
This paper investigates the spectral properties of weight matrices in transformer models using Random Matrix Theory, revealing that small singular values carry significant information and impact model performance and compression strategies.
Contribution
It introduces a novel analysis of small singular values in transformers, showing their importance and providing a theoretical model and practical insights for model pruning.
Findings
Small singular values deviate from RMT predictions, indicating learned information.
Zeroing out small singular values increases perplexity more than removing large ones.
Fine-tuning can make small singular values highly influential.
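The ablation behind these findings can be sketched in a few lines: decompose a weight matrix, zero a chosen slice of its singular values, and rebuild it. This is a minimal illustration, not the authors' code. The helper name `zero_singular_values`, the random stand-in matrix, and the 10% slice size are all assumptions, and the perplexity evaluation that would follow in a real experiment is omitted.

```python
import numpy as np

def zero_singular_values(W, start, stop):
    """Return a copy of W with singular values ranked start..stop-1 set to zero.

    Ranks follow NumPy's convention: index 0 is the largest singular value.
    (Hypothetical helper for illustration only.)
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = s.copy()
    s[start:stop] = 0.0
    # Scale the columns of U by the edited spectrum, then recombine.
    return (U * s) @ Vt

# Stand-in for a trained weight matrix (assumption: random data, not a real model).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
r = min(W.shape)
k = r // 10  # size of one "decile" of the spectrum

W_without_smallest = zero_singular_values(W, r - k, r)
W_without_largest = zero_singular_values(W, 0, k)
```

In an actual experiment, each edited matrix would be written back into the model and perplexity measured on held-out text. The reconstruction error of the matrix alone says little here, since the point of the paper is that small singular values can matter more than their magnitude suggests.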
Abstract
This work analyzes singular-value spectra of weight matrices in pretrained transformer models to understand how information is stored at both ends of the spectrum. Using Random Matrix Theory (RMT) as a zero-information hypothesis, we interpret agreement with RMT as evidence of randomness and deviations as evidence of learning. Surprisingly, we observe pronounced departures from RMT not only among the largest singular values -- the usual outliers -- but also among the smallest ones. A comparison of the associated singular vectors with the eigenvectors of the activation covariance matrices shows that there is considerable overlap wherever RMT is violated. Thus, significant directions in the data are captured by small singular values and their vectors as well as by the large ones. We confirm this empirically: zeroing out the singular values that deviate from RMT raises language-model…
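One common way to make "agreement with RMT" concrete is the Marchenko-Pastur law: for an n x m matrix (n >= m) with i.i.d. zero-mean entries of standard deviation sigma, the singular values concentrate between sigma(sqrt(n) - sqrt(m)) and sigma(sqrt(n) + sqrt(m)) as the matrix grows. The sketch below counts singular values outside that range. It assumes this particular null model, a known `sigma`, and an arbitrary 10% tolerance for finite-size effects; the summary above does not say which RMT fit the authors use, and the function names are hypothetical.

```python
import numpy as np

def mp_singular_value_edges(shape, sigma):
    """Asymptotic Marchenko-Pastur bulk edges for the singular values of an
    i.i.d. matrix of the given shape with entry standard deviation sigma."""
    n, m = max(shape), min(shape)
    lower = sigma * (np.sqrt(n) - np.sqrt(m))
    upper = sigma * (np.sqrt(n) + np.sqrt(m))
    return lower, upper

def count_outliers(W, sigma, tol=0.1):
    """Count singular values below and above the bulk.

    tol is an assumed relative margin that absorbs finite-size fluctuations.
    """
    lower, upper = mp_singular_value_edges(W.shape, sigma)
    s = np.linalg.svd(W, compute_uv=False)
    n_below = int(np.sum(s < (1.0 - tol) * lower))
    n_above = int(np.sum(s > (1.0 + tol) * upper))
    return n_below, n_above

# Toy data (assumption): pure noise, and noise plus one planted rank-one direction.
rng = np.random.default_rng(0)
W_random = rng.standard_normal((512, 128))
u = rng.standard_normal(512)
u /= np.linalg.norm(u)
v = rng.standard_normal(128)
v /= np.linalg.norm(v)
W_spiked = W_random + 100.0 * np.outer(u, v)

print(count_outliers(W_random, sigma=1.0))
print(count_outliers(W_spiked, sigma=1.0))
```

The planted direction stands in for a learned feature at the large end of the spectrum. The abstract's point is that trained transformer weights also show outliers at the small end, which a one-sided check on the largest singular values would miss.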
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Dropout · Dense Connections · Layer Normalization · Residual Connection · Linear Warmup With Linear Decay · Weight Decay · Adam · Attention Dropout
