Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis
Hongkang Li, Meng Wang, Shuai Zhang, Sijia Liu, Pin-Yu Chen

TL;DR
This paper provides the first theoretical analysis showing that trained one-layer Transformers exhibit low-rank and sparse properties, explaining the effectiveness of pruning and low-rank adaptation in large models.
Contribution
It characterizes the low-rank and sparsity properties of one-layer Transformers after training, offering theoretical insights into pruning and adaptation methods.
Findings
Gradient updates of the trainable parameters are low-rank, with the rank depending on the number of label-relevant patterns.
Proper magnitude-based pruning has minimal impact on test performance.
Numerical experiments support the theoretical analysis.
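The two properties named in the findings above can be illustrated with a small toy sketch. This is not the paper's data model, architecture, or experimental setup: the dimensions, the random "patterns", the helper names `numerical_rank` and `magnitude_prune`, and the entry-wise pruning rule are all assumptions made for illustration. The sketch only shows (a) that an update built from a few pattern directions has rank bounded by the number of patterns, and (b) what pruning the smallest-magnitude weights looks like.

```python
import numpy as np


def numerical_rank(matrix, tol=1e-8):
    """Count singular values larger than tol times the largest one."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    if singular_values.size == 0 or singular_values[0] == 0:
        return 0
    return int(np.sum(singular_values > tol * singular_values[0]))


def magnitude_prune(weights, fraction):
    """Return a copy with the smallest-magnitude `fraction` of entries zeroed.

    Entry-wise pruning is an assumption of this sketch; the paper's exact
    pruning rule may differ.
    """
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return weights.copy()
    # argpartition places the indices of the k smallest magnitudes first.
    smallest = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(flat.size, dtype=bool)
    mask[smallest] = False
    return weights * mask.reshape(weights.shape)


rng = np.random.default_rng(0)

# (a) Toy "gradient update": a sum of outer products, one per pattern.
dim, num_patterns = 16, 2  # hypothetical sizes
patterns = rng.standard_normal((num_patterns, dim))
coefficients = rng.standard_normal((num_patterns, dim))
update = coefficients.T @ patterns  # dim x dim, rank <= num_patterns
update_rank = numerical_rank(update)

# (b) Toy magnitude pruning of a small weight matrix.
weights = rng.standard_normal((4, 4))
pruned = magnitude_prune(weights, fraction=0.5)
remaining = int(np.count_nonzero(pruned))

print("rank of toy update:", update_rank, "out of dimension", dim)
print("nonzero weights after pruning:", remaining, "of", weights.size)
```

The rank bound in (a) holds by construction, since a sum of `num_patterns` outer products cannot exceed rank `num_patterns`. Whether pruning as in (b) preserves test performance is the paper's theoretical claim and is not demonstrated by this sketch.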
Abstract
Efficient training and inference algorithms, such as low-rank adaption and model pruning, have shown impressive performance for learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex optimization caused by the complicated architecture of Transformers, the theoretical study of why these methods can be applied to learn Transformers is mostly elusive. To the best of our knowledge, this paper shows the first theoretical analysis of the property of low-rank and sparsity of one-layer Transformers by characterizing the trained model after convergence using stochastic gradient descent. By focusing on a data model based on label-relevant and label-irrelevant patterns, we quantify that the gradient updates of trainable parameters are low-rank, which depends on the number of label-relevant patterns. We also analyze how model pruning affects…
Taxonomy
Topics: Neural Networks and Applications
Methods: Pruning
