On Rank-Dependent Generalisation Error Bounds for Transformers
Lan V. Truong

TL;DR
This paper derives new rank-dependent generalization error bounds for single-layer transformers, showing that low-rank matrix constraints can improve error decay rates and that the resulting bounds do not depend on input sequence length.
Contribution
Introduces novel covering number bounds based on matrix rank and applies them to derive improved generalization bounds for transformers.
Findings
Error bound decays as O(1/√n), better than previous O((log n)/√n)
Error bound depends logarithmically on matrix rank r_w
Bounds are independent of input sequence length
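To make the gap between the two rates concrete, the following is a minimal numerical sketch, not taken from the paper. It compares only the leading terms 1/√n and (log n)/√n and ignores every constant, norm factor, and the rank term that the actual bounds contain. The function names `rate_new`, `rate_prev`, and `rate_ratio` are hypothetical.

```python
import math


def rate_new(n: int) -> float:
    """Leading term 1/sqrt(n) of the improved rate (constants omitted)."""
    return 1.0 / math.sqrt(n)


def rate_prev(n: int) -> float:
    """Leading term log(n)/sqrt(n) of the earlier rate (constants omitted)."""
    return math.log(n) / math.sqrt(n)


def rate_ratio(n: int) -> float:
    """Ratio of the earlier term to the improved term; algebraically log(n)."""
    return rate_prev(n) / rate_new(n)


if __name__ == "__main__":
    # Print both leading terms and their ratio for a few sample sizes.
    for n in (10**2, 10**4, 10**6):
        print(f"n={n:>8d}  1/sqrt(n)={rate_new(n):.6f}  "
              f"log(n)/sqrt(n)={rate_prev(n):.6f}  ratio={rate_ratio(n):.3f}")
```

Under these simplifications the ratio between the two terms is exactly log n, so the improvement grows slowly but without bound as the sample size increases.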
Abstract
In this paper, we introduce various covering number bounds for linear function classes, each subject to different constraints on input and matrix norms. These bounds are contingent on the rank of each class of matrices. We then apply these bounds to derive generalization errors for single-layer transformers. Our results improve upon several existing generalization bounds in the literature and are independent of input sequence length, highlighting the advantages of employing low-rank matrices in transformer design. More specifically, our achieved generalisation error bound decays as O(1/√n), where n is the sample length, which improves on existing results in the research literature of the order O((log n)/√n). It also scales as O(log r_w), where r_w is the rank of the combination of the query and key matrices.
Taxonomy
Topics: Neural Networks and Applications · Matrix Theory and Algorithms · Control Systems and Identification
