On Rank-Dependent Generalisation Error Bounds for Transformers

Lan V. Truong

arXiv:2410.11500 · stat.ML · October 16, 2024

Open Access

TL;DR

This paper derives new rank-dependent generalization error bounds for single-layer transformers. The bounds are independent of input sequence length and show the benefit of low-rank matrix constraints.

Contribution

Introduces novel covering number bounds based on matrix rank and applies them to derive improved generalization bounds for transformers.

Findings

01. The error bound decays as O(1/√n) in the sample length n, improving on the previous O((log n)/√n).

02. The error bound depends logarithmically on r_w, the rank of the combination of query and key matrices.

03. The bounds are independent of input sequence length.

Abstract

In this paper, we introduce various covering number bounds for linear function classes, each subject to different constraints on input and matrix norms. These bounds are contingent on the rank of each class of matrices. We then apply these bounds to derive generalization errors for single-layer transformers. Our results improve upon several existing generalization bounds in the literature and are independent of input sequence length, highlighting the advantages of employing low-rank matrices in transformer design. More specifically, our achieved generalisation error bound decays as $O(1/\sqrt{n})$, where $n$ is the sample length, which improves existing results in the research literature of the order $O((\log n)/\sqrt{n})$. It also decays as $O(\log r_{w})$, where $r_{w}$ is the rank of the combination of query and key matrices.
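
To get a feel for the size of the improvement, the two decay rates quoted above can be compared numerically. The following is a minimal sketch, not taken from the paper: it drops every constant and every other factor in the bounds, and the function names `rate_sqrt`, `rate_log_sqrt`, and `rate_ratio` are hypothetical. Under those assumptions, the ratio of the earlier rate to the new one is simply log n.

```python
import math


def rate_sqrt(n: int) -> float:
    """Rate 1/sqrt(n), with all constants dropped (an assumption)."""
    return 1.0 / math.sqrt(n)


def rate_log_sqrt(n: int) -> float:
    """Rate (log n)/sqrt(n), with all constants dropped (an assumption)."""
    return math.log(n) / math.sqrt(n)


def rate_ratio(n: int) -> float:
    """How many times larger the (log n)/sqrt(n) rate is; equals log n."""
    return rate_log_sqrt(n) / rate_sqrt(n)


if __name__ == "__main__":
    # Print both rates and their ratio for a few sample sizes.
    for n in (10**2, 10**4, 10**6):
        print(f"n={n:>8}  1/sqrt(n)={rate_sqrt(n):.6f}  "
              f"(log n)/sqrt(n)={rate_log_sqrt(n):.6f}  "
              f"ratio={rate_ratio(n):.3f}")
```

The ratio grows slowly with n, so the gain from removing the logarithmic factor is modest at small sample sizes and becomes more noticeable as n increases. Actual bounds also carry norm- and rank-dependent factors that this sketch ignores.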

Taxonomy

Topics: Neural Networks and Applications · Matrix Theory and Algorithms · Control Systems and Identification