Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers
Mana Sakai, Masaaki Imaizumi

TL;DR
This paper introduces spectrum-adaptive generalization bounds for trained deep Transformers. The bounds depend on the spectral properties of the weight matrices and can be tuned after training, offering insight into how these models generalize.
Contribution
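The paper derives spectrum-adaptive bounds for multi-layer Transformers whose Schatten indices can be chosen after training, separately for each matrix type and layer. This improves on norm-based bounds that impose fixed constraints specified a priori, because the bounds can follow the spectral profile of the trained weights.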
Findings
Bounds grow more slowly with depth and hidden dimension than norm-based proxies.
Spectral structure of trained Transformers influences their generalization properties.
Empirical results support the effectiveness of spectral proxies over traditional norm-based measures.
Abstract
Understanding why trained Transformers generalize well is a fundamental problem in modern machine learning theory, and complexity-based generalization bounds provide a principled way to study this question. While existing norm-based bounds for Transformers remove the explicit polynomial dependence on the hidden dimension, they typically impose fixed norm constraints specified a priori and can exhibit unfavorable exponential dependence on depth. In this paper, we derive spectrum-adaptive post hoc generalization bounds for multi-layer Transformers. Under layerwise spectral norm control, the bounds are expressed in terms of layerwise Schatten quantities of the query-key, value, and feedforward weight matrices. Since the Schatten indices need not be fixed a priori and can instead be selected after training, separately for each matrix type and layer, the bounds adaptively trade off spectral…
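As a rough illustration of the "Schatten quantities" mentioned in the abstract, the sketch below computes the standard Schatten-p norm of a matrix (the l_p norm of its singular values) for several indices p. It is not the paper's bound or its index-selection rule, which are not reproduced here. The function names `schatten_norm` and `schatten_profile`, the candidate indices, and the synthetic near-low-rank matrix standing in for a trained weight are all assumptions made for this example.

```python
import numpy as np

def schatten_norm(weight, p):
    """Standard Schatten-p norm (p > 0): the l_p norm of the singular values."""
    if p <= 0:
        raise ValueError("p must be positive")
    sigma = np.linalg.svd(weight, compute_uv=False)
    if np.isinf(p):
        return float(sigma.max())  # p = inf gives the spectral norm
    return float((sigma ** p).sum() ** (1.0 / p))

def schatten_profile(weight, indices=(1.0, 2.0, 4.0, np.inf)):
    """Hypothetical helper: Schatten norms of one matrix over candidate indices."""
    return {p: schatten_norm(weight, p) for p in indices}

# Synthetic stand-in for a trained weight matrix (assumption): rank 4 plus small noise.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
W += 0.01 * rng.standard_normal((64, 64))

profile = schatten_profile(W)
for p, value in profile.items():
    print(f"p = {p}: {value:.3f}")
```

For a fixed matrix the Schatten-p norm is non-increasing in p, with p = 1 the nuclear norm, p = 2 the Frobenius norm, and p = inf the spectral norm. Computing such a profile per layer and per matrix type is one plausible way to inspect the spectral decay that a post hoc choice of index could exploit.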
