Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It
Marvin F. da Silva, Felix Dangel, Sageev Oore

TL;DR
This paper introduces a symmetry-aware sharpness measure for transformers, built on Riemannian geometry, that correlates better with generalization than previous sharpness measures.
Contribution
The paper redefines sharpness on the quotient manifold obtained by quotienting out transformer symmetries, which removes the ambiguity those symmetries introduce into existing sharpness measures.
Findings
Geodesic sharpness correlates strongly with generalization in transformers.
Higher-order approximations of geodesics improve sharpness's predictive power.
The method applies to both synthetic and real-world transformer models.
Abstract
The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers, because they have much richer symmetries in their attention mechanism that induce directions in parameter space along which the network or its loss remain identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the…
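The symmetry argument in the abstract can be made concrete with a small numerical check. The sketch below is an illustration, not the paper's code: it assumes a single attention head with hypothetical query and key matrices `W_q` and `W_k`, and uses one well-known attention symmetry, namely that multiplying `W_q` by an invertible matrix `A` and `W_k` by the inverse transpose of `A` leaves the product `W_q W_k^T`, and therefore the attention weights, unchanged. All names, shapes, and the choice of `A` are assumptions made for the example.

```python
import numpy as np


def attention_scores(X, W_q, W_k):
    """Softmax attention weights for a single head (no masking)."""
    d_head = W_q.shape[1]
    logits = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_head)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4  # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))

# Build a well-conditioned invertible A = Q @ diag(scales), Q orthogonal.
Q, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))
scales = np.array([0.5, 1.0, 2.0, 4.0])
A = Q * scales  # scales the columns of Q

# Move along the symmetry direction:
# (W_q A)(W_k A^{-T})^T = W_q A A^{-1} W_k^T = W_q W_k^T.
W_q_new = W_q @ A
W_k_new = W_k @ np.linalg.inv(A).T

same_function = np.allclose(
    attention_scores(X, W_q, W_k),
    attention_scores(X, W_q_new, W_k_new),
)
params_moved = not np.allclose(W_q, W_q_new)

print("attention weights unchanged:", same_function)
print("parameters changed:", params_moved)
```

Both parameter points implement the same attention map, so any quantity that is meant to predict generalization should assign them the same value. A sharpness measure computed over a Euclidean ball in raw parameter coordinates has no such guarantee, which is the ambiguity the quotient construction is intended to remove.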
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception · Face recognition and analysis
Methods: Softmax · Attention Is All You Need
