Estimating the Effective Rank of Vision Transformers via Low-Rank Factorization
Liyu Zerihun

TL;DR
This paper proposes a method to estimate the intrinsic low-rank structure of vision transformers by training low-rank approximations and analyzing their performance, revealing the effective rank and knee point for model compression.
Contribution
It introduces a novel framework for estimating a model's intrinsic dimensionality using low-rank factorization and distillation, applicable across architectures and datasets.
Findings
Effective rank region for ViT-B/32 on CIFAR-100 is approximately [16, 34].
At rank 32, the model achieves about 95% of teacher accuracy with fewer parameters.
The framework provides a practical tool for characterizing the intrinsic dimensionality of deep models.
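To make "fewer parameters" concrete, the following is a minimal sketch of the parameter arithmetic for one factorized linear layer. It assumes each d_out x d_in weight matrix is replaced by the product of a d_out x r matrix and an r x d_in matrix, with biases ignored. It uses the standard ViT-B layer widths (hidden size 768, MLP width 3072). Which layers the paper actually factorizes is not stated in this summary, and the function names are illustrative, not taken from the paper.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameter count of a full-rank d_out x d_in weight matrix (no bias)."""
    return d_in * d_out


def factorized_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count when W is replaced by A @ B, A: d_out x r, B: r x d_in."""
    return rank * (d_in + d_out)


def break_even_rank(d_in: int, d_out: int) -> float:
    """Factorization only saves parameters for ranks strictly below this value."""
    return d_in * d_out / (d_in + d_out)


# Assumed ViT-B widths: hidden size 768, MLP width 3072.
HIDDEN, MLP = 768, 3072
RANK = 32

# Fraction of the original parameters kept by a rank-32 factorization.
attn_ratio = factorized_params(HIDDEN, HIDDEN, RANK) / dense_params(HIDDEN, HIDDEN)
mlp_ratio = factorized_params(HIDDEN, MLP, RANK) / dense_params(HIDDEN, MLP)

print(f"768x768 projection kept fraction:  {attn_ratio:.4f}")
print(f"768x3072 MLP layer kept fraction:  {mlp_ratio:.4f}")
print(f"break-even rank for 768x768:       {break_even_rank(HIDDEN, HIDDEN):.0f}")
```

Under these assumptions, a rank-32 factor pair stores only a small fraction of each layer's original weights. The whole-model saving depends on which layers are factorized and on the parts left unfactorized, such as embeddings, norms, and the classification head.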
Abstract
Deep networks are heavily over-parameterized, yet their learned representations often admit low-rank structure. We introduce a framework for estimating a model's intrinsic dimensionality by treating learned representations as projections onto a low-rank subspace of the model's full capacity. Our approach: train a full-rank teacher, factorize its weights at multiple ranks, and train each factorized student via distillation to measure performance as a function of rank. We define effective rank as a region, not a point: the smallest contiguous set of ranks for which the student reaches 85-95% of teacher accuracy. To stabilize estimates, we fit accuracy vs. rank with a monotone PCHIP interpolant and identify crossings of the normalized curve. We also define the effective knee as the rank maximizing perpendicular distance between the smoothed accuracy curve and its endpoint secant; an…
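The two quantities defined in the abstract can be sketched in a few lines. This is a simplified illustration, not the paper's implementation. It uses piecewise-linear interpolation as a stand-in for the monotone PCHIP interpolant, it reads the effective rank region as the interval between the 85% and 95% crossings, and it min-max normalizes both axes before measuring distance to the secant. All three choices are assumptions, and the accuracy values are synthetic numbers made up for the example, not results from the paper.

```python
import math


def crossing(ranks, norm_acc, level):
    """Smallest rank where the piecewise-linear curve first reaches `level`.

    Linear interpolation is used here as a simple stand-in for PCHIP.
    Returns None if the curve never reaches the level.
    """
    if norm_acc[0] >= level:
        return float(ranks[0])
    for i in range(1, len(ranks)):
        a0, a1 = norm_acc[i - 1], norm_acc[i]
        if a0 < level <= a1:
            t = (level - a0) / (a1 - a0)
            return ranks[i - 1] + t * (ranks[i] - ranks[i - 1])
    return None


def effective_rank_region(ranks, student_acc, teacher_acc, lo=0.85, hi=0.95):
    """Ranks at which student accuracy crosses lo and hi times teacher accuracy."""
    norm_acc = [a / teacher_acc for a in student_acc]
    return crossing(ranks, norm_acc, lo), crossing(ranks, norm_acc, hi)


def effective_knee(ranks, acc):
    """Sampled rank with the largest perpendicular distance to the endpoint secant.

    Both axes are min-max normalized to [0, 1] first (an assumption), so the
    result does not depend on the units of rank or accuracy.
    """
    xs = [(r - ranks[0]) / (ranks[-1] - ranks[0]) for r in ranks]
    a_min, a_max = min(acc), max(acc)
    ys = [(a - a_min) / (a_max - a_min) for a in acc]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    length = math.hypot(dx, dy)
    dists = [abs(dy * (x - xs[0]) - dx * (y - ys[0])) / length
             for x, y in zip(xs, ys)]
    return ranks[dists.index(max(dists))]


# Synthetic, illustrative values only (percent accuracy); not from the paper.
ranks = [4, 8, 16, 32, 64]
student_acc = [40.0, 60.0, 66.0, 74.0, 78.0]
teacher_acc = 80.0

region = effective_rank_region(ranks, student_acc, teacher_acc)
knee_rank = effective_knee(ranks, student_acc)
print("effective rank region:", region)
print("effective knee:", knee_rank)
```

In this sketch the knee is picked from the sampled ranks only. Evaluating a smoothed curve on a dense grid of ranks, as the abstract describes, would allow the knee to fall between sampled values.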
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
