Learning Diverse Features in Vision Transformers for Improved Generalization
Armand Mihai Nicolicioiu, Andrei Liviu Nicolicioiu, Bogdan Alexe, Damien Teney

TL;DR
This paper investigates how vision transformers learn features, identifies the role of attention heads in capturing robust and spurious signals, and proposes methods to improve generalization by promoting feature diversity and pruning spurious heads.
Contribution
It introduces a technique to enhance feature diversity in ViTs by encouraging orthogonality of the attention heads' input gradients, and demonstrates improved out-of-distribution performance on diagnostic benchmarks (MNIST-CIFAR, Waterbirds).
Findings
Pruning the attention heads that capture spurious features at test time improves performance under distribution shift.
Encouraging orthogonality of the attention heads' input gradients increases feature diversity.
Enhanced feature diversity leads to better OOD generalization.
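The head-pruning finding can be illustrated with a minimal sketch. The abstract only states that heads are chosen by an "oracle selection" on validation data; the greedy search below, the names greedy_head_pruning and toy_evaluate, and the toy scoring rule are all assumptions made for illustration, not the authors' procedure.

```python
from typing import Callable, List


def greedy_head_pruning(
    num_heads: int,
    evaluate: Callable[[List[bool]], float],
) -> List[bool]:
    """Greedily disable heads while the validation score keeps improving.

    mask[i] is True when head i is kept. `evaluate` is a hypothetical
    callback that returns a validation score for a given mask.
    """
    mask = [True] * num_heads
    best = evaluate(mask)
    improved = True
    while improved:
        improved = False
        best_head = None
        for h in range(num_heads):
            if not mask[h]:
                continue  # head already pruned
            trial = mask.copy()
            trial[h] = False
            score = evaluate(trial)
            if score > best:
                best, best_head = score, h
                improved = True
        if best_head is not None:
            mask[best_head] = False  # commit the single best removal
    return mask


# Toy stand-in for validation accuracy (assumption): heads 1 and 3 are
# "spurious" and hurt the score, heads 0 and 2 are "robust" and help it.
SPURIOUS = {1, 3}


def toy_evaluate(mask: List[bool]) -> float:
    kept = [i for i, keep in enumerate(mask) if keep]
    robust = sum(1 for i in kept if i not in SPURIOUS)
    spurious = sum(1 for i in kept if i in SPURIOUS)
    return float(10 * robust - 15 * spurious)


final_mask = greedy_head_pruning(4, toy_evaluate)
```

In a real ViT, the evaluate callback would run the model on held-out data with the masked heads' outputs zeroed; the toy scorer here only stands in for that step.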
Abstract
Deep learning models often rely only on a small set of features even when there is a rich set of predictive signals in the training data. This makes models brittle and sensitive to distribution shifts. In this work, we first examine vision transformers (ViTs) and find that they tend to extract robust and spurious features with distinct attention heads. As a result of this modularity, their performance under distribution shifts can be significantly improved at test time by pruning heads corresponding to spurious features, which we demonstrate using an "oracle selection" on validation data. Second, we propose a method to further enhance the diversity and complementarity of the learned features by encouraging orthogonality of the attention heads' input gradients. We observe improved out-of-distribution performance on diagnostic benchmarks (MNIST-CIFAR, Waterbirds) as a consequence of the…
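As a rough illustration of the second idea in the abstract, the sketch below scores how far a set of per-head input gradients is from being mutually orthogonal. The exact loss is not given in this summary, so the mean squared cosine similarity used here, and the names cosine and orthogonality_penalty, are assumptions. The gradients are passed in as plain lists of floats; in practice they would come from an automatic differentiation framework.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def orthogonality_penalty(head_grads: List[Sequence[float]]) -> float:
    """Mean squared cosine similarity over all pairs of head gradients.

    Returns 0.0 when every pair is orthogonal and 1.0 when every pair is
    parallel. This form of the penalty is an assumption, not the paper's
    stated loss.
    """
    pairs = list(combinations(head_grads, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) ** 2 for u, v in pairs) / len(pairs)


# Two heads whose flattened input gradients are orthogonal vs. parallel.
orthogonal = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])
parallel = orthogonality_penalty([[1.0, 0.0], [2.0, 0.0]])
```

Adding such a term to the training loss would push different heads to be sensitive to different input directions, which is the intuition behind the diversity claim; the weighting against the task loss is left unspecified here.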
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Neural Networks and Applications
Methods: Pruning
