Adaptive Attention Link-based Regularization for Vision Transformers
Heegon Jin, Jongwon Choi

TL;DR
This paper introduces an adaptive regularization method using trainable links between CNN and ViT attention mechanisms, enhancing training efficiency and performance of Vision Transformers with limited data.
Contribution
It proposes a novel attention augmentation module that improves ViT training efficiency and performance, especially with small datasets, by leveraging CNN attention relationships.
Findings
Improves ViT performance with limited data
Accelerates convergence during training
Reduces overfitting in Vision Transformers
Abstract
Although transformer networks have recently been employed in various vision tasks with strong performance, extensive training data and lengthy training times are required to train a model that disregards inductive bias. Using trainable links between the channel-wise spatial attention of a pre-trained Convolutional Neural Network (CNN) and the attention heads of a Vision Transformer (ViT), we present a regularization technique to improve the training efficiency of ViT. The trainable links are referred to as the attention augmentation module, which is trained simultaneously with ViT, boosting the training of ViT and allowing it to avoid the overfitting caused by a lack of data. From the trained attention augmentation module, we can extract the relevant relationship between each CNN activation map and each ViT attention head, and based on this, we also propose an advanced attention…
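The abstract does not give the exact formulation, so the following is only a rough sketch of what a trainable link between CNN channel attention and ViT head attention could look like. Everything here is an assumption made for illustration: the function names, the softmax normalization of each CNN channel into a spatial distribution, the per-head softmax over link logits, and the squared-error penalty are hypothetical and are not taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(activation_map):
    """Assumption: one CNN channel, flattened to N positions, becomes a
    distribution over positions via softmax."""
    return softmax(activation_map)


def link_regularizer(cnn_maps, vit_head_attn, link_logits):
    """Hypothetical regularization term, not the authors' loss.

    cnn_maps:      C lists of N floats (pre-trained CNN activation maps)
    vit_head_attn: H lists of N floats (per-head attention over patches)
    link_logits:   H lists of C floats (the trainable links)
    """
    cnn_attn = [spatial_attention(channel) for channel in cnn_maps]
    n_positions = len(cnn_attn[0])
    loss = 0.0
    for head_attn, logits in zip(vit_head_attn, link_logits):
        # Each head mixes the CNN channels with its own learned weights.
        weights = softmax(logits)
        target = [
            sum(w * channel[i] for w, channel in zip(weights, cnn_attn))
            for i in range(n_positions)
        ]
        # Assumed penalty: mean squared gap between head attention and target.
        loss += sum((a - t) ** 2 for a, t in zip(head_attn, target)) / n_positions
    return loss / len(vit_head_attn)


# Toy usage: two CNN channels, one ViT head, four spatial positions.
toy_maps = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
toy_head = [spatial_attention(toy_maps[0])]
print(link_regularizer(toy_maps, toy_head, [[10.0, -10.0]]))
print(link_regularizer(toy_maps, toy_head, [[-10.0, 10.0]]))
```

In this sketch the link weights of a head, after training, would indicate which CNN activation maps that head follows most closely, which is one way to read the "relevant relationship" the abstract says can be extracted from the module.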
Taxonomy
Topics: Advanced Neural Network Applications · Currency Recognition and Detection · Neural Networks and Applications
