Toward a Deeper Understanding: RetNet Viewed through Convolution
Chenghao Li, Chaoning Zhang

TL;DR
This paper introduces a Gaussian mixture mask to enhance Vision Transformers' local modeling capabilities with minimal additional parameters, demonstrating improved performance on small datasets.
Contribution
It proposes a novel Gaussian mixture mask for ViT attention mechanisms, reducing parameter overhead while boosting local modeling effectiveness.
Findings
Gaussian mixture mask improves ViT performance on small datasets
Minimal additional parameters and computational cost
Effective local modeling enhancement
Abstract
The success of the Vision Transformer (ViT) has been widely reported across a range of image recognition tasks. ViT learns global dependencies better than CNNs, yet a CNN's inherent locality can substitute for expensive training resources. Recently, the outstanding performance of RetNet in language modeling has garnered attention: with explicit local modeling it surpasses the Transformer, shifting researchers' focus towards Transformers in the CV field. This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain. Similar to RetNet, we improve ViT's local modeling by applying a weight mask to the original self-attention matrix. A straightforward way to locally adapt the self-attention matrix is an element-wise learnable weight mask (ELM), for which our preliminary results show…
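The masking idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the parameterization (one weight and one standard deviation per Gaussian), and the choice to apply the mask after the softmax without renormalizing are all assumptions made here for illustration.

```python
import numpy as np


def gaussian_mixture_mask(grid_size, sigmas, weights):
    """Build an (N, N) mask over the patches of a grid_size x grid_size grid.

    Entry (i, j) is a weighted sum of Gaussians of the squared Euclidean
    distance between patch i and patch j. This parameterization is a
    hypothetical one, chosen only to illustrate the idea.
    """
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]                     # (N, N, 2)
    sq_dist = (diff ** 2).sum(axis=-1)                                 # (N, N)

    mask = np.zeros_like(sq_dist)
    for w, s in zip(weights, sigmas):
        mask += w * np.exp(-sq_dist / (2.0 * s ** 2))
    return mask


def masked_attention(q, k, v, mask):
    """Single-head self-attention with an element-wise weight mask.

    Assumption: the mask multiplies the softmax output and the rows are
    not renormalized afterwards.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    attn = attn * mask                            # emphasize nearby patches
    return attn @ v, attn


# Example: 4 x 4 patch grid (N = 16), two Gaussian components.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
mask = gaussian_mixture_mask(grid_size=4, sigmas=[1.0, 3.0], weights=[0.6, 0.4])
out, attn = masked_attention(q, k, v, mask)
```

Under this parameterization a mixture of K Gaussians needs only 2K scalars, whereas an element-wise learnable mask over N patches needs N × N entries, which is the kind of parameter saving the summary above refers to.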
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Focus · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings · Layer Normalization
