Toward a Deeper Understanding: RetNet Viewed through Convolution

Chenghao Li; Chaoning Zhang

arXiv:2309.05375·cs.CV·October 31, 2023

TL;DR

This paper introduces a Gaussian mixture mask to enhance Vision Transformers' local modeling capabilities with minimal additional parameters, demonstrating improved performance on small datasets.

Contribution

It proposes a novel Gaussian mixture mask for ViT attention mechanisms, reducing parameter overhead while boosting local modeling effectiveness.

Findings

01 Gaussian mask improves ViT performance on small datasets

02 Minimal additional parameters and computational cost

03 Effective local modeling enhancement

Abstract

The success of Vision Transformer (ViT) has been reported on a wide range of image recognition tasks. ViT learns global dependencies better than CNNs, yet CNNs' inherent locality can substitute for expensive training resources. Recently, the outstanding performance of RetNet in language modeling has garnered attention: with explicit local modeling it surpasses the Transformer, which has shifted researchers' focus towards Transformers in the CV field. This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain. Similar to RetNet, we improve ViT's local modeling by applying a weight mask to the original self-attention matrix. A straightforward way to locally adapt the self-attention matrix is an element-wise learnable weight mask (ELM), for which our preliminary results show…
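
The masking idea in the abstract can be illustrated with a small sketch. The abstract is truncated here, so the exact formulation is not shown; the code below assumes a mask built as a sum of isotropic Gaussians over the squared distance between patch positions, multiplied element-wise into an attention matrix. The function names, the `amplitudes` and `sigmas` parameters, and the choice to skip renormalization are all hypothetical and are not taken from the paper or its repository.

```python
import math


def gaussian_mixture_mask(height, width, amplitudes, sigmas):
    """Build an N x N mask (N = height * width) over a patch grid.

    Assumed form: mask[i][j] = sum_k a_k * exp(-d2(i, j) / (2 * sigma_k^2)),
    where d2 is the squared Euclidean distance between patch coordinates.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    mask = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            d2 = (r1 - r2) ** 2 + (c1 - c2) ** 2
            row.append(sum(a * math.exp(-d2 / (2.0 * s * s))
                           for a, s in zip(amplitudes, sigmas)))
        mask.append(row)
    return mask


def apply_mask(attention, mask):
    """Element-wise product of an attention matrix and a weight mask."""
    return [[a * m for a, m in zip(a_row, m_row)]
            for a_row, m_row in zip(attention, mask)]


# Usage: a 2 x 2 patch grid (4 tokens) with a two-component mixture.
mask = gaussian_mixture_mask(2, 2, amplitudes=[1.0, 0.5], sigmas=[1.0, 2.0])
uniform_attention = [[0.25] * 4 for _ in range(4)]
masked_attention = apply_mask(uniform_attention, mask)
```

In this sketch a mixture with K components has 2K parameters (one amplitude and one width each), whereas an element-wise mask over N tokens would need N × N entries, which is one way to read the "minimal additional parameters" claim above.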

Code & Models

Repositories

catworldlee/gaussian-mixture-mask-attention (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Focus · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings · Layer Normalization