ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Stéphane d'Ascoli; Hugo Touvron; Matthew Leavitt; Ari Morcos; Giulio Biroli; Levent Sagun

arXiv:2103.10697·cs.CV·December 7, 2022·57 cites

TL;DR

ConViT is a hybrid vision transformer architecture with soft convolutional inductive biases. It uses gated positional self-attention to combine the strengths of CNNs and ViTs, improving both sample efficiency and performance.

Contribution

The paper proposes GPSA, a novel attention mechanism that incorporates a soft convolutional bias into ViTs, enabling improved performance and sample efficiency without extensive pre-training.

Findings

01 ConViT outperforms DeiT on ImageNet classification.

02 GPSA effectively balances locality and content in attention layers.

03 The architecture improves sample efficiency over traditional ViTs.

Abstract

Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialise the GPSA layers to mimic the locality of convolutional layers, then give each attention head…
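The abstract describes GPSA only in words, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each head blends a content-based attention map with a position-based one through a sigmoid gate, and that the positional map favours nearby patches. The names `gated_attention`, `positional_scores`, `gate_logit`, and `locality_strength` are hypothetical, and the squared-distance falloff is an illustrative choice.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grid_positions(side):
    """(row, col) coordinates of the patches on a side x side grid."""
    return [(r, c) for r in range(side) for c in range(side)]


def positional_scores(positions, locality_strength):
    """Assumed locality prior: the score decays with squared patch distance."""
    return [
        [-locality_strength * ((ri - rj) ** 2 + (ci - cj) ** 2)
         for (rj, cj) in positions]
        for (ri, ci) in positions
    ]


def gated_attention(content_scores, pos_scores, gate_logit):
    """Blend content and positional attention for one head.

    Assumption: the gate is sigmoid(gate_logit); a gate near 1 gives
    mostly positional (local) attention, a gate near 0 mostly content.
    """
    g = sigmoid(gate_logit)
    blended = []
    for c_row, p_row in zip(content_scores, pos_scores):
        c_att = softmax(c_row)
        p_att = softmax(p_row)
        blended.append([(1.0 - g) * c + g * p for c, p in zip(c_att, p_att)])
    return blended


# Usage: a 3x3 patch grid with uninformative (all-zero) content scores.
positions = grid_positions(3)
pos = positional_scores(positions, locality_strength=2.0)
content = [[0.0] * len(positions) for _ in positions]

attn_local = gated_attention(content, pos, gate_logit=4.0)     # gate mostly positional
attn_content = gated_attention(content, pos, gate_logit=-4.0)  # gate mostly content
```

Because each row is a convex combination of two softmax distributions, it remains a valid attention distribution for any gate value. In this sketch a head that starts with a large `gate_logit` behaves locally, and lowering it shifts weight toward the content term.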

Code & Models

Videos

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases (SlidesLive)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Memory and Neural Computing

Methods: Linear Layer · Residual Connection · Layer Normalization · Gated Positional Self-Attention · ConViT · Dropout · Multi-Head Attention · Dense Connections · Feedforward Network · Softmax