ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, Levent Sagun

TL;DR
ConViT introduces a hybrid vision transformer architecture with soft convolutional biases, enhancing sample efficiency and performance by combining the strengths of CNNs and ViTs through gated positional self-attention.
Contribution
The paper proposes GPSA, a novel attention mechanism that incorporates a soft convolutional bias into ViTs, enabling improved performance and sample efficiency without extensive pre-training.
Findings
ConViT outperforms DeiT on ImageNet classification.
GPSA effectively balances locality and content in attention layers.
The architecture improves sample efficiency over traditional ViTs.
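The balance between locality and content can be illustrated with a minimal single-head sketch in NumPy. It assumes the gate is a learnable scalar passed through a sigmoid and that the head output is a convex blend of a content attention map and a positional attention map. The function names, the quadratic distance penalty used for the local initialisation, and the `strength` and `center_offset` parameters are hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_position_logits(grid, center_offset=(0.0, 0.0), strength=1.0):
    # Hypothetical locality initialisation: for every query patch, the logit
    # of a key patch decays with its squared distance from the query
    # position shifted by `center_offset`.
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (N, 2)
    target = coords + np.asarray(center_offset, dtype=float)           # (N, 2)
    diff = coords[None, :, :] - target[:, None, :]                     # (N, N, 2)
    return -strength * (diff ** 2).sum(axis=-1)                        # (N, N)

def gated_attention(content_logits, position_logits, gate):
    # Assumed gating form: sigmoid(gate) weights the positional map and
    # (1 - sigmoid(gate)) weights the content map, so rows still sum to 1.
    g = 1.0 / (1.0 + np.exp(-gate))
    return (1.0 - g) * softmax(content_logits) + g * softmax(position_logits)

# Usage: a 4x4 grid of patches with random stand-in content logits.
rng = np.random.default_rng(0)
content = rng.normal(size=(16, 16))
position = local_position_logits(grid=4)
local_attn = gated_attention(content, position, gate=5.0)     # mostly positional
content_attn = gated_attention(content, position, gate=-50.0) # mostly content
```

With a large positive gate the head behaves like a local, convolution-like filter; as the gate decreases, the same head shifts its weight toward content-based attention.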
Abstract
Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialise the GPSA layers to mimic the locality of convolutional layers, then give each attention head…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Memory and Neural Computing
Methods: Linear Layer · Residual Connection · Layer Normalization · Gated Positional Self-Attention · ConViT · Dropout · Multi-Head Attention · Dense Connections · Feedforward Network · Softmax
