Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

Tianxiao Zhang; Wenju Xu; Bo Luo; Guanghui Wang

arXiv:2407.19394 · cs.CV · January 17, 2025 · 2 cites

TL;DR

This paper introduces a lightweight Depth-Wise Convolution module in Vision Transformers to better capture local details, significantly improving performance on small datasets across various vision tasks.

Contribution

It proposes a novel Depth-Wise Convolution module as a shortcut in ViT models, enhancing local feature learning and efficiency, especially for small datasets.
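
To give a sense of why a depth-wise convolution counts as lightweight, the sketch below compares its parameter count with that of a standard convolution of the same width. The channel width of 192 and the 3x3 kernel are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Parameter count of a standard 2D convolution."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def depthwise_conv_params(channels, kernel_size, bias=True):
    """Parameter count of a depth-wise 2D convolution (one filter per channel)."""
    weights = channels * kernel_size * kernel_size
    return weights + (channels if bias else 0)


# Assumed values for illustration only (not from the paper).
channels, kernel_size = 192, 3
standard = conv_params(channels, channels, kernel_size)
depthwise = depthwise_conv_params(channels, kernel_size)
print(standard, depthwise)  # 331968 1920
```

Under these assumptions the depth-wise layer needs about 173 times fewer parameters, because each channel is filtered on its own rather than being mixed with every other channel.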

Findings

01. Significant performance improvements on small datasets

02. Effective capture of local and global information

03. Reduced model parameters with architecture variants

Abstract

The Vision Transformer (ViT) leverages the Transformer's encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of ViT captures the global context from the outset, overlooking the inherent relationships between neighboring pixels in images or videos. Transformers mainly focus on global information while ignoring the fine-grained local details. Consequently, ViT lacks inductive bias during image or video dataset training. In contrast, convolutional neural networks (CNNs), with their reliance on local filters, possess an inherent inductive bias, making them more efficient and quicker to converge than ViT with less data. In this paper, we present a lightweight Depth-Wise Convolution module as a shortcut in ViT models, bypassing entire Transformer blocks to ensure the…
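
The shortcut idea described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the 3x3 kernel, the zero padding, the omission of a class token, and the element-wise sum that merges the two paths are all assumptions, and `depthwise_conv2d`, `block_with_dw_shortcut`, and `transformer_block` are hypothetical names. The official PyTorch repository listed under Code & Models is the authoritative reference.

```python
def depthwise_conv2d(grid, kernels):
    """Depth-wise 3x3 convolution with zero padding.

    grid: H x W x C nested lists; kernels: C kernels, each a 3x3 nested list.
    Each channel is filtered by its own kernel, with no cross-channel mixing.
    """
    height, width, channels = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for c in range(channels):
                total = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        # Positions outside the grid contribute zero (padding).
                        if 0 <= ii < height and 0 <= jj < width:
                            total += kernels[c][di + 1][dj + 1] * grid[ii][jj][c]
                out[i][j][c] = total
    return out


def block_with_dw_shortcut(tokens, height, width, kernels, transformer_block):
    """Add a depth-wise convolution path that bypasses a Transformer block.

    tokens: list of height * width patch-token vectors (class token omitted,
    an assumption). transformer_block: any callable mapping tokens to tokens.
    """
    # Restore the 2D patch layout so neighboring patches are adjacent.
    grid = [tokens[r * width:(r + 1) * width] for r in range(height)]
    local = depthwise_conv2d(grid, kernels)
    local_tokens = [vec for row in local for vec in row]
    # Global path: the (stand-in) Transformer block.
    global_tokens = transformer_block(tokens)
    # Assumed merge: element-wise sum of the global and local paths.
    return [[g + l for g, l in zip(gv, lv)]
            for gv, lv in zip(global_tokens, local_tokens)]


# Usage with an identity kernel and an identity stand-in block.
identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 2x2 grid, 2 channels
result = block_with_dw_shortcut(tokens, 2, 2, [identity_kernel] * 2, lambda t: t)
```

With an identity kernel and an identity block, both paths return the input, so `result` is simply the tokens doubled. In a PyTorch model the same local path would more likely be a grouped `Conv2d` with `groups` equal to the channel count.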

Code & Models

Repositories

ztx-100/efficient_vit_with_dw (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications

Methods: Attention Is All You Need · Label Smoothing · Adam · Linear Layer · Byte Pair Encoding · Convolution · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Dense Connections