Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets
Tianxiao Zhang, Wenju Xu, Bo Luo, Guanghui Wang

TL;DR
This paper introduces a lightweight Depth-Wise Convolution module in Vision Transformers to better capture local details, significantly improving performance on small datasets across various vision tasks.
Contribution
The paper proposes a lightweight Depth-Wise Convolution module that acts as a shortcut around Transformer blocks in ViT models, enhancing local feature learning and training efficiency, especially on small datasets.
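To make "lightweight" concrete, the sketch below counts the learnable parameters of a depth-wise convolution against a standard convolution of the same width. The embedding width (192) and kernel size (3) are assumptions chosen for illustration and are not taken from the paper; `conv_params` is a hypothetical helper, not part of any library.

```python
def conv_params(channels_in, channels_out, kernel_size, groups=1, bias=True):
    """Number of learnable parameters in a 2D convolution layer."""
    # Each output channel sees channels_in / groups input channels.
    weights = (channels_in // groups) * channels_out * kernel_size * kernel_size
    return weights + (channels_out if bias else 0)


dim = 192     # assumed embedding width, not from the paper
kernel = 3    # assumed kernel size, not from the paper

standard = conv_params(dim, dim, kernel)                # dense convolution
depthwise = conv_params(dim, dim, kernel, groups=dim)   # one filter per channel

print(standard, depthwise)  # 331968 1920
```

Under these assumed sizes the depth-wise layer needs 1,920 parameters where a dense convolution would need 331,968, because each channel is filtered on its own rather than mixed with every other channel.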
Findings
Significant performance improvements on small datasets
Effective capture of local and global information
Reduced model parameters with architecture variants
Abstract
The Vision Transformer (ViT) leverages the Transformer's encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of ViT captures the global context from the outset, overlooking the inherent relationships between neighboring pixels in images or videos. Transformers mainly focus on global information while ignoring the fine-grained local details. Consequently, ViT lacks inductive bias during image or video dataset training. In contrast, convolutional neural networks (CNNs), with their reliance on local filters, possess an inherent inductive bias, making them more efficient and quicker to converge than ViT with less data. In this paper, we present a lightweight Depth-Wise Convolution module as a shortcut in ViT models, bypassing entire Transformer blocks to ensure the…
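The shortcut idea described in the abstract can be sketched roughly as follows, in plain Python with no dependencies. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the two paths are fused by simple addition, zero padding with stride 1 is used, and class-token handling is left out because the truncated abstract does not specify any of these details.

```python
def depthwise_conv_tokens(tokens, grid_h, grid_w, kernels):
    """Per-channel k x k filtering (zero padding, stride 1) over patch tokens.

    tokens:  grid_h * grid_w tokens in row-major order, each a list of C floats.
    kernels: C kernels, each a k x k nested list with k odd.
    Uses the cross-correlation convention common in deep learning frameworks.
    """
    channels = len(kernels)
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * channels for _ in range(grid_h * grid_w)]
    for y in range(grid_h):
        for x in range(grid_w):
            for c in range(channels):
                acc = 0.0
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        # Neighbors outside the grid count as zero.
                        if 0 <= yy < grid_h and 0 <= xx < grid_w:
                            acc += kernels[c][dy][dx] * tokens[yy * grid_w + xx][c]
                out[y * grid_w + x][c] = acc
    return out


def block_with_dw_shortcut(tokens, grid_h, grid_w, kernels, transformer_block):
    """Add a local depth-wise path around a Transformer block (assumed fusion: sum)."""
    global_path = transformer_block(tokens)  # global context (self-attention)
    local_path = depthwise_conv_tokens(tokens, grid_h, grid_w, kernels)
    return [[g + l for g, l in zip(g_tok, l_tok)]
            for g_tok, l_tok in zip(global_path, local_path)]


# Toy usage: a 2 x 2 grid of single-channel tokens, an all-ones 3 x 3 kernel,
# and an identity function standing in for the Transformer block.
toy_tokens = [[1.0], [2.0], [3.0], [4.0]]
ones_kernel = [[[1.0] * 3 for _ in range(3)]]
result = block_with_dw_shortcut(toy_tokens, 2, 2, ones_kernel, lambda t: t)
print(result)  # [[11.0], [12.0], [13.0], [14.0]]
```

In the toy run every token's 3 x 3 neighborhood covers the whole 2 x 2 grid, so the local path contributes 10.0 to each token and the stand-in block passes the input through unchanged.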
Taxonomy
Topics: Advanced Neural Network Applications
Methods: Attention Is All You Need · Label Smoothing · Adam · Linear Layer · Byte Pair Encoding · Convolution · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Dense Connections
