HSViT: Horizontally Scalable Vision Transformer

Chenhao Xu; Chang-Tsun Li; Chee Peng Lim; Douglas Creighton

arXiv:2404.05196 · cs.CV · July 17, 2024 · 2 citations

TL;DR

HSViT is a horizontally scalable vision transformer built on a new image-level feature embedding. It removes the need for pre-training, supports collaborative training across devices, and reports higher accuracy on small datasets and on ImageNet.

Contribution

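The paper proposes HSViT, an architecture that combines two ideas: an image-level feature embedding that preserves inductive bias, so the model can be trained without pre-training, and a horizontally scalable design that lets training and inference be shared across multiple computing devices.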

Findings

01. Achieves up to 10% higher accuracy on small datasets without pre-training.
02. Provides up to 3.1% accuracy improvement on ImageNet over CNN backbones.
03. Enables collaborative training across multiple devices.

Abstract

Because it lacks prior knowledge (inductive bias), the Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing number of layers and parameters in ViT models impedes their applicability to devices with limited computing resources. To mitigate these challenges, this paper introduces a novel horizontally scalable vision transformer (HSViT) scheme. Specifically, a novel image-level feature embedding is introduced to ViT, where the preserved inductive bias eliminates the need for pre-training while still delivering strong performance on small datasets. In addition, a novel horizontally scalable architecture is designed, facilitating collaborative model training and inference across multiple computing devices. The experimental results show that, without pre-training, HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art…
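
The two ideas in the abstract can be illustrated with a rough PyTorch sketch. This is not the authors' implementation (see the official repository for that); it is a minimal illustration under stated assumptions. It assumes that "image-level feature embedding" means each whole convolutional feature map is treated as one token, and that "horizontally scalable" means several independent branches whose predictions are averaged. The class names Branch and HorizontalEnsemble, all layer sizes, and the averaging rule are hypothetical choices made only for this example.

```python
import torch
import torch.nn as nn


class Branch(nn.Module):
    """One independent branch (hypothetical): conv features -> attention -> classifier."""

    def __init__(self, num_classes, channels=16, embed_dim=64, heads=4, pooled=8):
        super().__init__()
        # Convolutions keep the locality bias that a plain ViT lacks.
        self.conv = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(pooled),
        )
        # Assumption: each whole feature map (pooled * pooled values) is one token.
        self.proj = nn.Linear(pooled * pooled, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        fmap = self.conv(x)                      # (B, C, pooled, pooled)
        tokens = self.proj(fmap.flatten(2))      # (B, C, embed_dim), one token per map
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended               # residual connection
        return self.head(tokens.mean(dim=1))     # (B, num_classes)


class HorizontalEnsemble(nn.Module):
    """Scale "sideways" by adding branches; each could live on its own device."""

    def __init__(self, num_branches, num_classes):
        super().__init__()
        self.branches = nn.ModuleList(
            [Branch(num_classes) for _ in range(num_branches)]
        )

    def forward(self, x):
        # Assumption: branch outputs are combined by simple averaging of logits.
        logits = torch.stack([branch(x) for branch in self.branches])
        return logits.mean(dim=0)


model = HorizontalEnsemble(num_branches=3, num_classes=10)
logits = model(torch.randn(2, 3, 32, 32))        # shape (2, 10)
```

Because the branches in this sketch share no parameters, each one could be placed on a different device and only the per-branch logits would need to be exchanged, which is one plausible reading of the collaborative training and inference the abstract describes.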

Code & Models

Repositories

xuchenhao001/hsvit (PyTorch, official)

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Vision and Imaging

Methods: Attention Is All You Need · Softmax · Linear Layer · Layer Normalization · Dense Connections · Label Smoothing · Vision Transformer · Residual Connection · Dropout · Multi-Head Attention