HSViT: Horizontally Scalable Vision Transformer
Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, Douglas Creighton

TL;DR
HSViT is a horizontally scalable vision transformer built on a new image-level feature embedding. It removes the need for pre-training and enables collaborative training, reporting up to 10% higher accuracy on small datasets and up to 3.1% higher accuracy than CNN backbones on ImageNet.
Contribution
The paper proposes HSViT, an architecture that combines an image-level feature embedding with horizontal scalability, removing the dependency on pre-training and improving accuracy on both small datasets and ImageNet.
Findings
Achieves up to 10% higher accuracy on small datasets without pre-training.
Provides up to 3.1% accuracy improvement on ImageNet over CNN backbones.
Enables collaborative training across multiple devices.
Abstract
Due to its deficiency in prior knowledge (inductive bias), Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing number of layers and parameters in ViT models impedes their applicability to devices with limited computing resources. To mitigate these challenges, this paper introduces a novel horizontally scalable vision transformer (HSViT) scheme. Specifically, a novel image-level feature embedding is introduced to ViT, where the preserved inductive bias eliminates the need for pre-training while achieving stronger results on small datasets. In addition, a novel horizontally scalable architecture is designed, facilitating collaborative model training and inference across multiple computing devices. The experimental results show that, without pre-training, HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art…
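The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not confirm: that "image-level" means each token is one whole convolutional feature map (flattened) rather than an image patch, that tokens are split into groups which separate devices could process, and that per-group predictions are averaged. The function names, the round-robin grouping, and the mean-pool-plus-linear branch (a stand-in for an attention block) are all hypothetical. Only the Python standard library is used.

```python
import math
import random


def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def image_level_tokens(image, kernels):
    """Assumed embedding: one token per kernel, the whole feature map flattened."""
    tokens = []
    for kernel in kernels:
        fmap = conv2d_valid(image, kernel)
        tokens.append([v for row in fmap for v in row])
    return tokens


def split_into_groups(tokens, num_groups):
    """Round-robin split (hypothetical); each group could live on its own device."""
    return [tokens[g::num_groups] for g in range(num_groups)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def branch_predict(group, weights):
    """Stand-in for an attention branch: mean-pool the tokens, then a linear layer."""
    dim = len(group[0])
    pooled = [sum(tok[d] for tok in group) / len(group) for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in weights]
    return softmax(logits)


def average_predictions(per_branch_probs):
    """Combine branches by averaging class probabilities (an assumed rule)."""
    n = len(per_branch_probs)
    num_classes = len(per_branch_probs[0])
    return [sum(p[c] for p in per_branch_probs) / n for c in range(num_classes)]


rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]       # 8x8 toy image
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(6)]                                      # six 3x3 kernels

tokens = image_level_tokens(image, kernels)        # 6 tokens of length 6 * 6 = 36
groups = split_into_groups(tokens, num_groups=2)   # 2 groups of 3 tokens each

num_classes = 4
branch_weights = [
    [[rng.uniform(-1, 1) for _ in range(len(tokens[0]))] for _ in range(num_classes)]
    for _ in groups
]
probs = average_predictions(
    [branch_predict(g, w) for g, w in zip(groups, branch_weights)]
)
```

In this sketch the number of tokens equals the number of kernels, not the number of patches, and each branch depends only on its own group of tokens, so adding branches widens the model without deepening it. The weights are random, so `probs` carries no meaning beyond showing the data flow.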
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Vision and Imaging
Methods: Attention Is All You Need · Softmax · Linear Layer · Layer Normalization · Dense Connections · Label Smoothing · Vision Transformer · Residual Connection · Dropout · Multi-Head Attention
