Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions
Rui-Yang Ju, Ting-Yu Lin, Jen-Shiun Chiang, Jia-Hao Jian, Yu-Shian Lin, and Liu-Rui-Yi Huang

TL;DR
This paper introduces the Aggregated Pyramid Vision Transformer (APVT), a convolution-free architecture for image recognition that combines a pyramid structure with a group encoder built on the split-transform-merge strategy.
Contribution
The paper builds on Vision Transformer, adding a pyramid structure and a group encoder that applies the split-transform-merge strategy, with the aim of improving performance on vision tasks while reducing computational cost.
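As a rough illustration of the split-transform-merge idea applied to token features, the following minimal NumPy sketch splits the channel dimension into groups, runs single-head self-attention on each group independently, and concatenates the results. This is not the authors' implementation: the function names, the choice of concatenation as the merge step, the random weights, and the sizes (16 tokens, 64 channels, 4 groups) are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over (num_tokens, dim).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def split_transform_merge(tokens, group_weights):
    # Split: cut the channel dimension into equally sized groups.
    groups = np.split(tokens, len(group_weights), axis=-1)
    # Transform: each group gets its own attention with its own weights.
    outputs = [self_attention(g, *w) for g, w in zip(groups, group_weights)]
    # Merge: concatenate group outputs (an assumption for this sketch).
    return np.concatenate(outputs, axis=-1)

def make_group_weights(dim, num_groups, seed=0):
    # Hypothetical helper: random (w_q, w_k, w_v) per group.
    rng = np.random.default_rng(seed)
    d = dim // num_groups
    return [
        tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        for _ in range(num_groups)
    ]

tokens = np.random.default_rng(1).standard_normal((16, 64))
weights = make_group_weights(dim=64, num_groups=4)
out = split_transform_merge(tokens, weights)
print(out.shape)
```

With G groups over D channels, each projection in this sketch uses G · (D/G)² = D²/G weights instead of D², which is one way a grouped design can lower cost; whether APVT's group encoder realizes its savings this way is not stated in the summary above.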
Findings
APVT achieves state-of-the-art results on CIFAR-10 and COCO 2017 datasets.
The model reduces computational costs compared to other Transformer-based architectures.
APVT demonstrates strong performance in image classification and object detection.
Abstract
With the achievements of Transformer in the field of natural language processing, the encoder-decoder and the attention mechanism in Transformer have been applied to computer vision. Recently, in multiple tasks of computer vision (image classification, object detection, semantic segmentation, etc.), state-of-the-art convolutional neural networks have introduced some concepts of Transformer. This proves that Transformer has a good prospect in the field of image recognition. After Vision Transformer was proposed, more and more works began to use self-attention to completely replace the convolutional layer. This work is based on Vision Transformer, combined with the pyramid architecture, using Split-transform-merge to propose the group encoder and name the network architecture Aggregated Pyramid Vision Transformer (APVT). We perform image classification tasks on the CIFAR-10 dataset and…
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · COVID-19 diagnosis using AI
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Label Smoothing · Dropout
