Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions

Rui-Yang Ju, Ting-Yu Lin, Jen-Shiun Chiang, Jia-Hao Jian, Yu-Shian Lin, and Liu-Rui-Yi Huang

arXiv:2203.00960·cs.CV·March 3, 2022

Open Access

TL;DR

This paper introduces the Aggregated Pyramid Vision Transformer (APVT), an architecture that combines a pyramid structure with a split-transform-merge strategy to perform image recognition efficiently without convolutions.

Contribution

The paper proposes a Transformer-based architecture that pairs a pyramid structure with a split-transform-merge strategy, with the aim of improving performance and reducing computational cost in vision tasks.

Findings

01 APVT achieves state-of-the-art results on CIFAR-10 and COCO 2017 datasets.

02 The model reduces computational costs compared to other Transformer-based architectures.

03 APVT demonstrates strong performance in image classification and object detection.

Abstract

With the achievements of Transformer in the field of natural language processing, the encoder-decoder and the attention mechanism in Transformer have been applied to computer vision. Recently, in multiple tasks of computer vision (image classification, object detection, semantic segmentation, etc.), state-of-the-art convolutional neural networks have introduced some concepts of Transformer. This proves that Transformer has a good prospect in the field of image recognition. After Vision Transformer was proposed, more and more works began to use self-attention to completely replace the convolutional layer. This work is based on Vision Transformer, combined with the pyramid architecture, using Split-transform-merge to propose the group encoder and name the network architecture Aggregated Pyramid Vision Transformer (APVT). We perform image classification tasks on the CIFAR-10 dataset and…
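The split-transform-merge idea named in the abstract can be illustrated with a small sketch. This is not the paper's group encoder: it assumes a generic version of the pattern in which the channel dimension of the token features is split into equal groups, each group is transformed by a parameter-free self-attention (queries, keys, and values are all the tokens themselves), and the branch outputs are merged by concatenation followed by a residual connection. The function names `self_attention` and `split_transform_merge`, and the choice of concatenation as the merge step, are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Assumption: no learned projections, to keep the sketch self-contained.
    `tokens` is a list of equal-length lists of floats (one list per token).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def split_transform_merge(tokens, groups):
    """Hypothetical split-transform-merge block over the channel dimension."""
    d = len(tokens[0])
    if d % groups != 0:
        raise ValueError("channel dimension must be divisible by groups")
    width = d // groups

    branches = []
    for g in range(groups):
        # Split: take this group's slice of channels from every token.
        part = [t[g * width:(g + 1) * width] for t in tokens]
        # Transform: run self-attention independently inside the group.
        branches.append(self_attention(part))

    # Merge: concatenate the branch outputs back along the channel axis.
    merged = [sum((b[i] for b in branches), []) for i in range(len(tokens))]

    # Residual connection around the whole block.
    return [[x + y for x, y in zip(t, m)] for t, m in zip(tokens, merged)]
```

Calling `split_transform_merge(tokens, groups=2)` on a list of 8-channel tokens returns tokens of the same shape. Because each branch attends over only `d / groups` channels, the cost of computing the attention scores per branch shrinks with the group width, which is the usual motivation for this pattern.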

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · COVID-19 diagnosis using AI

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Label Smoothing · Dropout