Co-Scale Conv-Attentional Image Transformers

Weijian Xu; Yifan Xu; Tyler Chang; Zhuowen Tu

arXiv:2104.06399 · cs.CV · August 27, 2021

TL;DR

Co-Scale Conv-Attentional Image Transformers (CoaT) add co-scale and conv-attentional mechanisms to a Transformer-based image classifier to strengthen multi-scale and contextual modeling. Relatively small CoaT models outperform similar-sized CNNs and image Transformers on ImageNet, and the backbone also improves object detection and segmentation.

Contribution

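The paper proposes two mechanisms for Transformer-based image classifiers. The co-scale mechanism, realized through a series of serial and parallel blocks, keeps the encoder branch at each scale intact while letting representations learned at different scales communicate. The conv-attentional mechanism realizes a relative position embedding inside the factorized attention module with an efficient convolution-like implementation.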

Findings

01. CoaT models outperform similar-sized CNNs and Transformers on ImageNet.

02. CoaT's backbone improves object detection and segmentation performance.

03. Efficient multi-scale and contextual modeling enhances classification accuracy.

Abstract

In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and…
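
The conv-attentional mechanism described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes that factorized attention applies a softmax to the keys across tokens and forms softmax(K)^T V before the queries are applied, and that the relative position term is the element-wise product of Q with a depthwise convolution of V over the 2D token grid; the abstract does not spell out either formula. All function names, the kernel layout, and the omission of multi-head splitting and class tokens are assumptions made for illustration.

```python
import math


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def softmax_over_tokens(K):
    """Normalise each channel (column) of K across the N tokens (rows)."""
    n, c = len(K), len(K[0])
    out = [[0.0] * c for _ in range(n)]
    for j in range(c):
        col = [K[i][j] for i in range(n)]
        m = max(col)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in col]
        total = sum(exps)
        for i in range(n):
            out[i][j] = exps[i] / total
    return out


def factorized_attention(Q, K, V):
    """Assumed form: (Q / sqrt(C)) @ (softmax(K)^T @ V), for N x C inputs."""
    c = len(Q[0])
    context = matmul(transpose(softmax_over_tokens(K)), V)  # C x C, no N x N map
    scaled_q = [[q / math.sqrt(c) for q in row] for row in Q]
    return matmul(scaled_q, context)  # N x C


def depthwise_conv2d(V, height, width, kernels):
    """Apply one k x k kernel per channel to V viewed as a height x width grid.

    V is N x C with N = height * width (row-major); borders are zero padded.
    kernels[ch] is a hypothetical k x k list of weights for channel ch.
    """
    c = len(V[0])
    k = len(kernels[0])
    pad = k // 2
    out = [[0.0] * c for _ in range(height * width)]
    for ch in range(c):
        for y in range(height):
            for x in range(width):
                acc = 0.0
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += kernels[ch][dy][dx] * V[yy * width + xx][ch]
                out[y * width + x][ch] = acc
    return out


def conv_attention(Q, K, V, height, width, kernels):
    """Assumed form: factorized attention plus Q * depthwise_conv(V), element-wise."""
    attn = factorized_attention(Q, K, V)
    pos = depthwise_conv2d(V, height, width, kernels)
    return [
        [a + q * p for a, q, p in zip(attn_row, q_row, pos_row)]
        for attn_row, q_row, pos_row in zip(attn, Q, pos)
    ]
```

Because the C x C context matrix is built before the queries are applied, the cost of the attention term grows linearly with the number of tokens N rather than quadratically, and the position term only touches a small window around each token.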

Taxonomy

Methods: Co-Scale Conv-attentional Image Transformer