TL;DR
Co-Scale Conv-Attentional Image Transformers (CoaT) add co-scale and conv-attentional mechanisms to a Transformer-based image classifier to strengthen multi-scale and contextual modeling. Relatively small CoaT models outperform similar-sized CNNs and Transformers on ImageNet, and the backbone also helps on object detection and segmentation.
Contribution
The paper proposes novel co-scale and conv-attentional mechanisms that improve multi-scale and contextual understanding in Transformer-based image classifiers.
Findings
CoaT models outperform similar-sized CNNs and Transformers on ImageNet.
CoaT's backbone improves object detection and segmentation performance.
The co-scale and conv-attentional mechanisms supply the enriched multi-scale and contextual modeling that the paper associates with these classification gains.
Abstract
In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and…
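The conv-attentional idea in the abstract can be illustrated with a small sketch. It is a toy, not the authors' implementation, and it rests on assumptions the abstract does not spell out: that factorized attention takes the form (Q / sqrt(C)) · (softmax(K)ᵀ V) with the softmax taken across tokens, and that the relative position term is Q multiplied elementwise by a depthwise convolution of V. The function names are hypothetical, and a 1D convolution along the token sequence stands in for the convolution over the 2D image grid.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_over_tokens(K):
    """Softmax each channel (column) of the N x C matrix K across its N tokens."""
    n, c = len(K), len(K[0])
    out = [[0.0] * c for _ in range(n)]
    for j in range(c):
        col = [K[i][j] for i in range(n)]
        m = max(col)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in col]
        total = sum(exps)
        for i in range(n):
            out[i][j] = exps[i] / total
    return out


def factorized_attention(Q, K, V):
    """Assumed form: (Q / sqrt(C)) @ (softmax(K)^T @ V).

    The C x C context is built first, so no N x N attention map is formed.
    """
    c = len(Q[0])
    context = matmul(transpose(softmax_over_tokens(K)), V)  # C x C
    scaled_q = [[q / math.sqrt(c) for q in row] for row in Q]
    return matmul(scaled_q, context)  # N x C


def depthwise_conv1d(V, kernels):
    """Zero-padded 1D convolution along tokens, one kernel per channel.

    Toy stand-in (assumption) for a depthwise convolution over the image grid.
    """
    n, c = len(V), len(V[0])
    out = [[0.0] * c for _ in range(n)]
    for j in range(c):
        half = len(kernels[j]) // 2
        for i in range(n):
            for t, w in enumerate(kernels[j]):
                src = i + t - half
                if 0 <= src < n:
                    out[i][j] += w * V[src][j]
    return out


def conv_attention(Q, K, V, kernels):
    """Factorized attention plus the assumed position term Q * DepthwiseConv(V)."""
    att = factorized_attention(Q, K, V)
    pos = depthwise_conv1d(V, kernels)
    return [
        [a + q * p for a, q, p in zip(att_row, q_row, pos_row)]
        for att_row, q_row, pos_row in zip(att, Q, pos)
    ]
```

Building the C x C context before multiplying by Q keeps the cost linear in the number of tokens N, which is the usual motivation for factorizing attention. With all-zero kernels, `conv_attention` reduces to `factorized_attention`.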
Code & Models
- 🤗 kadirnar/timm_model_list (model) · 1 like
- 🤗 timm/coat_lite_medium.in1k (model) · 258 downloads
- 🤗 timm/coat_lite_medium_384.in1k (model) · 120 downloads
- 🤗 timm/coat_lite_mini.in1k (model) · 1.4k downloads
- 🤗 timm/coat_lite_small.in1k (model) · 631 downloads
- 🤗 timm/coat_lite_tiny.in1k (model) · 180 downloads
- 🤗 timm/coat_mini.in1k (model) · 175 downloads
- 🤗 timm/coat_small.in1k (model) · 152 downloads
- 🤗 timm/coat_tiny.in1k (model) · 1.2k downloads
- 🤗 litert-community/coat_lite_tiny (model) · 20 downloads
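The timm checkpoints listed above can typically be loaded through `timm.create_model` using the `hf_hub:` prefix. The following is a minimal sketch that assumes `timm` and PyTorch are installed and that the checkpoint name is one of the `timm/...` entries in the list; `hub_id` and `load_coat` are hypothetical helper names, not part of timm.

```python
def hub_id(checkpoint: str, org: str = "timm") -> str:
    """Build the 'hf_hub:<org>/<name>' identifier that timm resolves on the Hugging Face Hub."""
    return f"hf_hub:{org}/{checkpoint}"


def load_coat(checkpoint: str = "coat_lite_tiny.in1k", pretrained: bool = True):
    """Create a CoaT model in eval mode (downloads weights when pretrained=True)."""
    import timm  # imported lazily so this module loads even without timm installed

    model = timm.create_model(hub_id(checkpoint), pretrained=pretrained)
    return model.eval()
```

Calling `load_coat("coat_lite_mini.in1k")` would then fetch that checkpoint. Input resolution and preprocessing should be taken from the model's pretrained configuration, since they are not stated on this page.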
Taxonomy
Methods: Co-Scale Conv-attentional Image Transformer
