Contextual Transformer Networks for Visual Recognition

Yehao Li; Ting Yao; Yingwei Pan; Tao Mei

arXiv:2107.12292·cs.CV·July 27, 2021·42 cites

TL;DR

This paper introduces the Contextual Transformer (CoT) block, a Transformer-style module that exploits contextual information among neighboring input keys to guide attention learning, improving results on image recognition, object detection, and segmentation.

Contribution

The paper proposes the CoT block, which uses contextual information among input keys to guide the learning of a dynamic attention matrix. The block can replace the 3x3 convolutions in ResNet architectures to strengthen visual representation.

Findings

01. CoTNet outperforms traditional CNN backbones in image recognition.

02. CoTNet improves object detection and segmentation results.

03. The CoT block can replace 3x3 convolutions in ResNet, enhancing feature representation.

Abstract

Transformers with self-attention have revolutionized the natural language processing field and have recently inspired Transformer-style architecture designs with competitive results in numerous computer vision tasks. Nevertheless, most existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix based on pairs of isolated queries and keys at each spatial location, but leave the rich contexts among neighboring keys under-exploited. In this work, we design a novel Transformer-style module, i.e., the Contextual Transformer (CoT) block, for visual recognition. This design fully capitalizes on the contextual information among input keys to guide the learning of the dynamic attention matrix and thus strengthens the capacity of visual representation. Technically, the CoT block first contextually encodes input keys via a $3 \times 3$ convolution,…
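
As a rough illustration of the first step named in the abstract, the sketch below applies a 3x3 convolution to a small single-channel map of keys, so that each encoded key mixes in its spatial neighbors instead of standing alone. It is a minimal sketch, not the authors' implementation: the function `conv3x3`, the averaging kernel, the zero padding, and the toy 3x3 input are all assumptions made for illustration, and a real CoT block would presumably use learned multi-channel weights.

```python
# Minimal sketch (not the authors' code): contextual encoding of keys with a
# 3x3 convolution over a single-channel 2D feature map, using zero padding.
# The kernel weights are hypothetical; in a trained network they are learned.

def conv3x3(feature_map, kernel):
    """Apply a 3x3 kernel (cross-correlation) with zero padding and stride 1."""
    height = len(feature_map)
    width = len(feature_map[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    # Positions outside the map contribute zero (zero padding).
                    if 0 <= y < height and 0 <= x < width:
                        total += kernel[di + 1][dj + 1] * feature_map[y][x]
            out[i][j] = total
    return out


# Keys taken directly from the input map (one scalar per spatial location).
keys = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

# Hypothetical averaging kernel, standing in for learned weights.
mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]

# Each output value now depends on the 3x3 neighborhood of the matching key.
contextual_keys = conv3x3(keys, mean_kernel)
```

With this averaging kernel, the center key becomes the mean of all nine keys, while a corner key only draws on its four in-bounds neighbors. The later steps of the block are not covered by this sketch.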

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Batch Normalization · Average Pooling · Kaiming Initialization · Residual Block · Global Average Pooling