Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

Qibin Hou; Cheng-Ze Lu; Ming-Ming Cheng; Jiashi Feng

arXiv:2211.11943 · cs.CV · November 23, 2022 · 72 cites

TL;DR

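Conv2Former replaces self-attention with a simpler convolutional modulation operation that encodes spatial features using large kernels. It outperforms popular ConvNets and vision Transformers, such as ConvNeXt and Swin Transformer, on ImageNet classification, COCO object detection, and ADE20k semantic segmentation.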

Contribution

The paper proposes a novel convolutional modulation method that simplifies self-attention, enabling the design of hierarchical ConvNets that leverage large kernels more effectively.

Findings

01 Outperforms Swin Transformer and ConvNeXt on ImageNet classification

02 Achieves better results than both on COCO object detection

03 Outperforms both on ADE20k semantic segmentation

Abstract

This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features. By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (>=7x7) nested in convolutional layers. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that Conv2Former outperforms existing popular ConvNets and Vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20k semantic segmentation.
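The convolutional modulation mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes the operation is an element-wise (Hadamard) product between a large-kernel depthwise convolution branch and a linear value branch, followed by a linear output projection. The abstract does not spell out these details, so the function names, the weight names (w_a, dw_kernels, w_v, w_out), and the omission of normalization and activation layers are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a convolutional modulation step.
# Feature maps are nested lists shaped [C][H][W]; all names are illustrative.

def pointwise(x, w):
    """1x1 convolution: mix channels at every pixel. w is [C_out][C_in]."""
    h, wd = len(x[0]), len(x[0][0])
    return [[[sum(w[o][c] * x[c][i][j] for c in range(len(x)))
              for j in range(wd)]
             for i in range(h)]
            for o in range(len(w))]


def depthwise(x, kernels):
    """Per-channel k x k convolution with zero padding; kernels is [C][k][k], k odd."""
    k = len(kernels[0])
    r = k // 2
    h, wd = len(x[0]), len(x[0][0])
    out = []
    for c, plane in enumerate(x):
        res = [[0.0] * wd for _ in range(h)]
        for i in range(h):
            for j in range(wd):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        y, z = i + di - r, j + dj - r
                        if 0 <= y < h and 0 <= z < wd:  # skip zero-padded positions
                            acc += kernels[c][di][dj] * plane[y][z]
                res[i][j] = acc
        out.append(res)
    return out


def conv_modulation(x, w_a, dw_kernels, w_v, w_out):
    """Assumed form: out = W_out((DWConv(W_a x)) * (W_v x)), with * element-wise."""
    a = depthwise(pointwise(x, w_a), dw_kernels)  # spatial weights from a large kernel
    v = pointwise(x, w_v)                         # value branch, no spatial mixing
    z = [[[a[c][i][j] * v[c][i][j] for j in range(len(v[0][0]))]
          for i in range(len(v[0]))]
         for c in range(len(v))]                  # modulate values by the conv output
    return pointwise(z, w_out)


if __name__ == "__main__":
    # One channel, 3x3 map of ones, all-ones 7x7 kernel, identity projections.
    x = [[[1.0] * 3 for _ in range(3)]]
    identity = [[1.0]]
    ones7 = [[[1.0] * 7 for _ in range(7)]]
    print(conv_modulation(x, identity, ones7, identity, identity))
```

In this sketch the weights applied to the value branch come from a single convolution rather than from a pairwise similarity matrix, which is one plausible reading of how the approach "simplifies" self-attention. Because the kernel is depthwise, its cost grows with the kernel area per channel rather than with the square of the number of pixels.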

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · ConvNeXt · Stochastic Depth · Layer Normalization · Adam · Linear Layer · Dense Connections · Residual Connection · Byte Pair Encoding