SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers

Xijun Wang; Xiaojie Chu; Chunrui Han; Xiangyu Zhang

arXiv:2308.07110 · cs.CV · August 15, 2023

TL;DR

This paper introduces the Spatial Cross-scale Convolution (SCSC) module that enhances both CNNs and Transformers by capturing diverse features efficiently, leading to improved performance with fewer computational resources.

Contribution

The paper proposes a novel SCSC module that effectively improves CNNs and Transformers by capturing multi-scale features with reduced complexity.

Findings

1. FaceResNet with SCSC improves accuracy by 2.7% with 68% fewer FLOPs.
2. Swin Transformer with SCSC achieves better performance with 22% fewer FLOPs.
3. ResNet with SCSC improves by 5.3% with similar complexity.

Abstract

This paper presents a module, Spatial Cross-scale Convolution (SCSC), which is verified to be effective in improving both CNNs and Transformers. Nowadays, CNNs and Transformers have been successful in a variety of tasks. Especially for Transformers, increasing works achieve state-of-the-art performance in the computer vision community. Therefore, researchers start to explore the mechanism of those architectures. Large receptive fields, sparse connections, weight sharing, and dynamic weight have been considered keys to designing effective base models. However, there are still some issues to be addressed: large dense kernels and self-attention are inefficient, and large receptive fields make it hard to capture local features. Inspired by the above analyses and to solve the mentioned problems, in this paper, we design a general module taking in these design keys to enhance both CNNs and…
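As a rough illustration of the abstract's point that large dense kernels are inefficient, the sketch below counts parameters and multiply-accumulates for a single dense 7x7 convolution and compares them with a layer that splits its channels into groups, each using a depthwise kernel of a different size. This is not the SCSC module itself, whose details are not given in this listing; the function names, the channel count (64), the feature-map size (56x56), and the kernel sizes (3, 5, 7, 9) are all assumptions chosen only for the arithmetic.

```python
def conv_cost(c_in, c_out, k, h, w, groups=1):
    """Parameters and multiply-accumulates (MACs) of a k x k convolution.

    Assumes stride 1, 'same' padding, and no bias, so the output keeps
    the h x w spatial size of the input.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    params = (c_in // groups) * c_out * k * k
    macs = params * h * w  # each weight is applied once per output position
    return params, macs


def multi_scale_depthwise_cost(channels, kernel_sizes, h, w):
    """Cost of splitting channels evenly across several depthwise kernels.

    Hypothetical layout for illustration: each branch receives
    channels / len(kernel_sizes) channels and applies one depthwise
    convolution (groups == channels of the branch) with its own kernel size.
    """
    assert channels % len(kernel_sizes) == 0
    per_branch = channels // len(kernel_sizes)
    total_params, total_macs = 0, 0
    for k in kernel_sizes:
        p, m = conv_cost(per_branch, per_branch, k, h, w, groups=per_branch)
        total_params += p
        total_macs += m
    return total_params, total_macs


# Assumed sizes, for illustration only.
dense = conv_cost(64, 64, 7, 56, 56)
split = multi_scale_depthwise_cost(64, [3, 5, 7, 9], 56, 56)

print("dense 7x7:            params=%d macs=%d" % dense)
print("multi-scale depthwise: params=%d macs=%d" % split)
```

Under these assumed sizes the dense layer needs 200,704 weights while the split depthwise layout needs 2,624, even though one of its branches uses a larger 9x9 kernel. Depthwise layers do not mix channels, so real designs usually pair them with 1x1 convolutions, which this count leaves out.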

Taxonomy

Topics: Face recognition and analysis · Video Surveillance and Tracking Methods · Face and Expression Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · 1x1 Convolution · Absolute Position Encodings · Layer Normalization · Stochastic Depth · Adam · Softmax