MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

Jongseong Bae; Susang Kim; Minsu Cho; Ha Young Kim

arXiv:2411.18995·cs.CV·December 2, 2024

TL;DR

MVFormer introduces multi-view normalization (MVN) and a multi-view token mixer (MVTM) to diversify feature learning, improving both the efficiency and the accuracy of vision transformers across multiple vision tasks.

Contribution

The paper proposes two components, multi-view normalization (MVN) and a multi-view token mixer (MVTM), and integrates them into a new ViT model, MVFormer, to enhance feature diversity and multi-scale token interaction.

Findings

01. Outperforms state-of-the-art convolution-based ViTs on multiple vision tasks.

02. Achieves high accuracy on ImageNet-1K with fewer parameters and MACs.

03. Demonstrates the effectiveness of diversified normalization and token mixing strategies.

Abstract

Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN integrates three differently normalized features via batch, layer, and instance normalization using a learnable weighted sum. Each normalization method outputs a different distribution, generating distinct features. Thus, the MVN is expected to offer diverse pattern information to the token mixer, resulting in beneficial synergy. The MVTM is a convolution-based multiscale token mixer with local, intermediate, and global filters, and it incorporates stage specificity by configuring various…
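The MVN description above (batch, layer, and instance normalization combined by a learnable weighted sum) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `_normalize` and `multi_view_norm`, the (N, C, H, W) layout, the three plain scalar weights, and the absence of affine terms and running statistics are all assumptions made for illustration, since the abstract does not specify how the weights are parameterized.

```python
import numpy as np


def _normalize(x, axes, eps=1e-5):
    """Standardize x to zero mean and unit variance over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def multi_view_norm(x, weights, eps=1e-5):
    """Hypothetical MVN sketch for x of shape (N, C, H, W).

    weights: three scalars (w_bn, w_ln, w_in). In a real model these would
    be learnable parameters; here they are passed in (assumption).
    """
    bn = _normalize(x, (0, 2, 3), eps)   # batch norm: per channel, over batch and space
    ln = _normalize(x, (1, 2, 3), eps)   # layer norm: per sample, over channels and space
    inn = _normalize(x, (2, 3), eps)     # instance norm: per sample and channel, over space
    w_bn, w_ln, w_in = weights
    return w_bn * bn + w_ln * ln + w_in * inn


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 16, 16))      # toy feature map: batch 4, 8 channels, 16x16
y = multi_view_norm(x, weights=(1 / 3, 1 / 3, 1 / 3))
print(y.shape)
```

The three views differ only in the axes over which statistics are computed, which is why each produces a differently distributed output from the same input. Setting the weights to `(0.0, 0.0, 1.0)` reduces the sketch to plain instance normalization, which is a convenient sanity check.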

Taxonomy

Topics: CCD and CMOS Imaging Sensors

Methods: MetaFormer · Instance Normalization