ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

TL;DR
ConvNeXt V2 co-designs a fully convolutional masked autoencoder framework with a new Global Response Normalization (GRN) layer, significantly improving ConvNet performance on ImageNet, COCO, and ADE20K.
Contribution
The paper presents the ConvNeXt V2 architecture, which pairs a fully convolutional masked autoencoder for self-supervised pre-training with a GRN layer that enhances inter-channel feature competition.
Findings
Achieves 88.9% ImageNet top-1 accuracy with a 650M-parameter model.
Outperforms previous ConvNet models on the COCO and ADE20K benchmarks.
Provides diverse pre-trained models from efficient to large-scale.
Abstract
Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement…
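To make the GRN idea concrete, here is a minimal NumPy sketch of a normalization layer that induces inter-channel feature competition. The abstract above does not give the formula, so the three steps used here (per-channel global L2 aggregation, divisive normalization by the channel mean, and a learnable calibration with a residual connection) follow a common reading of the paper and should be treated as an assumption. The function name `global_response_norm`, the channels-last layout, and the `eps` value are illustrative choices, not the authors' reference implementation.

```python
import numpy as np


def global_response_norm(x, gamma, beta, eps=1e-6):
    """Sketch of a GRN-style layer (assumed formulation, see lead-in).

    x:     array of shape (N, H, W, C), channels-last (assumed layout)
    gamma: learnable scale of shape (C,)
    beta:  learnable shift of shape (C,)
    """
    # 1) Global aggregation: L2 norm of each channel over the spatial dims.
    gx = np.sqrt(np.sum(x ** 2, axis=(1, 2), keepdims=True))  # (N, 1, 1, C)
    # 2) Divisive normalization: each channel's norm relative to the mean
    #    norm across channels, so channels compete with one another.
    nx = gx / (np.mean(gx, axis=-1, keepdims=True) + eps)     # (N, 1, 1, C)
    # 3) Calibration: rescale the input, then add a residual connection.
    return gamma * (x * nx) + beta + x


# Usage: with gamma = beta = 0 the layer reduces to the identity.
x = np.random.default_rng(0).normal(size=(2, 4, 4, 3))
y = global_response_norm(x, gamma=np.zeros(3), beta=np.zeros(3))
```

With `gamma` and `beta` at zero the residual term makes the layer an identity map, which is one reason such a layer can be added to an existing block without disturbing it at initialization. With a nonzero `gamma`, channels whose global response is above the cross-channel average are amplified and weaker ones are damped.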
Code & Models
- 🤗 facebook/convnextv2-large-22k-384 · model · 189 dl · ♡ 3
- 🤗 timm/convnextv2_atto.fcmae · model · 554 dl
- 🤗 timm/convnextv2_atto.fcmae_ft_in1k · model · 19k dl
- 🤗 timm/convnextv2_base.fcmae · model · 2.3k dl · ♡ 1
- 🤗 timm/convnextv2_base.fcmae_ft_in1k · model · 1.2k dl
- 🤗 timm/convnextv2_base.fcmae_ft_in22k_in1k · model · 112k dl · ♡ 3
- 🤗 timm/convnextv2_base.fcmae_ft_in22k_in1k_384 · model · 21k dl
- 🤗 timm/convnextv2_femto.fcmae · model · 152 dl
- 🤗 timm/convnextv2_femto.fcmae_ft_in1k · model · 10k dl
- 🤗 timm/convnextv2_huge.fcmae · model · 112 dl
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: ConvNeXt
