CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework

Jiaxuan Li; Qing Xu; Xiangjian He; Ziyu Liu; Chang Xing; Zhen Chen; Daokun Zhang; Rong Qu; Chang Wen Chen

arXiv:2511.05929 · cs.CV · November 11, 2025

TL;DR

CoMA introduces a complementary masking strategy and a hierarchical vision transformer with dynamic multi-window self-attention, significantly improving pre-training efficiency and adaptability in image representation learning.

Contribution

The paper proposes CoMA, a masked autoencoder that uses a complementary masking strategy, and DyViT, a hierarchical vision transformer with dynamic multi-window self-attention. Together they improve learning efficiency and model adaptability over existing MAE methods.

Findings

01. Pre-trained CoMA-DyViT matches MAE performance with only 12% of pre-training epochs.

02. DyViT reduces parameters and FLOPs while improving fine-grained feature learning.

03. Pre-training time per epoch is reduced by 10% with DyViT.

Abstract

Masked Autoencoders (MAE) achieve self-supervised learning of image representations by randomly removing a portion of visual tokens and reconstructing the original image as a pretext task, thereby significantly enhancing pretraining efficiency and yielding excellent adaptability across downstream tasks. However, MAE and other MAE-style paradigms that adopt random masking generally require more pre-training epochs to maintain adaptability. Meanwhile, ViT in MAE suffers from inefficient parameter use due to fixed spatial resolution across layers. To overcome these limitations, we propose the Complementary Masked Autoencoders (CoMA), which employ a complementary masking strategy to ensure uniform sampling across all pixels, thereby improving effective learning of all features and enhancing the model's adaptability. Furthermore, we introduce DyViT, a hierarchical vision transformer that…
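
The complementary masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the simplest reading, in which one random patch mask is drawn and its exact complement forms a second view, so that every patch is visible in exactly one of the two views. The function name `complementary_masks`, the 0.5 mask ratio, and the 196-patch grid (a 224x224 image with 16x16 patches) are illustrative assumptions, not values taken from the paper.

```python
import random


def complementary_masks(num_patches, mask_ratio=0.5, seed=None):
    """Return two boolean patch masks (True = masked) that are exact complements.

    Hypothetical helper for illustration only; the paper's actual
    masking procedure may differ.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random patch order, as in MAE-style random masking

    num_masked = round(num_patches * mask_ratio)
    masked_in_a = set(order[:num_masked])

    mask_a = [i in masked_in_a for i in range(num_patches)]
    mask_b = [not m for m in mask_a]  # the complement of view A
    return mask_a, mask_b


# Assumed setup: 14 x 14 = 196 patches.
mask_a, mask_b = complementary_masks(num_patches=196, mask_ratio=0.5, seed=0)

# Indices of the patches each view would feed to the encoder.
visible_a = [i for i, masked in enumerate(mask_a) if not masked]
visible_b = [i for i, masked in enumerate(mask_b) if not masked]

# Every patch is visible in exactly one view, so coverage is uniform.
covered = sorted(visible_a + visible_b)
print(len(visible_a), len(visible_b), covered == list(range(196)))
```

With plain random masking, which patches are seen over the course of training is left to chance. Under the assumption above, each pair of views covers the image exactly once, which is one way to realize the "uniform sampling across all pixels" that the abstract describes.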

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications