Masked Modeling for Self-supervised Representation Learning on Vision and Beyond
Siyuan Li, Luyuan Zhang, Zedong Wang, Di Wu, Lirong Wu, Zicheng Liu, Jun Xia, Cheng Tan, Yang Liu, Baigui Sun, Stan Z. Li

TL;DR
This paper provides a comprehensive review of masked modeling techniques in self-supervised learning, covering their methodologies, applications across domains, and future research directions.
Contribution
It systematically analyzes masked modeling frameworks, compares methods across fields, and discusses limitations and future prospects in self-supervised representation learning.
Findings
Masked modeling enables deep models to learn robust representations.
It is effective across vision, language, and other modalities.
The survey identifies key challenges and future research directions.
Abstract
As the deep learning revolution marches on, self-supervised learning has garnered increasing attention in recent years thanks to its remarkable representation learning ability and the low dependence on labeled data. Among these varied self-supervised techniques, masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training. This paradigm enables deep models to learn robust representations and has demonstrated exceptional performance in the context of computer vision, natural language processing, and other modalities. In this survey, we present a comprehensive review of the masked modeling framework and its methodology. We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more. Then, we systematically…
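The core idea in the abstract, hiding a proportion of the input and training the model to predict the hidden parts, can be illustrated with a minimal sketch. This is not the survey's method or any specific framework's implementation: the function names (`random_mask`, `apply_mask`, `masked_mse`), the 0.75 mask ratio, the zero fill value, and the mean-squared-error target are all assumptions chosen for illustration, and uniform random masking is only one of the masking strategies the survey covers.

```python
import random

def random_mask(num_tokens, mask_ratio, seed=None):
    """Return a list of booleans, True where a token is masked.

    Hypothetical helper: picks round(num_tokens * mask_ratio) positions
    uniformly at random.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    masked_idx = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked_idx for i in range(num_tokens)]

def apply_mask(tokens, mask, mask_value=0.0):
    """Replace masked tokens with a placeholder value (assumed to be 0.0)."""
    return [mask_value if m else t for t, m in zip(tokens, mask)]

def masked_mse(prediction, target, mask):
    """Mean squared error computed over masked positions only."""
    errors = [(p - t) ** 2 for p, t, m in zip(prediction, target, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0

# Toy "patch" values standing in for image patches or token embeddings.
tokens = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4]
mask = random_mask(len(tokens), mask_ratio=0.75, seed=0)  # illustrative ratio
corrupted = apply_mask(tokens, mask)

# A model would take `corrupted` as input and be trained to minimize
# masked_mse(model_output, tokens, mask).
```

Restricting the loss to masked positions is a design choice made for this sketch; it means the objective rewards recovering hidden content rather than copying visible input.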
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
