SemanticMIM: Marrying Masked Image Modeling with Semantics Compression for General Visual Representation
Yike Yuan, Huanzhang Dou, Fengjun Guo, and Xi Li

TL;DR
SemanticMIM is a novel framework that combines masked image modeling and contrastive learning to improve general visual representations by leveraging their complementary strengths in compression and reconstruction.
Contribution
It introduces a proxy architecture that effectively integrates MIM and CL, enhancing semantic awareness and interpretability in visual representations.
Findings
Significant performance improvements over existing methods.
Enhanced feature linear separability and semantic understanding.
Improved interpretability through attention visualization.
Abstract
This paper presents a neat yet effective framework, named SemanticMIM, that integrates the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation. We conduct a thorough comparative analysis of CL and MIM, revealing that their complementary advantages fundamentally stem from two distinct phases, i.e., compression and reconstruction. Specifically, SemanticMIM leverages a proxy architecture that customizes the interaction between image and mask tokens, bridging these two phases to achieve a general visual representation with rich semantic and positional awareness. Through extensive qualitative and quantitative evaluations, we demonstrate that SemanticMIM effectively combines the benefits of CL and MIM, leading to significant improvements in performance and feature linear separability. SemanticMIM also offers notable…
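The abstract describes a proxy architecture that routes information between image and mask tokens in two phases. One way to picture this is as an attention mask over three token groups. The following is only a minimal sketch under assumptions that the abstract does not confirm: the name `build_proxy_attention_mask`, the token ordering, and the exact set of allowed query/key pairs are all hypothetical choices made for illustration, not the authors' implementation.

```python
def build_proxy_attention_mask(num_image, num_proxy, num_mask):
    """Return allowed[q][k] = True if query token q may attend to key token k.

    Assumed token order: [image tokens | proxy tokens | mask tokens].
    Assumed rules (hypothetical, for illustration only):
      - compression: proxy tokens read from image tokens (and each other)
      - reconstruction: mask tokens read only from proxy tokens (and each other)
      - image tokens attend among themselves
    """
    kinds = ["image"] * num_image + ["proxy"] * num_proxy + ["mask"] * num_mask
    allowed_pairs = {
        ("image", "image"),
        ("proxy", "image"),  # compression phase
        ("proxy", "proxy"),
        ("mask", "proxy"),   # reconstruction phase
        ("mask", "mask"),
    }
    # Rows are queries, columns are keys.
    return [[(q, k) in allowed_pairs for k in kinds] for q in kinds]


# Example: 2 visible image tokens, 1 proxy token, 2 mask tokens.
attention_mask = build_proxy_attention_mask(2, 1, 2)
```

Under these assumed rules, mask tokens never see image tokens directly, so whatever they reconstruct has to pass through the proxy tokens, which is the sense in which the proxy bridges compression and reconstruction.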
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · 3D Surveying and Cultural Heritage
Methods: Contrastive Learning · Masked Image Modeling
