PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling
Junmyeong Lee, Eui Jun Hwang, Sukmin Cho, Jong C. Park

TL;DR
PiLaMIM is a unified masked image modeling framework that combines pixel-level and latent-level reconstruction to capture both low-level details and high-level semantics, leading to richer visual representations.
Contribution
The paper introduces a framework that integrates Pixel MIM and Latent MIM using a single encoder and two decoders, one per reconstruction target, to improve the quality of the learned visual features.
Findings
PiLaMIM outperforms baselines such as MAE, I-JEPA, and BootMAE across various tasks.
The method effectively captures both low-level and high-level visual features.
Incorporating the CLS token improves global context understanding.
Abstract
In Masked Image Modeling (MIM), two primary methods exist: Pixel MIM and Latent MIM, each utilizing different reconstruction targets, raw pixels and latent representations, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on high-level semantics of an object. However, these distinct strengths of each method can lead to suboptimal performance in tasks that rely on a particular level of visual features. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features. We further integrate the CLS token into the reconstruction…
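The single-encoder, two-decoder design described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The random linear layers, the mean-pooled context, the frozen copy of the encoder used to produce latent targets, the 0.75 mask ratio, and the `latent_weight` parameter are all assumptions made for illustration, and the CLS-token component is omitted because its details are not given here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values come from the paper.
NUM_PATCHES, PATCH_DIM, LATENT_DIM = 16, 48, 32

def random_linear(in_dim, out_dim):
    """Random weights standing in for a trained layer."""
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))

W_enc = random_linear(PATCH_DIM, LATENT_DIM)   # shared encoder
W_tgt = W_enc.copy()                           # assumed target encoder (frozen copy)
W_pix = random_linear(LATENT_DIM, PATCH_DIM)   # decoder 1: predicts raw pixels
W_lat = random_linear(LATENT_DIM, LATENT_DIM)  # decoder 2: predicts latent targets
pos = rng.normal(size=(NUM_PATCHES, LATENT_DIM))  # positional embeddings

def make_mask(num_patches, mask_ratio=0.75):
    """Boolean mask with True marking patches hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def dual_target_loss(patches, mask, latent_weight=1.0):
    """Combine a pixel reconstruction loss and a latent reconstruction loss."""
    # The shared encoder sees only the visible patches.
    visible_latents = patches[~mask] @ W_enc + pos[~mask]
    context = visible_latents.mean(axis=0)

    # One query per masked position, built from the context and its position.
    queries = np.tanh(context + pos[mask])

    pixel_pred = queries @ W_pix    # low-level target: raw pixel values
    latent_pred = queries @ W_lat   # high-level target: latent representation

    pixel_target = patches[mask]
    latent_target = (patches @ W_tgt)[mask]  # target encoder sees the full image

    pixel_loss = np.mean((pixel_pred - pixel_target) ** 2)
    latent_loss = np.mean((latent_pred - latent_target) ** 2)
    total = pixel_loss + latent_weight * latent_loss
    return total, pixel_loss, latent_loss

patches = rng.normal(size=(NUM_PATCHES, PATCH_DIM))
mask = make_mask(NUM_PATCHES)
total, pixel_loss, latent_loss = dual_target_loss(patches, mask)
print(f"pixel={pixel_loss:.3f} latent={latent_loss:.3f} total={total:.3f}")
```

Both losses are computed only on masked positions and summed, so gradients from the pixel target and the latent target would both flow into the same encoder. How the two terms are actually weighted in PiLaMIM is not stated in this summary.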
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Mutual Information Machine/Mask Image Modeling · Masked autoencoder
