PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling

Junmyeong Lee; Eui Jun Hwang; Sukmin Cho; Jong C. Park

arXiv:2501.03005 · cs.CV · January 7, 2025

TL;DR

PiLaMIM is a unified masked image modeling framework that combines pixel-level and latent-level reconstruction to capture both low-level details and high-level semantics, leading to richer visual representations.

Contribution

It introduces a novel framework that integrates Pixel MIM and Latent MIM using a single encoder and two decoders, enhancing the quality of visual features learned.

Findings

01. PiLaMIM outperforms baselines such as MAE, I-JEPA, and BootMAE in various tasks.

02. The method effectively captures both low-level and high-level visual features.

03. Incorporating the CLS token improves global context understanding.

Abstract

In Masked Image Modeling (MIM), two primary methods exist: Pixel MIM and Latent MIM, each utilizing different reconstruction targets, raw pixels and latent representations, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on high-level semantics of an object. However, these distinct strengths of each method can lead to suboptimal performance in tasks that rely on a particular level of visual features. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features. We further integrate the CLS token into the reconstruction…
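The single-encoder, two-decoder objective described in the abstract can be sketched in a few lines. This is a toy illustration only, not the authors' implementation. Several things here are assumptions that do not come from the abstract: bias-free linear layers stand in for Transformer blocks, mean pooling stands in for attention over visible patches, the latent targets come from a separate fixed `target_encoder`, the two losses are combined with a hypothetical `latent_weight`, the mask ratio is 0.75, and the CLS token is left out. All function and parameter names are hypothetical.

```python
import random

def dense(weights, vec):
    """Apply a bias-free linear layer, given as a list of rows, to one vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def init_weights(rows, cols, rng):
    """Small random weight matrix with `rows` outputs and `cols` inputs."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def pilamim_style_loss(patches, masked, params, latent_weight=1.0):
    """Return (total, pixel, latent) losses averaged over the masked patches.

    `masked` is a set of patch indices hidden from the encoder.
    `latent_weight` is an assumed weighting, not a value from the paper.
    """
    visible = [p for i, p in enumerate(patches) if i not in masked]
    # Single shared encoder: it only sees the visible patches.
    # Mean pooling is a stand-in for attention (assumption).
    context = mean_vector([dense(params["encoder"], p) for p in visible])
    pixel_loss = latent_loss = 0.0
    for i in masked:
        # A position embedding tells both decoders which patch to predict.
        query = [c + e for c, e in zip(context, params["pos"][i])]
        # Decoder 1: reconstruct the raw pixels of the masked patch.
        pixel_loss += mse(dense(params["pixel_decoder"], query), patches[i])
        # Decoder 2: predict a latent target for the same patch.
        # Using a separate fixed target encoder is an assumption.
        target_latent = dense(params["target_encoder"], patches[i])
        latent_loss += mse(dense(params["latent_decoder"], query), target_latent)
    pixel_loss /= len(masked)
    latent_loss /= len(masked)
    return pixel_loss + latent_weight * latent_loss, pixel_loss, latent_loss

rng = random.Random(0)
NUM_PATCHES, PATCH_DIM, LATENT_DIM, MASK_RATIO = 16, 12, 8, 0.75  # assumed sizes
patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(NUM_PATCHES)]
masked = set(rng.sample(range(NUM_PATCHES), int(NUM_PATCHES * MASK_RATIO)))
params = {
    "encoder": init_weights(LATENT_DIM, PATCH_DIM, rng),
    "target_encoder": init_weights(LATENT_DIM, PATCH_DIM, rng),
    "pixel_decoder": init_weights(PATCH_DIM, LATENT_DIM, rng),
    "latent_decoder": init_weights(LATENT_DIM, LATENT_DIM, rng),
    "pos": init_weights(NUM_PATCHES, LATENT_DIM, rng),
}
total, pixel_loss, latent_loss = pilamim_style_loss(patches, masked, params)
print(f"pixel={pixel_loss:.4f} latent={latent_loss:.4f} total={total:.4f}")
```

Calling `pilamim_style_loss(patches, masked, params, latent_weight=0.0)` reduces the sketch to a pixel-only objective, which makes it easy to see that the latent decoder contributes a second, independent training signal on top of the same encoder output.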

Code & Models

Repositories

joonmy/pilamim (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Mutual Information Machine/Mask Image Modeling · Masked autoencoder