Learning Mask Invariant Mutual Information for Masked Image Modeling
Tao Huang, Yanxiang Ma, Shan You, Chang Xu

TL;DR
This paper introduces MI-MAE, a new method for masked image modeling that applies the information bottleneck principle to optimize latent features via mutual information, leading to improved performance across vision tasks.
Contribution
It provides a theoretical framework based on information theory for understanding MAEs and proposes MI-MAE, a novel mutual information-based optimization approach.
Findings
MI-MAE outperforms standard MAEs on image classification.
The approach improves object detection and segmentation results.
Theoretical analysis confirms the effectiveness of the information bottleneck in MAEs.
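For reference, the information bottleneck principle named above is usually written as a Lagrangian trade-off. The form below is the generic textbook objective, not necessarily the exact expression used in the paper; the symbols $X$ (input), $Z$ (latent feature), $Y$ (target), and $\beta$ (trade-off weight) are notation assumed here.

$$
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
$$

Minimizing $I(X;Z)$ compresses away input detail, while the $-\beta\, I(Z;Y)$ term rewards keeping the information that predicts the target.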
Abstract
Masked autoencoders (MAEs) represent a prominent self-supervised learning paradigm in computer vision. Despite their empirical success, the underlying mechanisms of MAEs remain insufficiently understood. Recent studies have attempted to elucidate the functioning of MAEs through contrastive learning and feature representation analysis, yet these approaches often provide only implicit insights. In this paper, we propose a new perspective for understanding MAEs by leveraging the information bottleneck principle in information theory. Our theoretical analyses reveal that optimizing the latent features to balance relevant and irrelevant information is key to improving MAE performance. Building upon our proofs, we introduce MI-MAE, a novel method that optimizes MAEs through mutual information maximization and minimization. By enhancing latent features to retain maximal relevant information…
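The mutual information maximization described in the abstract can be illustrated with a common surrogate. The sketch below uses the InfoNCE bound, $I(A;B) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$, as a stand-in; it is not confirmed to be the estimator MI-MAE uses, and the function names, the cosine-similarity critic, and the `temperature` value are assumptions made for illustration.

```python
import numpy as np


def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss for two (N, D) batches whose rows are positive pairs.

    Hypothetical helper: the critic is a temperature-scaled cosine
    similarity, which is an assumption and not taken from the paper.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = z_a @ z_b.T / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the matching index as the label.
    return -np.mean(np.diag(log_prob))


def mi_lower_bound(z_a, z_b, temperature=0.1):
    """Lower bound on I(A; B) implied by InfoNCE: log(N) - loss."""
    n = z_a.shape[0]
    return np.log(n) - info_nce_loss(z_a, z_b, temperature)
```

Calling `mi_lower_bound(z_a, z_b)` on two batches of paired features (for example, latents of the same image under two different masks, which is an assumed pairing) gives an estimate that can never exceed $\log N$, so larger batches are needed to certify larger mutual information.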
Peer Reviews
Decision: ICLR 2025 Poster
1. This work proposes a theoretical analysis of the performance of MAE. It proves that, under information bottleneck theory, MAE can achieve better performance. 2. A novel but simple architecture is proposed to apply information bottleneck theory to MAE, which can improve its performance. 3. Experiments are conducted on diverse tasks and convincingly support the claims.
1. The illustration of the model in Section 4.2 is not very clear. I am not sure why the architecture can separate the relevant and irrelevant parts of the latent space. 2. Several studies already isolate parts of the latent space using VAEs, GANs, or diffusion models. Therefore, I am not sure about the novelty of the proposed MI-MAE in visual tasks. 3. Experiments only cover several general visual tasks and the results seem not to be significantly better tha
Originality: This is the first paper to study MAE under the information bottleneck. It reinterprets MAE as minimizing a Lagrangian that includes two terms: the simplest effective description of the input and the distortion of the network. It then argues that MAE can only find a suboptimal solution. Finally, it introduces two explicit terms (MI maximization and minimization) that enforce the IB principle to improve MAE. Significance: the knowledge of interpreting MAE from an IB viewpoint
Clarity: The overall motivation is clear, but many small explanations are missing. (a) First, it is unclear from the discussion after Eq. (3) why MAE can only find a suboptimal effective description. The reviewer would appreciate more explanation, such as the specific constraints or limitations, preferably with formal proofs, that prevent MAE from finding the optimal solution. (b) Sorry if the reviewer has missed it, but it is not clear, after Eq. (4), why mitigating the bias $r$ would help M
1. This work provides a new perspective for analyzing MAE with MI-backed motivations, and the resulting improvements demonstrate the practicality of applying mutual information maximization and minimization within latent representations and between inputs. 2. The paper demonstrates MI-MAE’s efficacy across a variety of vision tasks, including image classification, object detection, and semantic segmentation. In reported results, MI-MAE shows better efficiency in terms of number of training epoc
1. Increased complexity in training. The additional loss terms $l^{\text{max\_mi}}$ and $l^{\text{min\_mi}}$ require weighting parameters ($\lambda_1, \lambda_2, \lambda_3$) that are determined empirically, which makes the optimization more complicated than in vanilla MAE. 2. This method uses an approximation network to estimate variational distributions for mutual information minimization. It also introduces another layer of approximation, which may not capture the true complexity of the mutual inform
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
Methods: Contrastive Learning · Masked autoencoder
