Masked Diffusion as Self-supervised Representation Learner
Zixuan Pan, Jianxu Chen, Yiyu Shi

TL;DR
This paper introduces the Masked Diffusion Model (MDM), a self-supervised learning method that replaces the additive Gaussian noise of diffusion models with masking, significantly improving semantic segmentation performance on both medical and natural images.
Contribution
The paper proposes MDM, a novel diffusion-based self-supervised learning approach using masking instead of noise, enhancing segmentation tasks especially in few-shot settings.
Findings
Outperforms prior benchmarks in medical and natural image segmentation
Achieves significant improvements in few-shot segmentation scenarios
Demonstrates the effectiveness of masking in diffusion models
Abstract
Denoising diffusion probabilistic models have recently demonstrated state-of-the-art generative performance and have been used as strong pixel-level representation learners. This paper decomposes the interrelation between the generative capability and representation learning ability inherent in diffusion models. We present the masked diffusion model (MDM), a scalable self-supervised representation learner for semantic segmentation, substituting the conventional additive Gaussian noise of traditional diffusion with a masking mechanism. Our proposed approach convincingly surpasses prior benchmarks, demonstrating remarkable advancements in both medical and natural image semantic segmentation tasks, particularly in few-shot scenarios.
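The substitution described in the abstract, masking in place of additive Gaussian noise, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear schedule in `mask_ratio`, the patch size, the zero fill value, and every function name here are assumptions made for illustration, since the page does not state them. The only idea taken from the text is that the corruption applied at timestep `t` is a mask whose extent depends on `t`.

```python
import numpy as np

def mask_ratio(t, num_steps):
    """Hypothetical linear schedule: fraction of patches hidden at step t."""
    return t / num_steps

def mask_patches(image, t, num_steps, patch=4, rng=None):
    """Hide a random subset of non-overlapping patches of a 2D image.

    Returns the masked image and the boolean pixel mask (True = hidden).
    Zero-filling hidden pixels is an assumption, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile into patches"
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked = int(round(mask_ratio(t, num_steps) * n_patches))

    # Pick which patches to hide, without replacement.
    hidden = np.zeros(n_patches, dtype=bool)
    hidden[rng.choice(n_patches, size=n_masked, replace=False)] = True

    # Expand the patch-level mask to pixel resolution.
    pixel_mask = hidden.reshape(gh, gw).repeat(patch, axis=0).repeat(patch, axis=1)
    masked = np.where(pixel_mask, 0.0, image)
    return masked, pixel_mask

# Usage: a larger t hides more of the image.
image = np.ones((16, 16))
masked, pixel_mask = mask_patches(image, t=500, num_steps=1000,
                                  rng=np.random.default_rng(0))
```

Under this sketch, holding `t` fixed gives a constant masking ratio, which corresponds to the reviewer observation below that the method then reduces to a masked autoencoder.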
Peer Reviews
Decision: Submitted to ICLR 2024
1. The statistical results shown in Table 1 look promising if all the methods are compared fairly.
1. The writing of this paper needs to be improved. Many claims in the introduction are not well supported (e.g. "such efforts risk deviating from the theoretical underpinnings of diffusions") and are not well organized.
2. It is not convincing to conclude that the learned representation is better when it is tested only on segmentation downstream tasks.
3. The choice of SSIM over MSE is rather empirical and not well justified.
- The paper presents an extensive experimental evaluation on 2 natural and 2 medical image data sets, with ablation studies.
- My main concern is the novelty of the method. The paper mentions that with a fixed t, the method degrades to a vanilla masked autoencoder with SSIM loss. This basically means that the only contribution of the paper is masking the image with a dynamic masking ratio during training, which makes me question the significance of the contribution.
- Although the improvement achieved by this small change is interesting on Glas 10% case (IOU is 76.19 for MAE and 82.70 for MDM with MSE; which is quite a si
Strengths: 1. **Novel Concept**: The paper presents a new approach in self-supervised learning using diffusion models. The probabilistic mask for data occlusion offers an alternative to traditional static methods, suggesting a different way to approach representation learning. 2. **Empirical Evidence**: The results and ablation studies provide evidence of the method's performance. The proposed technique shows improvements over vanilla MAE, DDPM, and certain traditional models on segmentation d
Areas of Improvement for the Paper: 1. **Benchmarking for Segmentation Tasks**: The primary focus on segmentation necessitates benchmarking against specialized self-supervised learning (SSL) methods designed for this task, both at the instance-level and pixel/patch-level. A direct comparison with methods such as Leopart, IIC, MaskContrast, DenseCL, MoCoV2, and DINO on standard datasets like COCO and PVOC would provide a holistic evaluation. Refer to paper: Self-Supervised Learning of Object Par
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Computational Physics and Python Applications
Methods: Diffusion
