TL;DR
CDG-MAE introduces a self-supervised learning approach that generates diverse synthetic views from static images using diffusion models, enabling improved dense correspondence learning without extensive video data.
Contribution
The paper proposes a novel MAE-based method utilizing synthetic views from diffusion models, enhancing self-supervised correspondence learning beyond traditional image crops and reducing reliance on video datasets.
Findings
Outperforms existing image-based MAE methods in correspondence tasks.
Effectively narrows the performance gap between image-based and video-based approaches.
Demonstrates the effectiveness of synthetic view generation for self-supervised learning.
Abstract
Learning dense correspondences, critical for applications such as video label propagation, is hindered by tedious and unscalable manual annotation. Self-supervised methods address this by using a cross-view pretext task, often modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge: collecting diverse video datasets is difficult and costly, while simple image crops lack the necessary pose variations. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video and crop-based anchors. We present a quantitative method to…
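The cross-view pretext task described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: the function names, the patch size of 16, the 0.75 mask ratio, and the choice to score only the masked patches are all assumptions made for the example, and the encoder/decoder that would actually produce the prediction is omitted.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a hidden (masked) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def cross_view_inputs(anchor, target, patch=16, mask_ratio=0.75, seed=0):
    """Build the inputs for one anchor/target pair.

    The anchor view stays fully visible here; only the target view is masked.
    (Hypothetical helper; patch size and mask ratio are illustrative values.)
    """
    rng = np.random.default_rng(seed)
    anchor_tokens = patchify(anchor, patch)
    target_tokens = patchify(target, patch)
    mask = random_mask(len(target_tokens), mask_ratio, rng)
    visible_target = target_tokens[~mask]
    return anchor_tokens, visible_target, mask, target_tokens

def masked_mse(pred, target_tokens, mask):
    """Mean squared error over masked target patches only (an assumed loss)."""
    diff = pred[mask] - target_tokens[mask]
    return float(np.mean(diff ** 2))

if __name__ == "__main__":
    anchor = np.zeros((32, 32, 3))  # stand-in for a generated anchor view
    target = np.ones((32, 32, 3))   # stand-in for the view to reconstruct
    a, vis, mask, full = cross_view_inputs(anchor, target)
    pred = np.zeros_like(full)      # placeholder for a decoder's output
    print(a.shape, vis.shape, int(mask.sum()), masked_mse(pred, full, mask))
```

In a real model, the prediction would come from a decoder that attends from the visible target patches to the anchor tokens, which is what pushes the network to learn correspondences between the two views.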
Peer Reviews
Decision·Submitted to ICLR 2026
- Creative use of diffusion for correspondence learning, addressing the lack of video data for cross-view pretraining.
- Multi-anchor masking is a well-motivated and effective extension to SiamMAE.
- Comprehensive experiments show consistent gains across three datasets, with strong ablations on masking ratios and diffusion backbones.

- The technical novelty mainly lies in the proposed consistency metrics (GS–LS–NPS) for selecting diffusion-generated views, but their contribution is not deeply analyzed (e.g., what happens if LS is omitted, if the metric is removed entirely, or whether GS alone suffices).
- Other elements (diffusion-based augmentation, Siamese MAE) are incremental combinations of prior work (Gen-SIS, CropMAE).
- The experimental organization could be improved by presenting the main comparison table earlier.

- The idea of introducing self-supervision diversity through diffusion-generated images is interesting and addresses well-identified issues of crop and video strategies.
- The proposed multi-anchor and anchor masking techniques are sound and seem to be effective.
- The ablation on the design choices is solid and covers a lot of variables.
- The proposed model achieves the state of the art in most of the metrics, proving the performance claims. The authors show that their approach closes the gap

- There is not a lot of discussion on the choice of the diffusion model. The authors have chosen an augmentation model. I wonder if novel view models (e.g. ViewCrafter) were considered. It would be a great comparison, and such an approach could enable control over the camera pose.
- It is not fully clear what the impact of separate components is. You could potentially apply multi-anchor and anchor masking to the CropMAE approach and investigate how that affects the performance.
- I would like t

- The paper explores the use of diffusion models to generate cross-view data for self-supervised MAE training, offering an alternative way to learn view-consistent representations from static images.
- The study includes systematic ablations on diffusion model choice, number of anchors, masking ratios, and patch sizes, with consistent results and clear performance trends.
- The method achieves performance close to video-based models when trained only on static images, showing feasibility under c

- **Outdated motivation:** The central premise, that using video data is costly, is no longer convincing given the availability of large-scale open video datasets and efficient video generation models (e.g., Cosmos, HunyuanVideo, Wan). The motivation therefore is outdated and lacks contemporary relevance.
- The image diffusion-generated views are uncontrolled and may not preserve true viewpoint or structural consistency. As a result, the model primarily learns perceptual similarity rather than g
