CDG-MAE: Learning Correspondences from Diffusion Generated Views

Varun Belagali; Pierre Marza; Srikar Yellapragada; Zilinghan Li; Tarak Nath Nandi; Ravi K Madduri; Joel Saltz; Stergios Christodoulidis; Maria Vakalopoulou; Dimitris Samaras

arXiv:2506.18164·cs.CV·June 24, 2025

3 Reviews

TL;DR

CDG-MAE introduces a self-supervised learning approach that generates diverse synthetic views from static images using diffusion models, enabling improved dense correspondence learning without extensive video data.

Contribution

The paper proposes a novel MAE-based method utilizing synthetic views from diffusion models, enhancing self-supervised correspondence learning beyond traditional image crops and reducing reliance on video datasets.

Findings

01. Outperforms existing image-based MAE methods in correspondence tasks.

02. Effectively narrows the performance gap between image-based and video-based approaches.

03. Demonstrates the effectiveness of synthetic view generation for self-supervised learning.

Abstract

Learning dense correspondences, critical for applications such as video label propagation, is hindered by tedious and unscalable manual annotation. Self-supervised methods address this by using a cross-view pretext task, often modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge: collecting diverse video datasets is difficult and costly, while simple image crops lack the necessary pose variations. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video and crop-based anchors. We present a quantitative method to…
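The cross-view pretext task described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the helper names (`patchify`, `random_mask`, `copy_from_anchor`, `masked_reconstruction_loss`), the 0.75 mask ratio, and the 2x2 patch size are all assumptions made for the example, and the "decoder" is a placeholder that copies the anchor patch at the same position, where a real model would use a learned network.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Return the sorted indices of target patches that are hidden."""
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def copy_from_anchor(anchor_patches, target_patches, masked_idx):
    """Placeholder decoder (assumption): fill each masked target patch
    with the anchor patch at the same position."""
    hidden = set(masked_idx)
    return [
        anchor_patches[i] if i in hidden else target_patches[i]
        for i in range(len(target_patches))
    ]


def masked_reconstruction_loss(pred, target, masked_idx):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Toy data: the target view is the anchor view shifted by a constant intensity.
rng = random.Random(0)
anchor = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
target = [[v + 1.0 for v in row] for row in anchor]

anchor_patches = patchify(anchor, 2)
target_patches = patchify(target, 2)
masked = random_mask(len(target_patches), 0.75, rng)  # hypothetical mask ratio
pred = copy_from_anchor(anchor_patches, target_patches, masked)
loss = masked_reconstruction_loss(pred, target_patches, masked)
```

Because the loss is computed only on hidden target patches, the model has to pull the missing content from the anchor view, which is what ties the reconstruction objective to correspondence.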

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- Creative use of diffusion for correspondence learning, addressing the lack of video data for cross-view pretraining.
- Multi-anchor masking is a well-motivated and effective extension to SiamMAE.
- Comprehensive experiments show consistent gains across three datasets, with strong ablations on masking ratios and diffusion backbones.

Weaknesses

- The technical novelty mainly lies in the proposed consistency metrics (GS–LS–NPS) for selecting diffusion-generated views, but their contribution is not deeply analyzed (e.g., what happens if LS is omitted, if the metric is removed entirely, or if GS alone suffices?).
- Other elements (diffusion-based augmentation, Siamese MAE) are incremental combinations of prior work (Gen-SIS, CropMAE).
- The experimental organization could be improved by presenting the main comparison table earlier.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The idea of introducing self-supervision diversity through diffusion-generated images is interesting and addresses well-identified issues of crop and video strategies.
- The proposed multi-anchor and anchor masking techniques are sound and seem to be effective.
- The ablation on the design choices is solid and covers a lot of variables.
- The proposed model achieves the state of the art in most of the metrics, proving the performance claims. The authors show that their approach closes the gap

Weaknesses

- There is not a lot of discussion on the choice of the diffusion model. The authors have chosen an augmentation model. I wonder if novel view models (e.g. ViewCrafter) were considered. It would be a great comparison, and such an approach could enable control over the camera pose.
- It is not fully clear what the impact of separate components is. You could potentially apply multi-anchor and anchor masking to the CropMAE approach and investigate how that affects the performance.
- I would like t

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- The paper explores the use of diffusion models to generate cross-view data for self-supervised MAE training, offering an alternative way to learn view-consistent representations from static images.
- The study includes systematic ablations on diffusion model choice, number of anchors, masking ratios, and patch sizes, with consistent results and clear performance trends.
- The method achieves performance close to video-based models when trained only on static images, showing feasibility under c

Weaknesses

- **Outdated motivation:** The central premise, that using video data is costly, is no longer convincing given the availability of large-scale open video datasets and efficient video generation models (e.g., Cosmos, HunyuanVideo, Wan). The motivation therefore is outdated and lacks contemporary relevance.
- The image diffusion-generated views are uncontrolled and may not preserve true viewpoint or structural consistency. As a result, the model primarily learns perceptual similarity rather than g
