Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
Shuhong Zheng, Zhipeng Bao, Ruoyu Zhao, Martial Hebert, Yu-Xiong Wang

TL;DR
This paper introduces Diff-2-in-1, a unified diffusion-based framework that simultaneously advances multi-modal data generation and dense visual perception, enhancing performance through a self-improving learning mechanism.
Contribution
The work presents a novel unified diffusion framework that integrates generation and perception tasks, using the denoising process both to create multi-modal data and to drive self-improvement, a combination the authors describe as previously unexplored.
Findings
Consistent performance improvements across various backbones.
High-quality multi-modal data generation that is both realistic and useful.
Effective enhancement of discriminative visual perception.
Abstract
Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors. In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the utilization of…
Peer Reviews
Decision: ICLR 2025 Poster
1. Unified Framework: The paper integrates generative and discriminative processes to improve discriminative tasks using the diffusion model. 2. Favorable performance on benchmarks: The method is evaluated on NYUD-MT and PASCAL-Context, and shows consistent performance improvements.
1. The term “unified framework” is overstated and potentially misleading. The proposed method does not employ a single model to handle both data generation and prediction. Instead, it uses a pre-trained latent diffusion model for RGB image generation and a separate task-specific head for discriminative tasks. While there is interaction between these components during the update process, the term “unified” suggests a more cohesive integration than what is presented. From the description, it is li
1. The approach of using synthetic data pairs and integrating a self-improving mechanism is interesting and valuable for discriminative tasks using pre-trained diffusion models. 2. The method shows promise in scenarios with limited training data. 3. Comprehensive experiments and ablation studies across multiple dense prediction tasks (surface normal estimation, semantic segmentation, depth estimation) and multiple datasets demonstrate the effectiveness and versatility of the approach.
1. The overall writing lacks clarity. The paper's main focus is on improving discriminative learning using synthetic data. Claiming it as "A single, unified diffusion-based model for both generative and discriminative learning" may not be appropriate, as the method does not improve generative learning. 2. The method section should clearly state which parameters are trained or frozen during different stages. 3. Figure 2 needs clarification. If understood correctly, in the left subfigure, the data crea
1. The authors propose a novel method with a self-improving mechanism to bridge high-fidelity image synthesis and dense visual perception tasks. 2. The paper is well written, and its structure is well organized and easy to follow. 3. The authors conduct extensive experiments and ablation studies to demonstrate the superiority of their designs, providing impressive qualitative and quantitative results.
1. The preliminary section could be shortened, and several visual comparisons from the appendix could be moved into the main paper. 2. Table 2 shows that Diff-2-in-1 improves the diffusion-based segmentation method by only 0.5, which seems incremental. Please explain the reason.
Taxonomy
Topics: Neural Networks and Applications
Methods: Diffusion
