RealD$^2$iff: Bridging Real-World Gap in Robot Manipulation via Depth Diffusion
Xiujian Liang, Jiacheng Liu, Mingyang Sun, Qichen He, Cewu Lu, Jianhua Sun

TL;DR
This paper introduces RealD$^2$iff, a diffusion-based framework that synthesizes realistic noisy depth data from simulation, significantly enhancing zero-shot sim2real robot manipulation by bridging the visual gap caused by sensor noise.
Contribution
The work presents a hierarchical diffusion model with novel global and local noise modeling strategies, enabling realistic depth synthesis and zero-shot sim2real transfer in robotic manipulation.
Findings
Effective depth noise synthesis from simulation
Zero-shot sim2real robot manipulation achieved
No manual real sensor data collection needed
Abstract
Robot manipulation in the real world is fundamentally constrained by the visual sim2real gap, where depth observations collected in simulation fail to reflect the complex noise patterns inherent to real sensors. In this work, inspired by the denoising capability of diffusion models, we invert the conventional perspective and propose a clean-to-noisy paradigm that learns to synthesize noisy depth, thereby bridging the visual sim2real gap through purely simulation-driven robotic learning. Building on this idea, we introduce RealD$^2$iff, a hierarchical coarse-to-fine diffusion framework that decomposes depth noise into global structural distortions and fine-grained local perturbations. To enable progressive learning of these components, we further develop two complementary strategies: Frequency-Guided Supervision (FGS) for global structure modeling and Discrepancy-Guided Optimization…
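To make the global/local decomposition of depth noise concrete, here is a minimal sketch in Python with NumPy. It is not the paper's FGS or its diffusion model: it only illustrates one plausible way to split a depth residual (noisy minus clean) into a low-frequency "global" part and a high-frequency "local" part using an FFT low-pass mask. The function name `split_depth_noise`, the `cutoff` parameter, and the synthetic test data are all assumptions made for illustration.

```python
import numpy as np

def split_depth_noise(clean, noisy, cutoff=0.1):
    """Split the residual (noisy - clean) into low- and high-frequency parts.

    `cutoff` is a hypothetical radial frequency threshold in cycles per pixel;
    it is an illustrative choice, not a value taken from the paper.
    """
    residual = noisy.astype(np.float64) - clean.astype(np.float64)
    h, w = residual.shape
    # Per-axis frequencies in cycles per pixel, matching the fft2 layout.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_pass = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    spectrum = np.fft.fft2(residual)
    # Low-frequency content stands in for "global structural distortion".
    global_part = np.fft.ifft2(spectrum * low_pass).real
    # Whatever remains stands in for "fine-grained local perturbation".
    local_part = residual - global_part
    return global_part, local_part

# Synthetic example: a flat clean depth map plus a smooth warp and pixel noise.
rng = np.random.default_rng(0)
clean = np.full((64, 64), 1.0)
yy, xx = np.mgrid[0:64, 0:64]
noisy = clean + 0.02 * np.sin(2 * np.pi * xx / 64) \
    + 0.005 * rng.standard_normal((64, 64))
g, l = split_depth_noise(clean, noisy)
```

By construction the two parts sum back to the residual, so the split loses no information; where the boundary between "global" and "local" falls depends entirely on the assumed `cutoff`.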
Taxonomy
Topics: Robot Manipulation and Learning · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
