IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control
Lijuan Liu, Wenfa Li, Dongbo Zhang, Shuo Wang, Shaohui Jiao

TL;DR
IDC-Net is a unified diffusion-based framework that jointly generates the RGB frames and depth maps of an RGB-D video sequence under precise camera control, improving geometric consistency and enabling direct use in 3D scene reconstruction.
Contribution
IDC-Net introduces a geometry-aware diffusion model with a transformer block for fine-grained camera control, along with a new camera-image-depth consistent dataset for training it.
Findings
Outperforms state-of-the-art methods in visual quality and geometric consistency.
Enables direct use of generated sequences for 3D reconstruction.
Provides a new dataset with metric-aligned RGB, depth, and camera poses.
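To make the reconstruction claim concrete, here is a minimal sketch of one common way metric depth plus a known camera pose can be lifted into a world-space point cloud: standard pinhole unprojection. It is an illustration only, not the authors' pipeline. The function name `unproject_depth`, the intrinsics parameters `fx`, `fy`, `cx`, `cy`, and the row-major 4x4 `cam_to_world` convention are all assumptions introduced here.

```python
def unproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """Back-project a metric depth map into world-space 3D points.

    Assumptions (hypothetical, not taken from the paper):
      depth        : list of rows; depth[v][u] is distance along the camera z-axis
      fx, fy       : focal lengths in pixels
      cx, cy       : principal point in pixels
      cam_to_world : 4x4 row-major camera-to-world matrix (list of lists)
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Skip pixels with missing or invalid depth.
                continue
            # Pinhole model: pixel (u, v) at depth z maps to camera-space (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            cam = (x, y, z, 1.0)
            # Apply the rigid transform; only the first three rows are needed.
            world = tuple(
                sum(cam_to_world[r][c] * cam[c] for c in range(4))
                for r in range(3)
            )
            points.append(world)
    return points
```

Because the depth is metric and the poses share the same scale, points unprojected from different frames land in one common coordinate system and can be concatenated without per-frame rescaling. That is the practical benefit of metric alignment between RGB, depth, and camera poses.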
Abstract
We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables…
