IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

Lijuan Liu; Wenfa Li; Dongbo Zhang; Shuo Wang; Shaohui Jiao

arXiv:2508.04147 · cs.CV · August 7, 2025

TL;DR

IDC-Net is a unified diffusion-based framework that jointly generates the RGB frames and depth maps of a video sequence under precise camera control, improving geometric consistency and allowing the output to be used directly for 3D scene reconstruction.

Contribution

IDC-Net introduces a geometry-aware diffusion model, with a transformer block for fine-grained camera control, and a new camera-image-depth consistent dataset for training it.

Findings

01 Outperforms state-of-the-art methods in visual quality and geometric consistency.

02 Enables direct use of generated RGB-D sequences for 3D reconstruction.

03 Provides a new dataset with metric-aligned RGB videos, depth maps, and camera poses.

Abstract

We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables…
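
To make concrete why metric-aligned depth and camera poses matter for downstream reconstruction, here is a minimal sketch of the standard pinhole back-projection that fuses RGB-D frames into one world-space point cloud. It is not the authors' pipeline: the function names `backproject_depth` and `fuse_frames`, the intrinsics matrix `K`, and the toy poses are all hypothetical, and the sketch assumes z-depth along the optical axis and 4x4 camera-to-world pose matrices.

```python
import numpy as np


def backproject_depth(depth, K, cam_to_world):
    """Lift a metric depth map of shape (H, W) to world-space points (H*W, 3).

    Assumptions (not taken from the paper): pinhole intrinsics K (3x3),
    depth measured along the optical axis, pose given as camera-to-world (4x4).
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # Rotate and translate camera-space points into the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t


def fuse_frames(depths, K, poses):
    """Concatenate the back-projected points of every frame."""
    return np.concatenate(
        [backproject_depth(d, K, T) for d, T in zip(depths, poses)], axis=0
    )


# Toy data: two 3x4 frames of a flat surface 2 m away; the second camera
# is shifted 1 m along the world x-axis.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
depths = [np.full((3, 4), 2.0), np.full((3, 4), 2.0)]
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[0, 3] = 1.0
cloud = fuse_frames(depths, K, [pose_a, pose_b])
print(cloud.shape)
```

If depth scale or poses drift between frames, the per-frame point sets no longer land on the same surfaces after this transform, which is the kind of inconsistency the abstract's metric-aligned supervision is meant to reduce.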
