Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, and Hao Zhao

arXiv:2512.23705·cs.CV·December 30, 2025

TL;DR

This paper leverages pre-trained video diffusion models to accurately estimate depth and normals of transparent objects in videos, overcoming traditional perception challenges through a novel synthetic dataset and a lightweight adaptation approach.

Contribution

It introduces TransPhy3D, a synthetic video dataset of transparent and reflective scenes, and a lightweight LoRA-based method that adapts a pre-trained video diffusion model to temporally consistent depth and normal estimation.

Findings

01. Achieves state-of-the-art zero-shot performance on transparency benchmarks.
02. Improves temporal consistency and accuracy over existing methods.
03. Enhances robotic grasping success on transparent and reflective objects.

Abstract

Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA…
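The abstract is cut off right after it mentions LoRA, so the paper's actual adapter configuration is not visible here. As a rough illustration of the general low-rank adaptation idea only, the sketch below wraps one frozen linear weight with a trainable low-rank update. The class name `LoRALinear`, the `rank` and `alpha` parameters, and the toy 4x4 weight are all hypothetical and are not taken from the paper.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of generic LoRA, not the paper's architecture.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weight, kept frozen
        self.scale = alpha / rank
        # A gets small random values and B starts at zero, so the adapted
        # layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to an input vector x."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count adapter parameters (entries of A and B only)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4x4 "pre-trained" weight, chosen arbitrarily for the demo.
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 3.0],
     [1.0, 1.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 1.0]]
layer = LoRALinear(W, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
y = layer.forward(x)
```

With `rank=1`, the adapter in this toy layer has 8 trainable values against 16 in the full weight. That gap widens quickly for the large projection matrices in a diffusion backbone, which is the usual motivation for calling LoRA lightweight.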

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Image Enhancement Techniques