DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation

Junhu Fu; Ke Chen; Weidong Guo; Shuyu Liang; Jie Xu; Chen Ma; Kehao Wang; Shengli Lin; Zeju Li; Yuanyuan Wang; Yi Guo; Shuo Li

arXiv:2604.26232·cs.CV·April 30, 2026

TL;DR

DepthPilot introduces an interpretable colonoscopy video generation framework that aligns with physical priors and clinical features, enhancing trustworthiness and clinical utility.

Contribution

It is the first framework to incorporate explicit geometric grounding and adaptive nonlinear modeling for realistic, interpretable medical video synthesis.

Findings

01

Achieves FID scores below 15 on all benchmarks.

02

Ranks first in clinician assessments for interpretability.

03

Produces physically consistent and clinically relevant videos.
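For readers unfamiliar with the metric in finding 01: FID (Fréchet Inception Distance) compares Gaussian fits to feature embeddings of real and generated samples, and lower values indicate closer distributions. The standard definition is given below; the symbols are the conventional ones, not taken from this page, and the feature extractor and benchmarks behind the reported scores are not stated here.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
% \mu_r, \Sigma_r : mean and covariance of features from real samples
% \mu_g, \Sigma_g : mean and covariance of features from generated samples
```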

Abstract

Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires aligning generated content with physical priors and faithful clinical manifestations. To push the boundaries from mere controllability to interpretability, we propose DepthPilot, the first interpretable framework for colonoscopy video generation. This work takes a step toward trustworthy generation through two synergistic paradigms. To achieve explicit geometric grounding, DepthPilot devises a prior distribution alignment strategy that injects depth constraints into the diffusion backbone via parameter-efficient fine-tuning to ensure anatomical fidelity. To enhance intrinsic nonlinear modeling under these geometric constraints, DepthPilot employs an adaptive spline denoising module, replacing fixed linear weights with learnable spline functions to capture…
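The abstract's idea of replacing fixed linear weights with learnable spline functions can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated and does not give the spline order, knot grid, or how the module sits inside the denoiser. The code below assumes piecewise-linear (degree-1) splines on a fixed grid, in the style of Kolmogorov-Arnold-type layers, and the names `hat_basis`, `SplineEdge`, and `spline_layer` are hypothetical.

```python
from typing import List, Sequence


def hat_basis(x: float, grid: Sequence[float]) -> List[float]:
    """Degree-1 B-spline (hat) basis values at x on a strictly increasing grid.

    Assumption for this sketch: inputs outside the grid are clamped to its ends.
    """
    x = min(max(x, grid[0]), grid[-1])
    values = [0.0] * len(grid)
    for k in range(len(grid) - 1):
        left, right = grid[k], grid[k + 1]
        if left <= x <= right:
            t = (x - left) / (right - left)
            values[k] = 1.0 - t
            values[k + 1] = t
            break
    return values


class SplineEdge:
    """One learnable univariate function phi(x) = sum_k c_k * B_k(x)."""

    def __init__(self, grid: Sequence[float], coeffs: Sequence[float]) -> None:
        self.grid = list(grid)
        self.coeffs = list(coeffs)  # these would be the trainable parameters

    def __call__(self, x: float) -> float:
        return sum(c * b for c, b in zip(self.coeffs, hat_basis(x, self.grid)))


def spline_layer(x: Sequence[float], edges: Sequence[Sequence[SplineEdge]]) -> List[float]:
    """Compute y_j = sum_i phi_ji(x_i).

    A linear layer would use y_j = sum_i w_ji * x_i; here each scalar weight
    w_ji is replaced by the function edges[j][i].
    """
    return [sum(edge(xi) for edge, xi in zip(row, x)) for row in edges]


grid = [-1.0, 0.0, 1.0]
identity = SplineEdge(grid, [-1.0, 0.0, 1.0])  # phi(x) = x on [-1, 1]
absolute = SplineEdge(grid, [1.0, 0.0, 1.0])   # phi(x) = |x| on [-1, 1]
print(spline_layer([0.5, -0.5], [[identity, absolute]]))  # [1.0]
```

With coefficients `[-1, 0, 1]` an edge reproduces an ordinary linear weight of 1, while other coefficient choices (such as the absolute-value edge) give nonlinear responses that a single scalar weight cannot express. That added flexibility per connection is the general motivation for spline-parameterized layers.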
