Using Diffusion Priors for Video Amodal Segmentation
Kaihua Chen, Deva Ramanan, Tarasha Khurana

TL;DR
This paper introduces a video amodal segmentation method that uses diffusion priors from video generative models to infer the full extent of occluded objects and inpaint their hidden regions, outperforming existing approaches by up to 13%.
Contribution
The paper formulates video amodal segmentation as a conditional generation task: a video diffusion model is conditioned on an object's modal (visible) masks and pseudo-depth maps, enabling better occlusion inference and object completion in videos.
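One way to write this formulation, in notation introduced here for illustration (not taken from the paper): for a clip of T frames with modal masks M_{1:T} and pseudo-depth maps D_{1:T}, the model learns a conditional distribution over amodal masks A_{1:T}. Assuming a standard noise-prediction diffusion objective, training would minimize the loss below, where x_t is the noised target at diffusion step t and \bar\alpha_t is the cumulative noise-schedule coefficient. Whether the paper uses exactly this parameterization, or operates in a latent space, is not stated in this summary.

```latex
p_\theta\left(A_{1:T} \mid c\right), \qquad c = \left(M_{1:T}, D_{1:T}\right)

\mathcal{L}(\theta) = \mathbb{E}_{A_{1:T},\; \epsilon \sim \mathcal{N}(0, I),\; t}
\left[ \left\lVert \epsilon - \epsilon_\theta\left(x_t, t, c\right) \right\rVert_2^2 \right],
\qquad
x_t = \sqrt{\bar\alpha_t}\, A_{1:T} + \sqrt{1 - \bar\alpha_t}\, \epsilon
```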
Findings
Up to 13% improvement in amodal segmentation accuracy.
Effective handling of occlusions using temporal information.
Outperforms state-of-the-art methods on four datasets.
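Amodal segmentation accuracy is commonly reported as mask intersection-over-union (IoU); this summary does not say which metric the 13% figure refers to. The sketch below, assuming binary NumPy masks, computes per-frame IoU, its mean over a clip, and a variant restricted to occluded pixels (everything outside the visible, modal mask). The function names and the convention that two empty masks score 1.0 are hypothetical choices for illustration.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary masks; two empty masks count as a perfect match (assumed convention)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(preds, gts):
    """Average per-frame IoU over a clip of predicted and ground-truth amodal masks."""
    scores = [mask_iou(p, g) for p, g in zip(preds, gts)]
    if not scores:
        raise ValueError("need at least one frame")
    return float(np.mean(scores))

def occluded_iou(pred, gt_amodal, modal):
    """IoU computed only over pixels hidden from view, i.e. outside the modal mask."""
    hidden = ~np.asarray(modal, dtype=bool)
    return mask_iou(np.asarray(pred, dtype=bool) & hidden,
                    np.asarray(gt_amodal, dtype=bool) & hidden)
```

Scoring only the hidden pixels separates how well a method completes occluded parts from how well it copies the visible mask, which whole-mask IoU tends to reward.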
Abstract
Object permanence in humans is a fundamental cue that helps in understanding the persistence of objects, even when they are fully occluded in the scene. Present-day methods in object segmentation do not account for this amodal nature of the world and only work for segmentation of visible or modal objects. The few amodal methods that exist fall short: single-image segmentation methods cannot handle high levels of occlusion, which are better inferred using temporal information, and multi-frame methods have focused solely on segmenting rigid objects. To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, capitalizing on the foundational knowledge in video generative models. Our method is simple; we repurpose these models to condition on a sequence of modal mask frames of an object along with contextual pseudo-depth maps, to learn which object boundary may…
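The conditioning input described in the abstract can be sketched as follows, assuming each frame's modal mask and pseudo-depth map are stacked as two input channels. The helper name `build_condition`, the (T, 2, H, W) layout, and per-clip min-max depth normalization are illustrative assumptions, not the authors' implementation; how the real model ingests these signals (for example, through a latent encoder) is not specified in this summary.

```python
import numpy as np

def build_condition(modal_masks: np.ndarray, pseudo_depth: np.ndarray) -> np.ndarray:
    """Stack per-frame modal masks and pseudo-depth maps into a (T, 2, H, W) array.

    Hypothetical helper: channel order and normalization are assumptions.
    """
    if modal_masks.shape != pseudo_depth.shape or modal_masks.ndim != 3:
        raise ValueError("expected two arrays of shape (T, H, W)")
    # Binarize the visible-object masks to {0, 1}.
    masks = (modal_masks > 0).astype(np.float32)
    depth = pseudo_depth.astype(np.float32)
    lo, hi = depth.min(), depth.max()
    # Min-max normalize depth over the whole clip so frames share one scale;
    # a constant depth map becomes all zeros instead of dividing by zero.
    depth = (depth - lo) / (hi - lo) if hi > lo else np.zeros_like(depth)
    # Channel 0: modal mask, channel 1: pseudo-depth, for every frame.
    return np.stack([masks, depth], axis=1)
```

For a 4-frame clip of 8x8 masks and depth maps, `build_condition(masks, depth)` returns an array of shape (4, 2, 8, 8). Normalizing per clip rather than per frame keeps relative depth ordering consistent across time, which matters when the occluder moves.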
