From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
Jiagao Hu, Yuxuan Chen, Fuhao Li, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan

TL;DR
This paper introduces SVOR, a robust video object removal framework that effectively handles shadows, motion, and mask defects, achieving state-of-the-art results in real-world scenarios.
Contribution
The paper proposes three techniques to improve the stability and robustness of video object removal under imperfect conditions: Mask Union for Stable Erasure (MUSE), Denoising-Aware Segmentation (DA-Seg), and a two-stage training process.
Findings
SVOR outperforms existing methods on multiple datasets.
The framework effectively removes shadows and reflections.
It maintains temporal stability and visual consistency in challenging scenarios.
Abstract
Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware…
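The windowed union idea behind MUSE can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the NumPy representation of masks as a `(T, H, W)` array, the window size, and the zero-padding of an incomplete final window are all assumptions made for illustration. The sketch compares OR-ing every frame in a temporal window against plain strided sampling, which keeps only the first frame of each window.

```python
import numpy as np


def windowed_union_downsample(masks: np.ndarray, window: int) -> np.ndarray:
    """Downsample binary masks in time by OR-ing each window of frames.

    masks:  array of shape (T, H, W), nonzero where the target is present.
    window: number of consecutive frames merged into one output mask
            (hypothetical parameter; the paper's value is not given here).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    masks = np.asarray(masks).astype(bool)
    num_frames = masks.shape[0]
    # Assumption: pad the tail with empty masks so every window is complete.
    # Empty padding never adds masked pixels to the union.
    pad = (-num_frames) % window
    if pad:
        empty = np.zeros((pad,) + masks.shape[1:], dtype=bool)
        masks = np.concatenate([masks, empty], axis=0)
    windows = masks.reshape(-1, window, *masks.shape[1:])
    # A pixel is masked if it is masked in ANY frame of the window.
    return windows.any(axis=1)


def strided_downsample(masks: np.ndarray, window: int) -> np.ndarray:
    """Baseline for comparison: keep only the first frame of each window."""
    return np.asarray(masks).astype(bool)[::window]


# Toy clip: a one-pixel object that jumps one column per frame.
clip = np.zeros((4, 1, 4), dtype=bool)
for frame in range(4):
    clip[frame, 0, frame] = True

union_mask = windowed_union_downsample(clip, window=4)
strided_mask = strided_downsample(clip, window=4)
print("union covers", int(union_mask.sum()), "pixels")
print("strided covers", int(strided_mask.sum()), "pixels")
```

In this toy clip the union mask covers every position the object visits within the window, whereas the strided mask covers only its position in the first frame. That gap is the kind of missed removal under abrupt motion that the abstract says MUSE is designed to reduce.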
