Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
Ruixin Zhang, Jiaqing Fan, Yifan Liao, Qian Qiao, Fanzhang Li

TL;DR
This paper introduces a Referring Video Object Segmentation (RVOS) model that improves segmentation accuracy by combining a temporal-conditional segmentation head, a noise-free text-to-video diffusion model for feature extraction, and a temporal context mask refinement module. It reports state-of-the-art results on four RVOS benchmarks.
Contribution
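The paper presents a new RVOS approach that combines a temporal-conditional segmentation head with a text-to-video diffusion model and a mask refinement module. It targets a gap the authors identify in prior work: heavy emphasis on feature extraction and temporal modeling, with comparatively little attention to segmentation head design.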
Findings
Achieves state-of-the-art performance on four RVOS benchmarks.
Effectively improves boundary segmentation accuracy.
Simplifies the model by removing the noise prediction module, so that noise randomness no longer degrades segmentation accuracy.
Abstract
Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to prevent the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to…
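The abstract's point about noise randomness can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract does not give their architecture. It assumes that "noise-free" means the feature extractor sees the clean latent rather than a noised one, and it contrasts that with the standard diffusion forward step x_t = sqrt(ᾱ)·x_0 + sqrt(1−ᾱ)·ε. The names `toy_backbone`, `extract_features`, and `alpha_bar` are hypothetical and chosen only for illustration.

```python
import math
import random


def forward_noise(latent, alpha_bar, rng):
    """Standard diffusion forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    return [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for x in latent
    ]


def toy_backbone(latent):
    """Hypothetical stand-in for a frozen feature extractor (deterministic)."""
    return [math.tanh(2.0 * x) for x in latent]


def extract_features(latent, alpha_bar=None, rng=None):
    """Noise-free path when alpha_bar is None; otherwise noise the latent first."""
    if alpha_bar is not None:
        latent = forward_noise(latent, alpha_bar, rng)
    return toy_backbone(latent)


latent = [0.1, -0.4, 0.7]  # toy latent for one frame (assumed values)

# Noised inputs: the features depend on the random draw.
noisy_a = extract_features(latent, alpha_bar=0.9, rng=random.Random(0))
noisy_b = extract_features(latent, alpha_bar=0.9, rng=random.Random(1))

# Noise-free inputs: the features are identical across calls.
clean_a = extract_features(latent)
clean_b = extract_features(latent)
```

In this sketch the two noised runs produce different features for the same frame, while the noise-free runs agree exactly. That run-to-run variation is the kind of randomness the abstract says can degrade segmentation accuracy.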
