Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

Ruixin Zhang; Jiaqing Fan; Yifan Liao; Qian Qiao; Fanzhang Li

arXiv:2508.13584 · cs.CV · August 20, 2025

TL;DR

This paper introduces an RVOS model that combines a temporal-conditional segmentation head, features from a text-to-video diffusion model used without noise prediction, and a temporal context mask refinement module, reporting state-of-the-art results on four RVOS benchmarks.

Contribution

The paper argues that recent RVOS methods emphasize feature extraction and temporal modeling while neglecting segmentation head design. It proposes an approach that pairs a temporal-conditional segmentation head with a noise-free text-to-video diffusion feature extractor and a mask refinement module.

Findings

01. Achieves state-of-the-art performance on four RVOS benchmarks.

02. Improves boundary segmentation accuracy.

03. Simplifies the model by removing the noise prediction module, so that the randomness of noise no longer degrades segmentation accuracy.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling while relatively neglecting the design of the segmentation head, where considerable room for improvement remains. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which integrates existing segmentation methods to enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to prevent the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to…
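The "noise-free" idea in the abstract can be illustrated with a toy sketch. It assumes one reading of the abstract: that features are extracted from clean latents rather than from latents perturbed by sampled noise. The names `toy_backbone`, `extract_features`, and `add_noise`, as well as the tensor sizes, are hypothetical and are not taken from the paper; the linear map stands in for a real text-to-video diffusion network.

```python
import numpy as np

def add_noise(latents, alpha_bar_t, rng):
    """Standard diffusion forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(latents.shape)
    return np.sqrt(alpha_bar_t) * latents + np.sqrt(1.0 - alpha_bar_t) * eps

def toy_backbone(latents, weight):
    """Hypothetical stand-in for the diffusion network: a fixed linear map plus tanh."""
    return np.tanh(latents @ weight)

def extract_features(latents, weight, alpha_bar_t=None, rng=None):
    """Noise-free when alpha_bar_t is None; otherwise noise the latents first."""
    if alpha_bar_t is not None:
        latents = add_noise(latents, alpha_bar_t, rng)
    return toy_backbone(latents, weight)

# Toy sizes (assumed): 4 frames, 8 latent tokens per frame, 16 channels.
latents = np.random.default_rng(0).standard_normal((4, 8, 16))
weight = np.random.default_rng(1).standard_normal((16, 32))

# Noise-free extraction is deterministic: two runs give identical features.
clean_a = extract_features(latents, weight)
clean_b = extract_features(latents, weight)

# Noised extraction depends on the sampled noise, so features vary run to run.
noisy_a = extract_features(latents, weight, 0.9, np.random.default_rng(2))
noisy_b = extract_features(latents, weight, 0.9, np.random.default_rng(3))
```

In this sketch the clean path returns the same features for the same video every time, while the noised path does not. That run-to-run variation is the kind of randomness the abstract says can degrade segmentation accuracy.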
