Weakly-Supervised Referring Video Object Segmentation through Text Supervision

Miaojing Shi; Jun Huang; Zijie Yue; Hanli Wang

arXiv:2604.17797 · cs.CV · April 22, 2026

TL;DR

This paper introduces WSRVOS, a weakly-supervised method for referring video object segmentation that is trained with text expressions only and uses a multimodal large language model to augment the referring expressions with positive and negative variants.

Contribution

The paper proposes a novel weakly-supervised RVOS approach utilizing text expressions, multimodal feature interaction, and pseudo-mask generation for training.

Findings

1. Outperforms existing weakly-supervised methods on multiple datasets.
2. Effectively generates high-quality pseudo-masks from text supervision.
3. Achieves competitive results compared to fully-supervised approaches.

Abstract

Referring video object segmentation (RVOS) aims to segment the target instance in a video that is referred to by a text expression. Conventional approaches mostly rely on supervised learning, which requires expensive pixel-level mask annotations. To address this, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are, however, still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, that trains the model with only text expressions. Given an input video and its referring expression, we first design a contrastive referring expression augmentation scheme that leverages the captioning capabilities of a multimodal large language model to generate both positive and negative expressions. We extract visual and linguistic features from the input video and generated expressions, then perform…
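To make the idea of positive and negative expressions concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive objective. It is not the loss used in WSRVOS: the abstract above is truncated before the training objective is described, so the function names (`cosine`, `contrastive_expression_loss`), the pooled feature vectors, and the `temperature` value are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_expression_loss(video_feat, pos_feats, neg_feats, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's objective).

    video_feat: pooled visual feature of the video (hypothetical).
    pos_feats:  features of expressions that describe the target.
    neg_feats:  features of expressions that do not describe the target.
    """
    pos_logits = [cosine(video_feat, p) / temperature for p in pos_feats]
    neg_logits = [cosine(video_feat, n) / temperature for n in neg_feats]
    all_logits = pos_logits + neg_logits
    # Numerically stable log-sum-exp over all candidate expressions.
    m = max(all_logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in all_logits))
    # Average negative log-likelihood of the positive expressions.
    return sum(log_denominator - x for x in pos_logits) / len(pos_logits)


# Toy 2-D features, made up purely for illustration.
video = [1.0, 0.0]
positives = [[0.9, 0.1]]
negatives = [[0.0, 1.0], [-1.0, 0.2]]

loss_aligned = contrastive_expression_loss(video, positives, negatives)
# Treating a mismatched expression as the positive should give a larger loss.
loss_swapped = contrastive_expression_loss(video, negatives[:1], positives + negatives[1:])
```

In this toy setup the loss is small when the video feature is closest to the positive expression and grows when a mismatched expression is treated as the positive, which is the behaviour a contrastive scheme of this kind relies on.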

Code & Models

Repositories

viscom-tongji/WSRVOS (GitHub)
