TL;DR
This paper introduces WSRVOS, a weakly-supervised method for referring video object segmentation that is trained with text expressions only and uses a multimodal large language model to augment the referring expressions.
Contribution
The paper proposes a weakly-supervised RVOS approach that trains from text expressions alone, combining multimodal feature interaction with pseudo-mask generation.
Findings
Outperforms existing weakly-supervised methods on multiple datasets.
Effectively generates high-quality pseudo-masks from text supervision.
Achieves competitive results compared to fully-supervised approaches.
Abstract
Referring video object segmentation (RVOS) aims to segment the target instance in a video referred to by a text expression. Conventional approaches are mostly fully supervised and require expensive pixel-level mask annotations. To address this cost, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are, however, still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the model with only text expressions. Given an input video and the referring expression, we first design a contrastive referring expression augmentation scheme that leverages the captioning capabilities of a multimodal large language model to generate both positive and negative expressions. We extract visual and linguistic features from the input video and generated expressions, then perform…
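The abstract excerpt is cut off before it says how the generated positive and negative expressions are used during training. As a rough illustration only, the sketch below shows one common way such pairs can supervise a model without masks: an InfoNCE-style contrastive loss that pulls a video feature toward positive expression features and pushes it away from negative ones. The function names, the cosine-similarity scoring, and the `temperature` value are all assumptions made for this example, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(video_feat, pos_feats, neg_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's stated loss).

    video_feat: pooled visual feature for the video (hypothetical).
    pos_feats:  features of expressions that describe the target.
    neg_feats:  features of expressions that do not describe the target.
    """
    pos = [math.exp(cosine(video_feat, p) / temperature) for p in pos_feats]
    neg = [math.exp(cosine(video_feat, n) / temperature) for n in neg_feats]
    denom = sum(pos) + sum(neg)
    # Average the negative log-probability assigned to each positive expression.
    return -sum(math.log(p / denom) for p in pos) / len(pos)

# Toy 2-D features: the video aligns with the positive expression.
video = [1.0, 0.0]
aligned = contrastive_loss(video, pos_feats=[[1.0, 0.0]], neg_feats=[[0.0, 1.0]])
swapped = contrastive_loss(video, pos_feats=[[0.0, 1.0]], neg_feats=[[1.0, 0.0]])
```

In this toy setup the loss is small when the video feature matches the positive expression and large when the roles are swapped, which is the behavior a text-only training signal of this kind would rely on.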
