Spectrum-guided Multi-granularity Referring Video Object Segmentation
Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian

TL;DR
This paper introduces a Spectrum-guided Multi-granularity (SgMg) approach for referring video object segmentation (R-VOS). It counters feature drift by segmenting directly on the encoded features, and extends to multi-object R-VOS for faster segmentation of several referred objects.
Contribution
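Proposes SgMg, which segments directly on the encoded features and uses visual details to refine the masks, together with Spectrum-guided Cross-modal Fusion (SCF) for global vision-language interaction in the spectral domain. Extends SgMg to multi-object R-VOS, improving speed and practicality.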
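To make "direct segmentation on encoded features" concrete, here is a minimal NumPy sketch of applying a conditional kernel to a low-resolution feature map and upsampling the result. It is an illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, the 1x1 kernel form, and the nearest-neighbour upsampling are all hypothetical, and the random arrays stand in for features and kernels that a real model would predict.

```python
import numpy as np

def segment_with_conditional_kernel(features, kernel, bias=0.0):
    """Apply a 1x1 conditional kernel to a (C, H, W) feature map.

    A 1x1 dynamic convolution is a per-pixel dot product over channels,
    so the result is an (H, W) map of mask logits.
    """
    return np.einsum("c,chw->hw", kernel, features) + bias

def upsample_nearest(logits, scale):
    """Nearest-neighbour upsampling of an (H, W) map by an integer factor."""
    return np.repeat(np.repeat(logits, scale, axis=0), scale, axis=1)

rng = np.random.default_rng(0)
# Hypothetical stand-ins: low-resolution encoded features and a kernel that a
# model would predict from the fused vision-language representation.
encoded = rng.standard_normal((8, 4, 4))
kernel = rng.standard_normal(8)

# The kernel is applied to the same features it was derived from, so there is
# no decoder between kernel prediction and segmentation.
coarse_logits = segment_with_conditional_kernel(encoded, kernel)
mask = upsample_nearest(coarse_logits, 4) > 0
```

In this sketch the coarse mask is simply enlarged; the paper instead describes using visual details to further optimize the masks, which is not modelled here.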
Findings
Achieves state-of-the-art results on four benchmarks.
Outperforms the nearest competitor by 2.8 percentage points on Ref-YouTube-VOS.
Runs about 3 times faster in multi-object R-VOS mode.
Abstract
Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features. We discovered that this causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation. This negatively affects the ability of segmentation kernels. To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion (SCF) to perform intra-frame global interactions in the spectral domain for effective multimodal representation. Finally, we extend SgMg to perform multi-object R-VOS, a new paradigm that enables simultaneous segmentation of…
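The idea of "intra-frame global interactions in the spectral domain" can be illustrated with a generic frequency-domain filter: multiplying a feature map's 2D spectrum by per-frequency weights lets every spatial position influence every other one in a single step. The NumPy sketch below is a hedged illustration of that general mechanism only. It is not the SCF module itself; the function name, shapes, and weights are assumptions, and the language conditioning that SCF performs is omitted.

```python
import numpy as np

def spectral_global_filter(x, freq_weights):
    """Filter a real (C, H, W) feature map in the frequency domain.

    freq_weights has shape (C, H, W // 2 + 1), matching the output of rfft2.
    An element-wise product in the spectrum corresponds to a circular
    convolution over the whole frame in the spatial domain.
    """
    spectrum = np.fft.rfft2(x, axes=(-2, -1))
    filtered = spectrum * freq_weights
    return np.fft.irfft2(filtered, s=x.shape[-2:], axes=(-2, -1))

rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 6, 8))  # hypothetical single-frame features

# All-ones weights leave the features unchanged.
identity = np.ones((2, 6, 8 // 2 + 1), dtype=complex)
unchanged = spectral_global_filter(feats, identity)

# Keeping only the zero-frequency term replaces each channel by its mean,
# showing that one spectral weight touches every spatial position.
dc_only = np.zeros_like(identity)
dc_only[:, 0, 0] = 1.0
smoothed = spectral_global_filter(feats, dc_only)
```

In a learned module the weights would be trainable or predicted from other inputs rather than fixed by hand as they are here.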
Taxonomy
Topics: Multimodal Machine Learning Applications · Ear and Head Tumors · Speech and Audio Processing
