Referring Video Object Segmentation with Cross-Modality Proxy Queries
Baoli Sun, Xinzhu Ma, Ning Wang, Zhihui Wang, Zhiyong Wang

TL;DR
This paper introduces ProxyFormer, a novel RVOS model that uses proxy queries to improve cross-modality alignment and inter-frame dependency modeling, leading to more accurate and coherent video object segmentation.
Contribution
ProxyFormer employs proxy queries to dynamically integrate visual and textual semantics across multiple stages, enhancing target tracking and inter-frame dependency modeling in RVOS.
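The idea of queries that absorb textual and visual semantics and are carried across frames can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' architecture: the function names (`cross_attend`, `propagate_proxy_queries`), the single-head attention without learned projections, the residual updates, and all tensor sizes are hypothetical choices made only to show the data flow.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, memory):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned projections; keys and values are both `memory`.
    Returns the attended context (Q, d) and the attention weights (Q, M).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ memory.T / np.sqrt(d), axis=-1)
    return weights @ memory, weights

def propagate_proxy_queries(proxy, text_feats, frame_feats):
    """Hypothetical loop: update queries with text, then with each frame.

    The queries are reused from one frame to the next, so the state for
    frame t depends on frames 0..t-1 (a simple inter-frame dependency).
    Text and visual interactions are kept as two separate attention steps.
    """
    outputs = []
    for visual in frame_feats:                      # visual: (N, d)
        text_ctx, _ = cross_attend(proxy, text_feats)
        proxy = proxy + text_ctx                    # residual text update
        vis_ctx, _ = cross_attend(proxy, visual)
        proxy = proxy + vis_ctx                     # residual visual update
        outputs.append(proxy.copy())
    return np.stack(outputs)                        # (T, Q, d)

# Toy sizes (all assumed): 3 frames, 16 visual tokens, 5 words, 4 queries.
rng = np.random.default_rng(0)
T, N, L, Q, d = 3, 16, 5, 4, 8
frame_feats = rng.normal(size=(T, N, d))
text_feats = rng.normal(size=(L, d))
proxy0 = rng.normal(size=(Q, d))

per_frame = propagate_proxy_queries(proxy0, text_feats, frame_feats)
print(per_frame.shape)  # (3, 4, 8)
```

In a real model the per-frame query states would typically be decoded into masks by a learned head; that step is omitted here because the summary does not describe it.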
Findings
Outperforms state-of-the-art methods on four RVOS benchmarks.
Effectively models inter-frame dependencies and semantic alignment.
Reduces computational costs through decoupled cross-modality interactions.
Abstract
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon a transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Video Analysis and Summarization
