Referring Video Object Segmentation with Cross-Modality Proxy Queries

Baoli Sun; Xinzhu Ma; Ning Wang; Zhihui Wang; Zhiyong Wang

arXiv:2511.21139·cs.CV·November 27, 2025

TL;DR

This paper introduces ProxyFormer, a novel RVOS model that uses proxy queries to improve cross-modality alignment and inter-frame dependency modeling, leading to more accurate and coherent video object segmentation.

Contribution

ProxyFormer employs proxy queries to dynamically integrate visual and textual semantics across multiple stages, enhancing target tracking and inter-frame dependency modeling in RVOS.

Findings

01 Outperforms state-of-the-art methods on four RVOS benchmarks.

02 Effectively models inter-frame dependencies and semantic alignment.

03 Reduces computational costs through decoupled cross-modality interactions.

Abstract

Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object with a query-response mechanism built on a transformer architecture. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Video Analysis and Summarization