Temporal Collection and Distribution for Referring Video Object Segmentation
Jiajin Tang, Ge Zheng, Sibei Yang

TL;DR
This paper introduces a novel temporal collection-distribution mechanism for referring video object segmentation, improving the alignment of language, motion, and object segmentation across frames.
Contribution
It proposes a new temporal collection-distribution approach that enhances cross-modal reasoning and object motion modeling in referring video object segmentation.
Findings
Reported to outperform state-of-the-art methods on all benchmarks evaluated
Effectively captures object motions and spatial-temporal relationships
Improves global referent understanding and frame-level segmentation
Abstract
Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression. It requires aligning the natural language expression with the objects' motions and their dynamic associations at the global video level but segmenting objects at the frame level. To achieve this goal, we propose to simultaneously maintain a global referent token and a sequence of object queries, where the former is responsible for capturing video-level referent according to the language expression, while the latter serves to better locate and segment objects with each frame. Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interacting between the global referent token and object queries. Specifically, the temporal collection…
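The interaction the abstract describes, one global referent token exchanging information with per-frame object queries, can be illustrated with a toy sketch. The abstract is truncated before it specifies the actual modules, so everything below is an assumption made for illustration: the function names `collect` and `distribute`, the use of softmax attention pooling for collection, and the sigmoid-gated residual update for distribution are hypothetical stand-ins, not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def collect(referent, queries):
    """Collection (assumed form): the referent token attends over the
    object queries of every frame and adds the pooled result to itself."""
    flat = [q for frame in queries for q in frame]
    scale = math.sqrt(len(referent))
    weights = softmax([dot(referent, q) / scale for q in flat])
    pooled = [sum(w * q[d] for w, q in zip(weights, flat))
              for d in range(len(referent))]
    return [r + p for r, p in zip(referent, pooled)]

def distribute(referent, queries):
    """Distribution (assumed form): the referent token is broadcast back
    to each object query, gated by query-referent similarity."""
    scale = math.sqrt(len(referent))
    updated = []
    for frame in queries:
        new_frame = []
        for q in frame:
            gate = 1.0 / (1.0 + math.exp(-dot(q, referent) / scale))
            new_frame.append([qd + gate * rd for qd, rd in zip(q, referent)])
        updated.append(new_frame)
    return updated

# Toy input: 2 frames, 2 object queries per frame, 4-dimensional features.
referent = [0.5, -0.5, 0.0, 1.0]
queries = [
    [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.0]],
    [[0.9, 0.1, 0.0, 0.6], [0.0, 0.0, 1.0, 0.0]],
]

updated_referent = collect(referent, queries)           # video-level update
updated_queries = distribute(updated_referent, queries)  # frame-level update
```

In this sketch the video-level token is updated once from all frames and then fed back to every frame, which mirrors the global-versus-frame-level split in the abstract; a real model would presumably use learned projections and repeat the exchange over several layers.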
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
