Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang Li, Yu Qiao, Hao Dong, Zhongjiang He, Peng Gao

TL;DR
MUTR introduces a unified transformer framework for referring video object segmentation that effectively integrates multi-modal signals like text and audio with temporal information, improving segmentation accuracy.
Contribution
The paper presents the first unified transformer-based approach for multi-modal VOS, incorporating temporal relations for both low-level aggregation and high-level feature interaction.
Findings
Achieves a +4.2% J&F improvement over state-of-the-art methods on Ref-YouTube-VOS (text reference)
Achieves a +8.7% J&F improvement over state-of-the-art methods on AVSBench (audio reference)
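The summary reports J&F without defining it. As a minimal sketch, assuming the usual DAVIS-style convention (J is the region similarity, i.e. mask intersection-over-union, F is a contour accuracy score, and J&F is the mean of the two), the metric can be illustrated as below. The function names are hypothetical, and F is taken as a precomputed input because boundary matching is beyond the scope of this sketch; this is not the authors' evaluation code.

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores, f_scores):
    """J&F: average of the mean J and the mean F.

    f_scores are assumed to be contour-accuracy values computed elsewhere.
    """
    j_mean = sum(j_scores) / len(j_scores)
    f_mean = sum(f_scores) / len(f_scores)
    return (j_mean + f_mean) / 2


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [1, 0]]
    print(region_similarity(pred, gt))  # 1 overlapping pixel out of 3 in the union
    print(j_and_f([0.5, 0.7], [0.6, 0.8]))
```

Under this reading, a reported gain in J&F reflects the combined change in mask overlap and boundary quality rather than either one alone.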
Abstract
Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging because it requires exploring the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities, and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. Firstly, for low-level temporal aggregation before…
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: VOS
