Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Shilin Yan; Renrui Zhang; Ziyu Guo; Wenchao Chen; Wei Zhang; Hongyang Li; Yu Qiao; Hao Dong; Zhongjiang He; Peng Gao

arXiv:2305.16318 · cs.CV · December 13, 2023 · 1 citation

TL;DR

MUTR introduces a unified transformer framework for referring video object segmentation that effectively integrates multi-modal signals like text and audio with temporal information, improving segmentation accuracy.

Contribution

The paper presents the first unified transformer-based approach for multi-modal VOS, incorporating temporal relations for both low-level aggregation and high-level feature interaction.

Findings

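01  Improves J&F by 4.2% over prior state-of-the-art methods on Ref-YouTube-VOS (text reference)

02  Improves J&F by 8.7% over prior state-of-the-art methods on AVSBench (audio reference)

03  Demonstrates superior performance over state-of-the-art methods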
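
The J&F figures above combine two per-frame measures: region similarity J, the intersection over union of the predicted and ground-truth masks, and contour accuracy F, a boundary F-measure. As a rough illustration only (this is not the paper's evaluation code), the sketch below computes J for a pair of toy binary masks and averages it with an F value that is simply assumed, since boundary matching is omitted. The function names `region_similarity` and `j_and_f` and the toy masks are hypothetical.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (intersection over union) of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy masks (hypothetical): 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = region_similarity(pred, gt)
f_assumed = 0.7  # placeholder F value; a real F needs boundary matching
score = j_and_f(j, f_assumed)
```

In benchmark practice these per-frame values are averaged over all frames and objects before being reported, so a single-frame score like this one is only a building block.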

Abstract

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. Exploring the semantic alignment within modalities and the visual correspondence across frames is challenging. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. Firstly, for low-level temporal aggregation before…

Code & Models

Repositories

opengvlab/mutr
PyTorch · Official

Videos

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation · underline

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: VOS