End-to-End Referring Video Object Segmentation with Multimodal Transformers

Adam Botach; Evgenii Zheltonozhskii; Chaim Baskin

arXiv:2111.14821 · cs.CV · April 5, 2022

TL;DR

This paper introduces MTTR (Multimodal Tracking Transformer), a simple end-to-end multimodal Transformer model for referring video object segmentation. It outperforms previous methods without relying on the sophisticated pipelines that earlier approaches use.

Contribution

The paper presents a novel Transformer-based framework for RVOS that models the task as sequence prediction, eliminating the need for complex pipelines and post-processing.

Findings

01. Significantly outperforms previous methods on standard benchmarks.

02. Achieves gains of +5.7 and +5.0 mAP on the A2D-Sentences and JHMDB-Sentences datasets, respectively.

03. Processes 76 frames per second, demonstrating efficiency.

Abstract

The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional…
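As a rough illustration of the sequence-prediction framing, the sketch below treats a model's output as several candidate mask sequences (one mask per frame for each candidate) and picks the candidate that best matches the text. The visible part of the abstract does not spell out this selection step, so the per-frame reference scores, the averaging rule, and the name `select_referred_sequence` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence

# A binary mask for one frame, stored as rows of 0/1 values (H x W).
Mask = List[List[int]]


def select_referred_sequence(
    mask_sequences: Sequence[Sequence[Mask]],
    reference_scores: Sequence[Sequence[float]],
) -> int:
    """Return the index of the candidate sequence with the highest mean score.

    Hypothetical helper: `mask_sequences[i][t]` is candidate i's mask in
    frame t, and `reference_scores[i][t]` is an assumed per-frame score for
    how well candidate i matches the text query.
    """
    if len(mask_sequences) != len(reference_scores):
        raise ValueError("need one score list per candidate sequence")
    if not reference_scores:
        raise ValueError("need at least one candidate sequence")

    best_index = -1
    best_mean = float("-inf")
    for index, scores in enumerate(reference_scores):
        if len(scores) != len(mask_sequences[index]):
            raise ValueError("need one score per frame")
        # Average over frames so a single candidate is chosen for the
        # whole clip rather than frame by frame.
        mean_score = sum(scores) / len(scores)
        if mean_score > best_mean:
            best_index, best_mean = index, mean_score
    return best_index


# Two candidates over three 1x2 frames; the second scores higher throughout.
candidates = [
    [[[1, 0]], [[1, 0]], [[1, 0]]],
    [[[0, 1]], [[0, 1]], [[0, 1]]],
]
scores = [[0.2, 0.1, 0.3], [0.9, 0.8, 0.7]]
chosen = select_referred_sequence(candidates, scores)
referred_masks = candidates[chosen]
```

Selecting one candidate for the whole clip, rather than per frame, keeps the predicted object identity consistent across frames without a separate tracking or post-processing stage in this toy setup.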

Code & Models

Videos

This New AI Can Find Your Dog In A Video! 🐩 · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Advanced Neural Network Applications

Methods: Linear Layer · Adam · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Softmax · Transformer · Detection Transformer