End-to-End Referring Video Object Segmentation with Multimodal Transformers
Adam Botach, Evgenii Zheltonozhskii, Chaim Baskin

TL;DR
This paper introduces MTTR, a simple end-to-end multimodal Transformer for referring video object segmentation (RVOS) that outperforms previous methods while avoiding the sophisticated multi-stage pipelines that earlier approaches rely on.
Contribution
The paper presents a novel Transformer-based framework for RVOS that models the task as sequence prediction, eliminating the need for complex pipelines and post-processing.
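To make the sequence-prediction framing concrete, here is a minimal sketch of how per-frame visual tokens and text tokens could be joined into sequences for a single Transformer to consume. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_multimodal_sequences`, the per-frame concatenation layout, and the toy shapes are all hypothetical, and plain Python lists stand in for feature tensors.

```python
from typing import List

Vector = List[float]


def build_multimodal_sequences(
    frame_features: List[List[Vector]],
    text_features: List[Vector],
) -> List[List[Vector]]:
    """Return one joint sequence per frame: visual tokens followed by text tokens.

    Assumption (not stated in the summary): each frame's flattened visual
    tokens are concatenated with the same text tokens, so every frame yields
    a sequence of length (visual tokens + text tokens).
    """
    if not frame_features or not text_features:
        raise ValueError("need at least one frame and one text token")

    dim = len(text_features[0])
    sequences = []
    for visual_tokens in frame_features:
        joint = list(visual_tokens) + list(text_features)
        # Both modalities must already be projected to a shared dimension.
        if any(len(token) != dim for token in joint):
            raise ValueError("all tokens must share the same feature dimension")
        sequences.append(joint)
    return sequences


# Toy example: 3 frames, 4 visual tokens per frame, 2 text tokens, dimension 8.
num_frames, tokens_per_frame, num_text_tokens, dim = 3, 4, 2, 8
frames = [
    [[float(t)] * dim for _ in range(tokens_per_frame)]
    for t in range(num_frames)
]
text = [[-1.0] * dim for _ in range(num_text_tokens)]

sequences = build_multimodal_sequences(frames, text)
print(len(sequences), len(sequences[0]))
```

In a real model the joint sequences would be passed to a Transformer encoder, and the placeholder lists would be outputs of a video backbone and a text encoder projected to a common width.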
Findings
Significantly outperforms previous methods on standard benchmarks.
Achieves mAP gains of +5.7 on A2D-Sentences and +5.0 on JHMDB-Sentences.
Processes 76 frames per second, demonstrating efficiency.
Abstract
The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional…
Videos
This New AI Can Find Your Dog In A Video! 🐩 (YouTube)
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Advanced Neural Network Applications
Methods: Linear Layer · Adam · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Softmax · Transformer · Detection Transformer
