Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

Wangbo Zhao; Kai Wang; Xiangxiang Chu; Fuzhao Xue; Xinchao Wang; Yang You

arXiv:2204.02547·cs.CV·April 7, 2022

TL;DR

This paper introduces a multi-modal transformer-based approach for text-based video segmentation that effectively fuses appearance, motion, and linguistic features to improve segmentation accuracy.

Contribution

It proposes a novel multi-modal video transformer and a language-guided feature fusion module, addressing the semantic gap between modalities for better segmentation.

Findings

01. Outperforms state-of-the-art methods on the A2D Sentences and J-HMDB Sentences datasets.

02. Demonstrates strong generalization ability across different datasets.

03. Effectively fuses multi-modal features for accurate video segmentation.

Abstract

Text-based video segmentation aims to segment the target object in a video based on a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation. Specifically, we propose a multi-modal video transformer, which can fuse and aggregate multi-modal and temporal features between frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features in each feature level with guidance from linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify…
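
The abstract names a language-guided fusion of appearance and motion features and a multi-modal alignment loss, but this page does not give their formulas. The following is only a minimal illustrative sketch of one plausible reading, not the authors' implementation: it assumes a per-channel sigmoid gate computed from a sentence embedding, and a cosine-distance loss between globally pooled features. The function names, tensor shapes, and the gate weight `w_gate` are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def language_guided_fusion(appearance, motion, language, w_gate):
    """Hypothetical gated fusion (an assumption, not the paper's module).

    appearance, motion: feature maps of shape (C, H, W)
    language: sentence embedding of shape (D,)
    w_gate: projection of shape (C, D) mapping language to one gate per channel
    """
    gate = sigmoid(w_gate @ language)   # shape (C,), values in (0, 1)
    gate = gate[:, None, None]          # broadcast over H and W
    # Convex combination: the sentence decides, per channel, how much
    # to trust appearance versus motion.
    return gate * appearance + (1.0 - gate) * motion


def alignment_loss(feat_a, feat_b):
    """Hypothetical alignment term: 1 - cosine similarity of pooled features."""
    a = feat_a.mean(axis=(1, 2))        # global average pool to (C,)
    b = feat_b.mean(axis=(1, 2))
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return 1.0 - cos


rng = np.random.default_rng(0)
appearance = rng.normal(size=(8, 4, 4))
motion = rng.normal(size=(8, 4, 4))
language = rng.normal(size=(16,))
w_gate = rng.normal(size=(8, 16))

fused = language_guided_fusion(appearance, motion, language, w_gate)
loss = alignment_loss(appearance, motion)
```

In this sketch the fused map always lies between the appearance and motion values at each position, and the loss approaches zero when the pooled features of the two modalities point in the same direction. The official repository listed under Code & Models is the place to check how the paper actually defines both components.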

Code & Models

Repositories

wangbo-zhao/2022cvpr-mmmmtbvs (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: ALIGN