End-to-end Multi-modal Video Temporal Grounding

Yi-Wen Chen; Yi-Hsuan Tsai; Ming-Hsuan Yang

arXiv:2107.05624 · cs.CV · November 1, 2021 · 21 cites

TL;DR

This paper introduces a multi-modal framework for text-guided video temporal grounding that combines RGB, optical flow, and depth inputs through transformer-based fusion and self-supervised learning, and it outperforms state-of-the-art methods on Charades-STA and ActivityNet Captions.

Contribution

The paper presents a multi-modal approach that pairs dynamic transformer-based fusion across modalities with intra-modal self-supervised learning to improve video temporal grounding.

Findings

01 Outperforms state-of-the-art methods on the Charades-STA and ActivityNet Captions datasets.

02 Effectively integrates RGB, optical flow, and depth modalities for better event localization.

03 Enhances feature representations through intra-modal self-supervised learning.

Abstract

We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Unlike most existing methods, which only consider RGB images as visual features, we propose a multi-modal framework to extract complementary information from videos. Specifically, we adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. While RGB images provide abundant visual cues of certain events, the performance may be affected by background clutter. Therefore, we use optical flow to focus on large motion and depth maps to infer the scene configuration when the action is related to objects recognizable by their shapes. To integrate the three modalities more effectively and enable inter-modal learning, we design a dynamic fusion scheme with transformers to model the…
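
As a rough illustration of what weighting three modalities against a text query can look like, the sketch below scores each modality feature by its dot product with a query vector and mixes the features with softmax weights. This is a simplified stand-in and not the transformer-based dynamic fusion the abstract describes; the function names, the feature vectors, and the query vector are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, text_query):
    """Mix per-modality feature vectors using query-dependent weights.

    features:   dict mapping a modality name to a feature vector
    text_query: vector with the same dimension as each feature vector
    (Hypothetical helper; a plain attention-style weighting, not the
    paper's transformer.)
    """
    names = sorted(features)
    # Relevance of each modality to the text query.
    scores = [dot(features[name], text_query) for name in names]
    weights = softmax(scores)
    dim = len(text_query)
    # Weighted sum of the modality features, one dimension at a time.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy 3-dimensional features for one video clip (made-up numbers).
features = {
    "rgb": [0.2, 0.9, 0.1],
    "flow": [0.8, 0.1, 0.3],
    "depth": [0.4, 0.4, 0.4],
}
text_query = [1.0, 0.0, 0.5]

fused, weights = fuse_modalities(features, text_query)
print(weights)
print(fused)
```

With these toy numbers the optical-flow feature aligns best with the query, so it receives the largest weight. The weights change with the query, which is the sense in which such a fusion is query-dependent.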

Code & Models

Repositories

wenz116/drft · PyTorch · Official

Videos

End-to-end Multi-modal Video Temporal Grounding · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization