End-to-end Multi-modal Video Temporal Grounding
Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

TL;DR
This paper introduces a multi-modal framework for text-guided video temporal grounding that combines RGB, optical flow, and depth data through transformer-based fusion and self-supervised learning, outperforming state-of-the-art methods on Charades-STA and ActivityNet Captions.
Contribution
It presents a novel multi-modal approach with dynamic transformer fusion and intra-modal self-supervised learning for improved video temporal grounding.
Findings
Outperforms state-of-the-art methods on Charades-STA and ActivityNet Captions datasets.
Effectively integrates RGB, optical flow, and depth modalities for better event localization.
Enhances feature representations through intra-modal self-supervised learning.
Abstract
We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Different from most existing methods that only consider RGB images as visual features, we propose a multi-modal framework to extract complementary information from videos. Specifically, we adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. While RGB images provide abundant visual cues of certain events, the performance may be affected by background clutter. Therefore, we use optical flow to focus on large motion and depth maps to infer the scene configuration when the action is related to objects recognizable with their shapes. To integrate the three modalities more effectively and enable inter-modal learning, we design a dynamic fusion scheme with transformers to model the…
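As a rough illustration of what text-conditioned fusion of the three modalities could look like, the sketch below weights RGB, optical flow, and depth features at each time step by their similarity to a text query vector. This is a single attention-style weighting step, not the paper's transformer-based fusion, which the truncated abstract does not fully specify. The function names (`fuse_step`, `fuse_video`), the feature shapes, and the toy values are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_step(modal_feats, text_feat):
    """Fuse the modality features of one time step (hypothetical helper).

    modal_feats: list of feature vectors, one per modality, each of length d.
    text_feat:   query vector of length d.
    Returns (fused_vector, modality_weights).
    """
    d = len(text_feat)
    # Scaled dot-product score between each modality and the text query.
    scores = [
        sum(f * t for f, t in zip(feat, text_feat)) / math.sqrt(d)
        for feat in modal_feats
    ]
    weights = softmax(scores)
    # Weighted sum of the modality features.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, modal_feats))
        for i in range(d)
    ]
    return fused, weights


def fuse_video(rgb, flow, depth, text_feat):
    """Apply fuse_step independently at every time step."""
    return [fuse_step([r, f, dp], text_feat) for r, f, dp in zip(rgb, flow, depth)]


# Toy input: 2 time steps, 2-dimensional features (made-up values).
rgb = [[1.0, 0.0], [0.5, 0.5]]
flow = [[0.0, 1.0], [0.2, 0.8]]
depth = [[0.3, 0.3], [0.9, 0.1]]
text = [1.0, 0.0]

results = fuse_video(rgb, flow, depth, text)
```

Because the weights depend on both the time step and the query, a modality can dominate in one segment and be down-weighted in another, which is the intuition behind calling such a scheme "dynamic".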
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
