Local-Global Video-Text Interactions for Temporal Grounding

Jonghwan Mun; Minsu Cho; Bohyung Han

arXiv:2004.07514 · cs.CV · April 17, 2020 · 27 cites

TL;DR

This paper introduces a regression-based model that leverages local and global bi-modal interactions to improve the accuracy of text-to-video temporal grounding, significantly outperforming previous methods on benchmark datasets.

Contribution

The paper proposes a novel regression-based approach that captures multi-level local and global interactions between video and text features for better temporal grounding.

Findings

01. Outperforms the state of the art on the Charades-STA and ActivityNet Captions datasets.

02. Incorporating both local and global context is crucial for accurate grounding.

03. The margins over prior methods are 7.44 and 4.61 percentage points at Recall@tIoU=0.5 on Charades-STA and ActivityNet Captions, respectively.
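The Recall@tIoU=0.5 metric cited in the findings can be illustrated with a minimal sketch. It assumes one predicted interval per query (top-1) and the usual definition of temporal IoU as intersection length over union length; the function names `temporal_iou` and `recall_at_tiou` are hypothetical and are not taken from the paper's repository.

```python
from typing import List, Tuple

# A time interval in seconds, as (start, end). Illustrative type alias.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two 1-D time intervals."""
    # Overlap length is zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(
    preds: List[Interval], gts: List[Interval], threshold: float = 0.5
) -> float:
    """Fraction of queries whose predicted interval reaches the tIoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (0.0, 5.0)]
    ground_truth = [(2.0, 10.0), (6.0, 10.0)]
    # First pair overlaps for 8 s out of a 10 s union; second pair is disjoint.
    print(recall_at_tiou(predictions, ground_truth, threshold=0.5))
```

Under this reading, a gain of 7.44 points means that 7.44% more of the test queries receive a predicted interval with tIoU of at least 0.5 against the annotated interval.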

Abstract

This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video that is semantically relevant to a text query. We tackle this problem with a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to important semantic entities described in the query (e.g., actors, objects, and actions), and to reflect bi-modal interactions between the linguistic features of the query and the visual features of the video at multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global during bi-modal interactions. Through in-depth ablation studies, we find that incorporating both local and global context in video and text interactions is crucial to accurate grounding. Our experiment…

Code & Models

Repositories

JonghwanMun/LGI4temporalgrounding (PyTorch, official)

Videos

Local-Global Video-Text Interactions for Temporal Grounding · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization