Local-Global Video-Text Interactions for Temporal Grounding
Jonghwan Mun, Minsu Cho, Bohyung Han

TL;DR
This paper introduces a regression-based model that leverages local and global bi-modal interactions to improve the accuracy of text-to-video temporal grounding, significantly outperforming previous methods on benchmark datasets.
Contribution
The paper proposes a novel regression-based approach that captures multi-level local and global interactions between video and text features for better temporal grounding.
Findings
Outperforms state-of-the-art on Charades-STA and ActivityNet Captions datasets.
Incorporating both local and global context is crucial for accurate grounding.
Improves Recall@tIoU=0.5 over the previous state of the art by 7.44 and 4.61 percentage points on Charades-STA and ActivityNet Captions, respectively.
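For readers unfamiliar with the metric, Recall@tIoU=0.5 counts a query as correctly grounded when the temporal intersection-over-union between the predicted and ground-truth intervals is at least 0.5. The following is a minimal Python sketch of that computation, assuming one predicted interval per query (as a regression model would produce). The function names and the toy intervals are illustrative and are not taken from the paper or its code.

```python
from typing import List, Tuple

# An interval is (start, end) in seconds; this representation is an assumption.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Temporal IoU between a predicted and a ground-truth interval."""
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(preds: List[Interval], gts: List[Interval],
                   threshold: float = 0.5) -> float:
    """Fraction of queries whose prediction reaches the tIoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Toy example with made-up intervals (not data from the paper).
preds = [(2.0, 8.0), (0.0, 3.0)]
gts = [(3.0, 9.0), (5.0, 10.0)]
print(recall_at_tiou(preds, gts, threshold=0.5))
```

In the toy example the first prediction overlaps its target by 5 s out of a 7 s union (tIoU of about 0.71) and the second does not overlap at all, so one of the two queries counts as a hit.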
Abstract
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query. We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to important semantic entities described in the query (e.g., actors, objects, and actions), and to reflect bi-modal interactions between the linguistic features of the query and the visual features of the video at multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global during bi-modal interactions. Through in-depth ablation studies, we find that incorporating both local and global context in video and text interactions is crucial to accurate grounding. Our experiment…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
