TL;DR
This paper introduces MIGCN, a graph convolutional network that models intra- and inter-modal relations between video clips and query words for temporal language localization in videos, and reports gains in both accuracy and efficiency over prior methods.
Contribution
The work proposes a multi-modal interaction graph convolutional network with adaptive context-aware localization for more accurate video moment detection.
Findings
Outperforms existing methods on Charades-STA and ActivityNet datasets.
Effectively captures complex intra- and inter-modal relations.
Localizes moments more efficiently than the compared methods.
Abstract
This paper focuses on tackling the problem of temporal language localization in videos, which aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video. However, it is non-trivial since it requires not only the comprehensive understanding of the video and sentence query, but also the accurate semantic correspondence capture between them. Existing efforts are mainly centered on exploring the sequential relation among video clips and query words to reason the video and sentence query, neglecting the other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among the query words). Towards this end, in this work, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly explores the complex intra-modal relations and inter-modal interactions residing in the video…
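To make the idea of a joint graph over clips and words concrete, the following is a minimal sketch of one graph-convolution step, not the authors' architecture. It assumes clip and word features share one dimension, that video edges come from temporal adjacency plus a cosine-similarity threshold, that query edges are supplied as a hypothetical list of dependency pairs, that every clip is linked to every word, and that aggregation is a plain neighbor mean. All function names and the `sim_threshold` parameter are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def build_adjacency(clips, words, word_edges, sim_threshold=0.5):
    """Build one graph over clip nodes (first) and word nodes (after)."""
    n_c, n_w = len(clips), len(words)
    n = n_c + n_w
    adj = [[0.0] * n for _ in range(n)]

    def connect(i, j):
        adj[i][j] = adj[j][i] = 1.0

    for i in range(n):
        adj[i][i] = 1.0  # self-loops keep each node's own feature
    # Intra-modal video edges: neighboring clips, or semantically similar ones.
    for i in range(n_c):
        for j in range(i + 1, n_c):
            if j == i + 1 or cosine(clips[i], clips[j]) > sim_threshold:
                connect(i, j)
    # Intra-modal query edges: assumed syntactic-dependency pairs.
    for i, j in word_edges:
        connect(n_c + i, n_c + j)
    # Inter-modal edges: assumed fully connected between clips and words.
    for i in range(n_c):
        for j in range(n_w):
            connect(i, n_c + j)
    return adj

def gcn_layer(adj, feats, weight):
    """One step: ReLU(row-normalized adjacency @ features @ weight)."""
    norm_adj = [[v / sum(row) for v in row] for row in adj]
    out = matmul(matmul(norm_adj, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy input: three clips and two words with 2-d features.
clips = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
words = [[0.5, 0.5], [0.2, 0.8]]
adj = build_adjacency(clips, words, word_edges=[(0, 1)])
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = gcn_layer(adj, clips + words, identity)
```

With identity weights, each updated node is the mean of its neighbors' features, so a clip's representation mixes in both related clips and the query words. A real model would learn `weight` and would likely weight edges rather than treat them as binary.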
