VLG-Net: Video-Language Graph Matching Network for Video Grounding

Mattia Soldan; Mengmeng Xu; Sisi Qu; Jesper Tegner; Bernard Ghanem

arXiv:2011.10132 · cs.CV · August 17, 2021 · 5 cites

TL;DR

VLG-Net recasts video grounding as a graph matching problem, using Graph Convolutional Networks to align video and language representations, and outperforms state-of-the-art methods on ActivityNet-Captions, TACoS, and DiDeMo.

Contribution

The paper presents a novel Video-Language Graph Matching Network that models intra- and inter-modality relationships for enhanced video grounding performance.

Findings

01. Outperforms state-of-the-art methods on ActivityNet-Captions, TACoS, and DiDeMo datasets.

02. Effectively models multi-modal interactions with graph-based representations.

03. Demonstrates the effectiveness of graph matching in temporal video localization.

Abstract

Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding videos' and queries' semantic content and the fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs built atop video snippets and query tokens separately and used to model intra-modality relationships. A Graph Matching…
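The cross-modal exchange described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each video snippet and each query token is already a fixed-length feature vector, and it swaps in a plain dot-product attention step as a stand-in for a graph matching layer. The function names (`softmax`, `dot`, `match_video_to_query`) and the residual update are hypothetical choices made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_video_to_query(video_nodes, query_nodes):
    """Update each snippet node with a similarity-weighted mix of token nodes.

    Assumption: edges between the two graphs are dense (every snippet is
    connected to every token) and edge weights come from dot-product
    similarity. The paper's actual layer may differ.
    """
    updated = []
    for v in video_nodes:
        # Edge weights from this snippet to every query token.
        weights = softmax([dot(v, q) for q in query_nodes])
        # Message: weighted sum of token features, one value per dimension.
        message = [
            sum(w * q[d] for w, q in zip(weights, query_nodes))
            for d in range(len(v))
        ]
        # Residual update keeps the snippet's own content.
        updated.append([vi + mi for vi, mi in zip(v, message)])
    return updated


# Two toy snippet nodes and two toy token nodes in a 2-D feature space.
video_nodes = [[1.0, 0.0], [0.0, 1.0]]
query_nodes = [[1.0, 0.0], [0.0, 1.0]]
matched = match_video_to_query(video_nodes, query_nodes)
```

In this toy setup each snippet moves toward the token it most resembles, which is the intuition behind letting the two modalities exchange information before scoring candidate moments.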

Code & Models

Repositories

Soldelli/VLG-Net (PyTorch)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Video Language Graph Matching Network · Graph Convolutional Networks