Natural Language Video Localization: A Revisit in Span-based Question Answering Framework
Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh

TL;DR
This paper recasts natural language video localization (NLVL) as span-based question answering and adds a multi-scale split-and-concatenate strategy to improve accuracy on long videos.
Contribution
It proposes VSLNet, a span-based QA model with query-guided highlighting (QGH), and extends it to VSLNet-L to handle long videos.
Findings
VSLNet outperforms existing methods on benchmark datasets.
VSLNet-L effectively mitigates performance degradation on long videos.
Span-based QA is a promising approach for NLVL tasks.
Abstract
Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query. Existing approaches mainly solve the NLVL problem from the perspective of computer vision by formulating it as ranking, anchor, or regression tasks. These methods suffer from large performance degradation when localizing on long videos. In this work, we address the NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. To address the…
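To make the span-based QA framing concrete, the following is a minimal sketch of how a predicted span could be decoded once a model has produced per-position start and end probabilities over the video feature sequence. It is not the authors' implementation: the function names `best_span` and `span_to_seconds`, the product-of-probabilities score, and the uniform mapping from feature indices to seconds are all assumptions made for illustration.

```python
from typing import List, Tuple


def best_span(start_probs: List[float], end_probs: List[float]) -> Tuple[int, int, float]:
    """Pick (start, end) maximizing start_probs[start] * end_probs[end] with start <= end.

    Hypothetical helper: the scoring rule is a common choice in span-based QA,
    assumed here rather than taken from the paper.
    """
    if not start_probs or len(start_probs) != len(end_probs):
        raise ValueError("start_probs and end_probs must be non-empty and equal length")

    best = (0, 0, start_probs[0] * end_probs[0])
    best_start = 0  # index of the largest start probability seen so far
    for end in range(len(end_probs)):
        # Only starts at or before `end` are eligible, which enforces start <= end.
        if start_probs[end] > start_probs[best_start]:
            best_start = end
        score = start_probs[best_start] * end_probs[end]
        if score > best[2]:
            best = (best_start, end, score)
    return best


def span_to_seconds(start: int, end: int, num_features: int, duration: float) -> Tuple[float, float]:
    """Map feature indices to timestamps, assuming features are evenly spaced (an assumption)."""
    step = duration / num_features
    return start * step, (end + 1) * step


if __name__ == "__main__":
    # Toy probabilities: the unconstrained argmax would give start=1, end=0, which is invalid.
    start_probs = [0.1, 0.6, 0.2, 0.1]
    end_probs = [0.5, 0.1, 0.3, 0.1]
    s, e, score = best_span(start_probs, end_probs)
    print(s, e, span_to_seconds(s, e, num_features=4, duration=20.0))
```

The single pass keeps the running best start index, so the constraint that the start precedes the end is enforced in linear time rather than by scoring every pair of positions.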
