TL;DR
This paper introduces the Moment Context Network (MCN), which localizes the video segment matching a natural language description. Because existing video datasets lack pairs of localized segments and referring expressions, the authors also collect DiDeMo, a dataset of over 10,000 annotated videos.
Contribution
The paper presents MCN, a model for temporally localizing natural language queries in videos, and introduces DiDeMo, a new dataset of over 10,000 unedited personal videos annotated with pairs of localized video segments and referring expressions.
Findings
MCN outperforms baseline methods in localizing video moments.
The DiDeMo dataset supplies the segment-description pairs needed to train and evaluate localization models.
Initial results suggest promising directions for future research.
Abstract
We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms…
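The abstract says only that MCN integrates local and global video features over time. As a rough illustration of that idea, and not the authors' implementation, the sketch below scores every contiguous span of video chunks against a query vector. It assumes per-chunk feature vectors are already available and that the query has already been embedded into the same space as the moment feature. All names here (`mean_pool`, `moment_feature`, `localize`) and the use of squared distance are hypothetical choices for the example. No learned layers are included.

```python
def mean_pool(vectors):
    """Average a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def moment_feature(chunks, start, end):
    """Build a feature for the moment covering chunks[start..end] (inclusive).

    Concatenates a local feature (mean over the moment), a global feature
    (mean over the whole video), and normalized temporal endpoints.
    """
    local = mean_pool(chunks[start:end + 1])
    global_context = mean_pool(chunks)
    n = len(chunks)
    endpoints = [start / n, (end + 1) / n]
    return local + global_context + endpoints


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def localize(chunks, query_vec):
    """Return the (start, end) span whose feature is closest to query_vec."""
    candidates = [(s, e) for s in range(len(chunks))
                  for e in range(s, len(chunks))]
    return min(candidates,
               key=lambda se: sq_dist(moment_feature(chunks, *se), query_vec))


# Toy video with three chunks of 2-D features (made-up values).
chunks = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Stand-in for an embedded text query: here, the feature of chunks 1..2.
query = moment_feature(chunks, 1, 2)
best_span = localize(chunks, query)
```

In this toy setup the query is constructed from a known span, so the search recovers that span. In a real system the query vector would come from a trained language encoder.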
