QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
Jie Lei, Tamara L. Berg, Mohit Bansal

TL;DR
This paper introduces QVHighlights, a new dataset with over 10,000 videos and annotations for detecting video moments and highlights based on natural language queries, along with a transformer-based baseline model called Moment-DETR.
Contribution
The paper provides a large-scale dataset annotated for both moment detection and highlight detection from NL queries, and a transformer model that predicts moments directly and whose performance improves with weakly supervised pretraining on ASR captions.
Findings
Moment-DETR achieves competitive results without human priors.
Weakly supervised pretraining with ASR captions significantly improves performance.
The dataset enables diverse and flexible query-based highlight detection.
Abstract
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHIGHLIGHTS) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong…
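The three-part annotation described in the abstract can be pictured as a small record type. The following is a minimal sketch only: the class name, field names, the 1 to 5 score range, and the example values are illustrative assumptions, not the dataset's actual release format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QueryAnnotation:
    """Hypothetical record for one annotated video (names are illustrative)."""
    video_id: str
    query: str                                   # (1) free-form NL query
    moments: List[Tuple[float, float]]           # (2) relevant (start, end) spans, in seconds
    saliency: List[int] = field(default_factory=list)  # (3) one score per query-relevant clip

    def validate(self, scale_min: int = 1, scale_max: int = 5) -> None:
        # The 1..5 range is an assumption standing in for the "five-point scale".
        for start, end in self.moments:
            if not 0 <= start < end:
                raise ValueError(f"invalid moment span: ({start}, {end})")
        for score in self.saliency:
            if not scale_min <= score <= scale_max:
                raise ValueError(f"saliency score out of range: {score}")

    def total_relevant_seconds(self) -> float:
        # Total duration covered by the query-relevant moments.
        return sum(end - start for start, end in self.moments)


# Made-up example values, for illustration only.
example = QueryAnnotation(
    video_id="example_video",
    query="A family is playing basketball together",
    moments=[(10.0, 20.0), (40.0, 46.0)],
    saliency=[3, 4, 5, 2],
)
example.validate()
total = example.total_relevant_seconds()
```

Keeping moments as a list of spans reflects that a single query may match more than one segment of the same video.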
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
