QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

Jie Lei; Tamara L. Berg; Mohit Bansal

arXiv:2107.09609·cs.CV·November 30, 2021·22 cites

TL;DR

This paper introduces QVHighlights, a new dataset with over 10,000 videos and annotations for detecting video moments and highlights based on natural language queries, along with a transformer-based baseline model called Moment-DETR.

Contribution

The paper provides the first large-scale dataset for NL-based video moment detection, together with Moment-DETR, a transformer model that predicts moments directly. Weakly supervised pretraining further improves the model's performance.

Findings

01. Moment-DETR achieves competitive results without human priors.

02. Weakly supervised pretraining with ASR captions significantly improves performance.

03. The dataset enables diverse and flexible query-based highlight detection.

Abstract

Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHIGHLIGHTS) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong…
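The three annotation types listed in the abstract (query, relevant moments, per-clip saliency scores) can be pictured as a single record. The following is a minimal sketch, not the dataset's actual schema: the class name, field names, the 2-second clip length, and the 0 to 4 encoding of the five-point scale are all assumptions made for illustration, and the example query is invented.

```python
from dataclasses import dataclass
from math import ceil
from typing import Dict, List, Tuple

CLIP_SECONDS = 2.0       # assumed clip length; not stated in the abstract
SCORE_LEVELS = range(5)  # assumed encoding of the five-point scale as 0..4


@dataclass
class QueryAnnotation:
    """Hypothetical record for one annotated (video, query) pair."""
    query: str                          # free-form natural language query
    moments: List[Tuple[float, float]]  # relevant (start, end) spans in seconds
    saliency: Dict[int, int]            # clip index -> saliency score

    def relevant_clips(self) -> List[int]:
        """Indices of the fixed-length clips covered by any relevant moment."""
        clips = set()
        for start, end in self.moments:
            first = int(start // CLIP_SECONDS)
            last = ceil(end / CLIP_SECONDS)  # exclusive upper bound
            clips.update(range(first, last))
        return sorted(clips)

    def is_consistent(self) -> bool:
        """True if exactly the query-relevant clips carry a valid score."""
        same_clips = sorted(self.saliency) == self.relevant_clips()
        valid_scores = all(s in SCORE_LEVELS for s in self.saliency.values())
        return same_clips and valid_scores


# Invented example: two relevant moments, each spanning two clips.
example = QueryAnnotation(
    query="A family is playing basketball together",
    moments=[(0.0, 4.0), (10.0, 14.0)],
    saliency={0: 3, 1: 4, 5: 2, 6: 2},
)
```

Under these assumptions, moment detection corresponds to predicting the `moments` spans, and highlight detection corresponds to predicting the `saliency` scores for the clips inside them.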

Code & Models

Datasets

yaolily/TimeChat-Online-139K (dataset, 3.1k downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
