VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

Houlun Chen; Xin Wang; Hong Chen; Zeyang Zhang; Wei Feng; Bin Huang; Jia Jia; Wenwu Zhu

arXiv:2410.08593·cs.CV·October 14, 2024

TL;DR

This paper introduces a fine-grained Video Corpus Moment Retrieval (VCMR) benchmark built with VERIFIED, an automatic annotation pipeline that uses large language models and large multimodal models to generate detailed, high-quality video captions for evaluating precise moment localization.

Contribution

The paper presents VERIFIED, a novel automatic annotation pipeline leveraging LLMs and LMMs for creating a challenging fine-grained VCMR dataset with high-quality annotations.

Findings

01 State-of-the-art models perform significantly better on coarse-grained tasks than on fine-grained ones

02 The new benchmark reveals gaps in current fine-grained video understanding methods

03 The VERIFIED dataset improves the evaluation of precise video moment localization

Abstract

Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus with other partially matched candidates. To improve the dataset construction efficiency and guarantee high-quality data annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline to generate captions with RelIable FInE-grained statics and Dynamics. Specifically, we resort to large language models (LLM) and large multimodal models (LMM) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out the…
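
The localization task described in the abstract can be made concrete with a small scoring sketch. VCMR predictions are commonly scored by checking whether a retrieved moment comes from the correct video and overlaps the ground-truth span by at least some temporal IoU threshold. The code below is a minimal illustration of that general idea, not the paper's evaluation code; the names `Moment`, `temporal_iou`, and `hit_at_k`, and the default threshold of 0.5, are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation of a moment: (video_id, start_sec, end_sec).
Moment = Tuple[str, float, float]


def temporal_iou(pred: Moment, truth: Moment) -> float:
    """Temporal intersection-over-union; 0.0 if the moments are in different videos."""
    if pred[0] != truth[0]:
        return 0.0
    inter = max(0.0, min(pred[2], truth[2]) - max(pred[1], truth[1]))
    union = (pred[2] - pred[1]) + (truth[2] - truth[1]) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked: List[Moment], truth: Moment, k: int = 1,
             iou_threshold: float = 0.5) -> bool:
    """True if any of the top-k ranked moments matches the ground truth.

    A match requires the same video and temporal IoU >= iou_threshold
    (the threshold value is an assumption for this sketch).
    """
    return any(temporal_iou(p, truth) >= iou_threshold for p in ranked[:k])


# Usage: the top-ranked moment is from the wrong video, the second one overlaps well.
truth = ("v1", 10.0, 20.0)
ranked = [("v2", 10.0, 20.0), ("v1", 12.0, 20.0)]
top1_hit = hit_at_k(ranked, truth, k=1)
top2_hit = hit_at_k(ranked, truth, k=2)
```

Under a scheme like this, a partially matched candidate from another video scores zero regardless of how closely its caption resembles the query, which is why a corpus with many near-miss candidates makes the task harder.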

Code & Models

Repositories

hlchen23/verified (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Attentive Walk-Aggregating Graph Neural Network