VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding
Houlun Chen, Xin Wang, Hong Chen, Zeyang Zhang, Wei Feng, Bin Huang, Jia Jia, Wenwu Zhu

TL;DR
This paper introduces VERIFIED, a new fine-grained Video Corpus Moment Retrieval benchmark that uses an automatic annotation pipeline with large language and multimodal models to generate high-quality, detailed video captions for improved localization accuracy.
Contribution
The paper presents VERIFIED, a novel automatic annotation pipeline leveraging LLMs and LMMs for creating a challenging fine-grained VCMR dataset with high-quality annotations.
Findings
State-of-the-art models perform significantly better on coarse-grained tasks than on fine-grained ones
The new benchmark reveals gaps in current fine-grained video understanding methods
VERIFIED dataset improves the evaluation of precise video moment localization
Abstract
Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus with other partially matched candidates. To improve the dataset construction efficiency and guarantee high-quality data annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline to generate captions with RelIable FInE-grained statics and Dynamics. Specifically, we resort to large language models (LLM) and large multimodal models (LMM) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out the…
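As an illustration of what "localizing the best-matched moment from the corpus" involves, the sketch below scores a ranked list of candidate moments against one ground-truth moment. It assumes the temporal-IoU and recall-at-k conventions commonly used for moment retrieval; the excerpt above does not state the benchmark's actual metrics, and the names `Moment`, `temporal_iou`, and `hit_at_k` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation of a moment: (video_id, start_sec, end_sec).
Moment = Tuple[str, float, float]


def temporal_iou(a_start: float, a_end: float,
                 b_start: float, b_end: float) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked: List[Moment], truth: Moment,
             k: int, iou_threshold: float) -> bool:
    """True if any of the top-k candidates is in the correct video and
    overlaps the ground-truth moment by at least iou_threshold.

    Assumption: a candidate from a different video never counts, even if
    its timestamps happen to match.
    """
    truth_video, truth_start, truth_end = truth
    for video_id, start, end in ranked[:k]:
        if video_id != truth_video:
            continue
        if temporal_iou(start, end, truth_start, truth_end) >= iou_threshold:
            return True
    return False


# Usage: the top-ranked candidate has the right timestamps but the wrong
# video, so it is only a partial match; the second candidate is the hit.
truth = ("v1", 10.0, 20.0)
ranked = [("v2", 10.0, 20.0), ("v1", 12.0, 20.0)]
top1 = hit_at_k(ranked, truth, k=1, iou_threshold=0.7)
top2 = hit_at_k(ranked, truth, k=2, iou_threshold=0.7)
```

Under these assumptions, a fine-grained query is harder because several candidates can look plausible (same scene, similar action) while only one satisfies both the video and the overlap condition.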
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Attentive Walk-Aggregating Graph Neural Network
