DeVAn: Dense Video Annotation for Video-Language Models
Tingkai Liu, Yunzhe Tao, Haogeng Liu, Qihang Fan, Ding Zhou, Huaibo Huang, Ran He, Hongxia Yang

TL;DR
DeVAn is a new dataset with human-annotated video descriptions and summaries for evaluating visual-language models on real-world videos, including captioning, summarization, and retrieval tasks.
Contribution
The paper introduces DeVAn, a dataset of densely annotated videos, and benchmarks current models on captioning, summarization, and retrieval, emphasizing human-aligned evaluation metrics.
Findings
Model-based metrics align better with human preferences.
Current models show room for improvement in dense video captioning.
DeVAn provides a standardized benchmark for future video-language research.
Abstract
We present a novel human annotated dataset for evaluating the ability for visual-language models to generate both short and long descriptions for real-world video clips, termed DeVAn (Dense Video Annotation). The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests. Each video clip is independently annotated by 5 human annotators, producing both captions (1 sentence) and summaries (3-10 sentences). Given any video selected from the dataset and its corresponding ASR information, we evaluate visual-language models on either caption or summary generation that is grounded in both the visual and auditory content of the video. Additionally, models are also evaluated on caption- and summary-based retrieval tasks, where the summary-based retrieval task requires the identification of a target video given excerpts of a given…
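The text-to-video retrieval tasks mentioned in the abstract are commonly scored by ranking candidate videos against a text query in a shared embedding space. The sketch below illustrates that idea with Recall@K over cosine similarity. It is not the paper's evaluation code: the choice of Recall@K, the function names `cosine` and `recall_at_k`, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(text_embs, video_embs, k):
    """Fraction of text queries whose target video appears in the top-k results.

    Assumes text_embs[i] is the query (caption or summary excerpt) whose
    ground-truth target is video_embs[i].
    """
    hits = 0
    for i, query in enumerate(text_embs):
        scores = [cosine(query, video) for video in video_embs]
        # Rank video indices from most to least similar to the query.
        ranked = sorted(range(len(video_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings (hypothetical values, not from any real model).
video_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

r1 = recall_at_k(text_embs, video_embs, k=1)
r2 = recall_at_k(text_embs, video_embs, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

In this toy setup the third query points at the first video more strongly than at its own target, so it only counts as a hit once K is raised to 2. In practice the embeddings would come from a text encoder and a video encoder, for example a CLIP-style model as listed under Methods.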
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
