DeVAn: Dense Video Annotation for Video-Language Models

Tingkai Liu; Yunzhe Tao; Haogeng Liu; Qihang Fan; Ding Zhou; Huaibo Huang; Ran He; Hongxia Yang

arXiv:2310.05060·cs.CV·August 12, 2024

TL;DR

DeVAn is a new dataset with human-annotated video descriptions and summaries for evaluating visual-language models on real-world videos, including captioning, summarization, and retrieval tasks.

Contribution

It introduces DeVAn, a comprehensive dataset with dense annotations for videos, and benchmarks current models on captioning, summarization, and retrieval, emphasizing human-aligned evaluation metrics.

Findings

01

Model-based metrics align better with human preferences.

02

Current models show room for improvement in dense video captioning.

03

DeVAn provides a standardized benchmark for future video-language research.

Abstract

We present a novel human-annotated dataset, termed DeVAn (Dense Video Annotation), for evaluating the ability of visual-language models to generate both short and long descriptions of real-world video clips. The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests. Each video clip is independently annotated by 5 human annotators, producing both captions (1 sentence) and summaries (3-10 sentences). Given any video selected from the dataset and its corresponding ASR information, we evaluate visual-language models on either caption or summary generation that is grounded in both the visual and auditory content of the video. Models are also evaluated on caption- and summary-based retrieval tasks, where the summary-based retrieval task requires the identification of a target video given excerpts of a given…
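
To make the retrieval setting concrete, here is a minimal sketch of how a text-to-video retrieval score might be computed. The Recall@K metric, the `recall_at_k` function, and the similarity-matrix layout (query i matches video i) are assumptions chosen for illustration; the abstract does not state which retrieval metric the benchmark reports.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose target video ranks in the top k.

    Assumption: similarity[i][j] is a model-produced score between text
    query i (a caption or summary excerpt) and video j, and the correct
    video for query i is video i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("similarity matrix must not be empty")

    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the target: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Hypothetical scores for three queries against three videos.
scores = [
    [0.9, 0.1, 0.2],  # query 0: target video ranks first
    [0.8, 0.3, 0.1],  # query 1: target video ranks second
    [0.2, 0.1, 0.7],  # query 2: target video ranks first
]
print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```

With these made-up scores, two of the three queries retrieve their target video at rank one, and all three do within the top two.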

Code & Models

Repositories

tk-21st/devan (PyTorch, official)

Videos

DeVAn: Dense Video Annotation for Video-Language Models

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training