FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning

Shiyu Hu; Xuchen Li; Xuzhao Li; Jing Zhang; Yipei Wang; Xin Zhao; Kang Hao Cheong

arXiv:2410.15270 · cs.CV · May 20, 2025

TL;DR

FIOVA introduces a multi-annotator, human-centric benchmark and a new event-level evaluation metric for assessing the alignment of video captioning models with human perception, addressing limitations of existing benchmarks.

Contribution

The paper presents FIOVA, a novel multi-annotator benchmark with a new evaluation metric, enabling detailed analysis of model alignment with human understanding in video captioning.

Findings

01. FIOVA captures semantic diversity and agreement among annotators.

02. Evaluation reveals gaps in model performance across complexity levels.

03. Structural issues like event under-description are identified.

Abstract

Despite rapid progress in large vision-language models (LVLMs), existing video caption benchmarks remain limited in evaluating their alignment with human understanding. Most rely on a single annotation per video and lexical similarity-based metrics, failing to capture the variability in human perception and the cognitive importance of events. These limitations hinder accurate diagnosis of model capabilities in producing coherent, complete, and human-aligned descriptions. To address this, we introduce FIOVA (Five-In-One Video Annotations), a human-centric benchmark tailored for evaluation. It comprises 3,002 real-world videos (about 33.6s each), each annotated independently by five annotators. This design enables modeling of semantic diversity and inter-subjective agreement, offering a richer foundation for measuring human-machine alignment. We further propose FIOVA-DQ, an event-level…
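To make the five-annotations-per-video design concrete, here is a minimal sketch of how agreement among one video's captions could be scored. It is not the paper's method and not FIOVA-DQ: the captions are invented for illustration, the helper names (`tokens`, `jaccard`, `mean_pairwise_agreement`) are hypothetical, and token-level Jaccard overlap is a crude lexical proxy chosen only to show the shape of the data.

```python
from itertools import combinations


def tokens(caption: str) -> set[str]:
    """Lowercase a caption and return its set of punctuation-stripped words."""
    return {w.strip(".,;:!?") for w in caption.lower().split()} - {""}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_agreement(captions: list[str]) -> float:
    """Average Jaccard overlap over every pair of captions for one video."""
    if len(captions) < 2:
        raise ValueError("need at least two captions to measure agreement")
    token_sets = [tokens(c) for c in captions]
    pairs = list(combinations(token_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical captions from five independent annotators for one video
# (illustrative only; not taken from the FIOVA dataset).
video_captions = [
    "A man walks a dog along the beach.",
    "A person and a dog walk on the sand.",
    "A dog runs beside its owner near the sea.",
    "Someone is walking a dog by the water.",
    "A man strolls on the beach with his dog.",
]

score = mean_pairwise_agreement(video_captions)
print(f"mean pairwise agreement: {score:.3f}")
```

A score near 1 would mean the annotators used nearly the same words, and a score near 0 would mean they shared almost none. Because paraphrases can agree in meaning while sharing few words, a lexical proxy like this understates agreement, which is the limitation of lexical similarity metrics that the abstract points to.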

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications