A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos
David Miranda Paredes, Jose M. Saavedra, Marcelo Pizarro

TL;DR
This paper evaluates eight open-source Video Large Language Models for automatic captioning of news videos using standard and novel fidelity metrics, revealing strengths and limitations in current models.
Contribution
It introduces two new fidelity metrics, TFS and EFS, to better assess thematic and entity coverage in news video captioning models.
Findings
Gemma 3 achieves the highest performance across datasets.
Standard metrics have limited discriminative power for news captioning.
TFS and EFS effectively measure thematic and entity fidelity.
Abstract
News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS). Our…
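As a rough illustration of two of the standard metrics named in the abstract, the sketch below computes a ROUGE-L F-score from the longest common subsequence of two captions, and a Mean Reciprocal Rank from a list of ranks. It is not the authors' evaluation code. The whitespace tokenization, the balanced F1 weighting, the convention that rank 0 means "not retrieved", and the function names `lcs_length`, `rouge_l_f1`, and `mean_reciprocal_rank` are all assumptions made for this example. TFS and EFS are not sketched because the abstract does not define them.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as a balanced F1 over LCS precision and recall.

    Assumption: lowercased whitespace tokens; published implementations
    may tokenize, stem, or weight recall differently.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Mean of 1/rank over queries.

    Assumption: ranks are 1-based; 0 marks a query whose correct item
    was not retrieved and contributes 0 to the mean.
    """
    if not ranks:
        return 0.0
    return sum(1.0 / r if r > 0 else 0.0 for r in ranks) / len(ranks)


if __name__ == "__main__":
    generated = "the president visited the city"
    reference = "the president visited the capital"
    print(round(rouge_l_f1(generated, reference), 3))   # 0.8
    print(round(mean_reciprocal_rank([1, 2, 4]), 3))    # 0.583
```

In the toy example, four of the five tokens match in order, so precision and recall are both 0.8. How the paper computes its ranks for Mean Reciprocal Rank is not stated in the abstract, so the rank list here is purely illustrative.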
