ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning

Po-han Li; Shenghui Chen; Ufuk Topcu; Sandeep Chinchali

arXiv:2601.09851·cs.CV·January 28, 2026

TL;DR

This paper introduces ViSIL, an information-theoretic metric for evaluating multimodal video summaries. It quantifies the information a summary loses relative to the source video, allows direct comparison across summary formats, and correlates significantly with human and VLM performance on VQA tasks.

Contribution

The paper presents ViSIL, a unified, quantitative metric for assessing information coverage in multimodal video summaries. It addresses a limitation of traditional metrics such as BLEU and ROUGE, which cannot compare coverage across disparate modalities.

Findings

01

ViSIL correlates significantly with human and VLM performance on VQA tasks.

02

ViSIL enables selection of summaries that optimize information retention and processing speed.

03

ViSIL-guided summary selection outperforms text summaries by 7% in VQA accuracy without extra computational cost.
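
Finding 02 describes choosing summaries that balance information retention against processing speed. One common way to frame a two-objective choice like this is a Pareto front; the sketch below illustrates that idea only and is not taken from the paper. The `Candidate` type, the helper names, and every number are hypothetical placeholders, and the `info_loss` field merely stands in for a ViSIL-style score where lower is better.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    info_loss: float  # hypothetical ViSIL-style score; lower means less information lost
    seconds: float    # hypothetical processing time; lower is faster


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep candidates that no other candidate beats on both objectives."""
    front = []
    for c in cands:
        dominated = any(
            o.info_loss <= c.info_loss
            and o.seconds <= c.seconds
            and (o.info_loss < c.info_loss or o.seconds < c.seconds)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.seconds)


def pick_within_budget(cands: list[Candidate], max_seconds: float) -> Optional[Candidate]:
    """Among Pareto-optimal candidates within the time budget, pick the lowest loss."""
    feasible = [c for c in pareto_front(cands) if c.seconds <= max_seconds]
    return min(feasible, key=lambda c: c.info_loss) if feasible else None


# Made-up values for illustration; none of these come from the paper.
candidates = [
    Candidate("text_only", 0.9, 1.0),
    Candidate("text_plus_2_keyframes", 0.6, 2.0),
    Candidate("text_plus_8_keyframes", 0.4, 5.0),
    Candidate("8_keyframes_only", 0.7, 4.5),
]

front = pareto_front(candidates)
choice = pick_within_budget(candidates, max_seconds=3.0)
```

With these placeholder values, `8_keyframes_only` is dropped because `text_plus_2_keyframes` is both faster and less lossy, and a 3-second budget selects `text_plus_2_keyframes`.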

Abstract

Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a…

Code & Models

Datasets

Po-han/ViSIL_Multimodal-Video-Summary
dataset · 234 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling