ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning
Po-han Li, Shenghui Chen, Ufuk Topcu, Sandeep Chinchali

TL;DR
This paper introduces ViSIL, an information-theoretic metric for evaluating multimodal video summaries. It quantifies information loss across diverse summary formats, and its scores correlate with human and VLM performance on VQA tasks.
Contribution
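The paper presents ViSIL, a unified, quantitative metric for assessing information coverage in multimodal video summaries. It addresses a limitation of traditional metrics such as BLEU and ROUGE, which cannot compare summaries across modalities (for example, a paragraph of text against a sequence of keyframes).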
Findings
ViSIL correlates significantly with human and VLM performance on VQA tasks.
ViSIL enables selection of summaries that optimize information retention and processing speed.
ViSIL-guided summary selection outperforms text summaries by 7% in VQA accuracy without extra computational cost.
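The selection idea in the findings above can be illustrated with a minimal sketch. It assumes each candidate summary already has an information-loss score and a processing cost, keeps the candidates that no other candidate beats on both axes, and then picks the lowest-loss one within a cost budget. The names `Candidate`, `pareto_front`, and `pick_under_budget`, and all numbers in `EXAMPLE`, are hypothetical placeholders and are not taken from the paper; this is not the authors' procedure.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str         # hypothetical summary format label
    info_loss: float  # hypothetical information-loss score, lower is better
    cost: float       # hypothetical processing cost, lower is better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates not dominated on (info_loss, cost), sorted by cost."""
    front = []
    for c in candidates:
        # c is dominated if another candidate is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            o.info_loss <= c.info_loss
            and o.cost <= c.cost
            and (o.info_loss < c.info_loss or o.cost < c.cost)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost)


def pick_under_budget(
    candidates: List[Candidate], max_cost: float
) -> Optional[Candidate]:
    """Pick the lowest-loss non-dominated candidate within the cost budget."""
    feasible = [c for c in pareto_front(candidates) if c.cost <= max_cost]
    return min(feasible, key=lambda c: c.info_loss) if feasible else None


# Made-up placeholder values for illustration only; not results from the paper.
EXAMPLE = [
    Candidate("text_only", 0.9, 1.0),
    Candidate("text_plus_2_keyframes", 0.6, 2.0),
    Candidate("text_plus_8_keyframes", 0.4, 5.0),
    Candidate("8_keyframes_only", 0.7, 4.5),
]

if __name__ == "__main__":
    for c in pareto_front(EXAMPLE):
        print(c)
    print(pick_under_budget(EXAMPLE, max_cost=3.0))
```

In this toy data, the keyframes-only candidate is dropped because another candidate has both lower loss and lower cost. How the loss score and the cost are actually measured is left open here.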
Abstract
Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling
