QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Woojun Jung; Junyeong Kim

arXiv:2604.24052·cs.CV·April 28, 2026

TL;DR

QEVA is a reference-free evaluation metric for narrative video summarization. It scores candidate summaries directly against the source video through multimodal question answering, so no human-written reference summaries are needed.

Contribution

The paper introduces QEVA, a metric based on multimodal question answering, and MLVU(VS)-Eval, an annotated benchmark of 800 summaries generated from 200 videos in the MLVU dataset, for evaluating video summaries.

Findings

01

QEVA correlates better with human judgments than existing metrics.

02

The MLVU(VS)-Eval benchmark provides a transparent framework for evaluation.

03

QEVA evaluates summaries on Coverage, Factuality, and Chronology.
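
Agreement between a metric and human judgments (finding 01) is commonly measured with a rank correlation. The excerpt does not say which correlation statistic the paper reports, so the following is only a minimal sketch of one standard choice, Spearman's rank correlation, written in plain Python. The function names and the toy scores are illustrative assumptions, not data or code from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


# Toy, made-up scores for five summaries (not results from the paper).
metric_scores = [0.62, 0.48, 0.91, 0.30, 0.75]
human_scores = [3.5, 3.0, 4.5, 2.0, 4.0]
rho = spearman(metric_scores, human_scores)
```

A metric whose ranking of summaries matches the human ranking more closely yields a value of `rho` nearer to 1.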

Abstract

Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows…
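
The abstract does not specify how question-answering outcomes are turned into Coverage, Factuality, and Chronology scores. Purely as an illustration of how a question-answering-based metric can report one score per dimension, the sketch below assumes each question carries a dimension label and a boolean outcome, and reports per-dimension accuracy. The `QAResult` type, its fields, and the aggregation rule are all hypothetical and should not be read as QEVA's actual procedure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAResult:
    # Hypothetical record: one question asked about a (video, summary) pair.
    dimension: str  # "coverage", "factuality", or "chronology"
    correct: bool   # assumed outcome: answer based on the summary agrees with the video


def per_dimension_scores(results):
    """Fraction of correct outcomes per dimension (an assumed aggregation rule)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.dimension] += 1
        hits[r.dimension] += int(r.correct)
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Made-up outcomes for a single summary.
results = [
    QAResult("coverage", True),
    QAResult("coverage", False),
    QAResult("factuality", True),
    QAResult("chronology", False),
]
scores = per_dimension_scores(results)
```

Keeping the three scores separate, instead of collapsing them into one number, lets a reader see whether a summary fails by omitting events, misstating them, or misordering them.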
