ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models

Yueqian Wang; Xiaojun Meng; Yifan Wang; Huishuai Zhang; Dongyan Zhao

arXiv:2507.09313·cs.CV·July 16, 2025

TL;DR

ProactiveVideoQA is a new benchmark, paired with a new metric, for evaluating how well multimodal video systems interact proactively in real time, with emphasis on the timing accuracy of responses and on user experience.

Contribution

The paper presents the first comprehensive benchmark and a novel temporal-aware metric, PAUC, for assessing proactive interaction in multimodal video dialogue systems.

Findings

01. PAUC aligns better with human preferences than traditional metrics.

02. Baseline systems show room for improvement in proactive interaction capabilities.

03. The benchmark facilitates future research in proactive multimodal dialogue systems.

Abstract

With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to take more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveVideoQA, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveVideoQA and a user study of…

Code & Models

Datasets

wangyueqian/ProactiveVideoQA
dataset · 103 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling