A Video-grounded Dialogue Dataset and Metric for Event-driven Activities
Wiradee Imrattanatrai, Masaki Asada, Kimihiro Hasegawa, Zhi-Qi Cheng, Ken Fukuda, Teruko Mitamura

TL;DR
This paper introduces VDAct, a challenging new dataset for video-grounded dialogue on event-driven activities, and VDEval, a novel evaluation metric that better aligns with human judgments by considering dialogue history and video summaries.
Contribution
The paper provides a new dataset with complex videos and dialogues, and a new evaluation metric that improves response assessment accuracy over existing methods.
Findings
State-of-the-art models struggle with complex question types.
VDEval correlates more strongly with human judgments.
VDAct offers diverse and challenging activity scenarios.
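The second finding concerns agreement between an automatic metric and human judgments. As a rough illustration of how such agreement is often quantified, the sketch below computes a Spearman rank correlation between human ratings and two metrics. The choice of Spearman correlation, the function names, and all scores are assumptions made for illustration; they are not taken from the paper and do not reproduce VDEval or its reported results.

```python
# Illustrative sketch only: the scores below are invented toy values,
# not data or results from the VDAct / VDEval paper.

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical ratings for five responses (higher is better).
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]  # hypothetical metric, mostly agrees
metric_b = [0.5, 0.1, 0.9, 0.2, 0.6]  # hypothetical metric, weakly agrees

print(round(spearman(human, metric_a), 3))
print(round(spearman(human, metric_b), 3))
```

Under this kind of comparison, the metric with the higher correlation orders responses more like human annotators do, which is the sense in which one metric can be said to align better with human judgments than another.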
Abstract
This paper presents VDAct, a dataset for a Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences that depict a variety of event-driven activities that require advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct displays a notably challenging characteristic due to its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition
