A Video-grounded Dialogue Dataset and Metric for Event-driven Activities

Wiradee Imrattanatrai; Masaki Asada; Kimihiro Hasegawa; Zhi-Qi Cheng; Ken Fukuda; Teruko Mitamura

arXiv:2501.18324 · cs.CV · January 31, 2025

TL;DR

This paper introduces VDAct, a challenging new dataset for video-grounded dialogue on event-driven activities, and VDEval, a novel evaluation metric that better aligns with human judgments by considering dialogue history and video summaries.

Contribution

The paper provides a new dataset with complex videos and dialogues, and a new evaluation metric that improves response assessment accuracy over existing methods.

Findings

01 State-of-the-art models struggle with complex question types.

02 VDEval correlates more strongly with human judgments.

03 VDAct offers diverse and challenging activity scenarios.

Abstract

This paper presents VDAct, a dataset for video-grounded dialogue on event-driven activities, alongside VDEval, a session-based context evaluation metric designed specifically for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences depicting a variety of event-driven activities, which require advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct is notably challenging because of its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content…

Code & Models

Repositories

aistairc/VDAct (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition