VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, and Dongyan Zhao

TL;DR
This paper introduces VSTAR, a large-scale dataset for video-grounded dialogue understanding focusing on scene and topic transitions, and provides benchmarks demonstrating the importance of multimodal segments.
Contribution
The paper presents the VSTAR dataset and benchmarks for scene segmentation, topic segmentation, and dialogue generation in video-grounded dialogues, highlighting the significance of scene and topic transitions.
Findings
Multimodal segments are crucial for dialogue understanding.
VSTAR dataset enables comprehensive evaluation of video-grounded dialogue tasks.
Experiments show the effectiveness of segment-aware models.
Abstract
Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in…
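To make the segmentation benchmarks more concrete, here is a minimal sketch of one common way to frame dialogue segmentation: each turn gets a binary label saying whether a segment ends there, and predictions are scored by boundary F1. This framing, the function names, and the toy labels are illustrative assumptions and are not taken from the paper, whose exact task formulation and metrics are not given in this summary.

```python
from typing import List, Tuple


def boundaries_to_segments(labels: List[int]) -> List[Tuple[int, int]]:
    """Convert per-turn boundary labels into (start, end) segments.

    Assumed convention (hypothetical, not from the paper):
    labels[i] == 1 means a segment ends at turn i. End indices are inclusive.
    """
    segments = []
    start = 0
    for i, flag in enumerate(labels):
        # Close a segment at every boundary and at the final turn.
        if flag == 1 or i == len(labels) - 1:
            segments.append((start, i))
            start = i + 1
    return segments


def boundary_f1(pred: List[int], gold: List[int]) -> float:
    """F1 over boundary positions, ignoring the final turn,
    which always closes a segment and carries no information."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    pred_inner, gold_inner = pred[:-1], gold[:-1]
    true_pos = sum(1 for p, g in zip(pred_inner, gold_inner) if p == 1 and g == 1)
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred_inner)
    recall = true_pos / sum(gold_inner)
    return 2 * precision * recall / (precision + recall)


# Toy example: seven dialogue turns with two gold topic boundaries.
gold_labels = [0, 0, 1, 0, 1, 0, 0]
pred_labels = [0, 1, 0, 0, 1, 0, 0]
gold_segments = boundaries_to_segments(gold_labels)
score = boundary_f1(pred_labels, gold_labels)
```

In this toy case the gold labels split the dialogue into three segments, and the prediction recovers one of the two gold boundaries. The same labeling scheme could be applied to video clips for scene segmentation, again as an assumption rather than a description of the VSTAR setup.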
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Attentive Walk-Aggregating Graph Neural Network
