VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, and Dongyan Zhao

arXiv:2305.18756·cs.CV·May 31, 2023·1 cites

TL;DR

This paper introduces VSTAR, a large-scale dataset for video-grounded dialogue understanding focusing on scene and topic transitions, and provides benchmarks demonstrating the importance of multimodal segments.

Contribution

The paper presents the VSTAR dataset and benchmarks for scene segmentation, topic segmentation, and dialogue generation in video-grounded dialogues, highlighting the significance of scene and topic transitions.

Findings

01. Multimodal segments are crucial for dialogue understanding.

02. The VSTAR dataset enables comprehensive evaluation of video-grounded dialogue tasks.

03. Experiments show the effectiveness of segment-aware models.

Abstract

Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse, and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities as a frame-independent visual understanding task, neglecting intrinsic attributes of multimodal dialogues such as scene and topic transitions. In this paper, we present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding, scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in…

Code & Models

Repositories

patrick-tssn/VSTAR (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Attentive Walk-Aggregating Graph Neural Network