Game-Based Video-Context Dialogue
Ramakanth Pasunuru, Mohit Bansal

TL;DR
This paper introduces a new multimodal dialogue dataset based on live soccer videos and chats, enabling the development of models that generate contextually relevant dialogue grounded in dynamic visual and conversational data.
Contribution
The paper presents a novel video-context, multi-speaker dialogue dataset, together with baseline models for visually-grounded dialogue in dynamic, real-world scenarios.
Findings
Models can generate relevant dialogue from video and chat context.
The dataset enables evaluation of multimodal dialogue systems.
Baseline models show promising results with room for improvement.
Abstract
Current dialogue systems focus more on textual and speech context knowledge and are usually based on two speakers. Some recent work has investigated static image-based dialogue. However, several real-world human interactions also involve dynamic visual context (similar to videos) as well as dialogue exchanges among multiple speakers. To move closer towards such multimodal conversational skills and visually-situated applications, we introduce a new video-context, many-speaker dialogue dataset based on live-broadcast soccer game videos and chats from Twitch.tv. This challenging testbed allows us to develop visually-grounded dialogue models that should generate relevant temporal and spatial event language from the live video, while also being relevant to the chat history. For strong baselines, we also present several discriminative and generative models, e.g., based on tridirectional…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
