Video Dialog as Conversation about Objects Living in Space-Time
Hoang-Anh Pham, Thao Minh Le, Vuong Le, Tu Minh Phuong, Truyen Tran

TL;DR
This paper introduces COST, a novel object-centric framework for video dialog that enables high-level reasoning about space-time visual content, improving conversational understanding about videos.
Contribution
The paper presents COST, a new neural reasoning framework that tracks object trajectories and dialog states for enhanced video-based conversational AI.
Findings
COST achieves competitive results on DSTC7 and DSTC8 benchmarks.
Object trajectory parsing improves reasoning over video content.
Maintaining dialog and object states enhances answer relevance.
Abstract
It would be a technological feat to create a system that can hold a meaningful conversation with humans about what they watch. A setup toward that goal is presented as a video dialog task, where the system is asked to generate natural utterances in response to a question in an ongoing dialog. The task poses great visual, linguistic, and reasoning challenges that cannot be easily overcome without an appropriate representation scheme over video and dialog that supports high-level reasoning. To tackle these challenges, we present COST (Conversation about Objects in Space-Time), a new object-centric framework for video dialog that supports neural reasoning. Here, dynamic space-time visual content in videos is first parsed into object trajectories. Given this video abstraction, COST maintains and tracks object-associated dialog states, which are updated upon…
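The two steps named in the abstract, parsing a video into object trajectories and keeping one dialog state per object, can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`Detection`, `parse_trajectories`, `summarize`, `update_states`) is hypothetical, the mean-pooling and the hand-written sigmoid gate stand in for learned neural modules, and the abstract is cut off before it says what triggers a state update, so updating once per dialog turn from a question vector is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    """One detected object in one frame (hypothetical structure)."""
    frame: int
    track_id: int          # identity assigned by some upstream tracker
    box: Box
    feature: List[float]   # appearance feature for this detection


def parse_trajectories(detections: List[Detection]) -> Dict[int, List[Detection]]:
    """Group detections by track id into time-ordered object trajectories."""
    trajectories: Dict[int, List[Detection]] = {}
    for det in sorted(detections, key=lambda d: d.frame):
        trajectories.setdefault(det.track_id, []).append(det)
    return trajectories


def summarize(trajectory: List[Detection]) -> List[float]:
    """Mean-pool features over time (a stand-in for a learned encoder)."""
    dim = len(trajectory[0].feature)
    n = len(trajectory)
    return [sum(d.feature[i] for d in trajectory) / n for i in range(dim)]


def update_states(
    states: Dict[int, List[float]],
    summaries: Dict[int, List[float]],
    question_vec: List[float],
    rate: float = 0.5,
) -> Dict[int, List[float]]:
    """One dialog turn (assumed trigger): move each object's state toward
    its trajectory summary, gated by relevance to the question vector."""
    new_states: Dict[int, List[float]] = {}
    for tid, summary in summaries.items():
        score = sum(s * q for s, q in zip(summary, question_vec))
        gate = rate / (1.0 + math.exp(-score))  # rate * sigmoid(score)
        prev = states.get(tid, [0.0] * len(summary))
        new_states[tid] = [(1 - gate) * p + gate * s for p, s in zip(prev, summary)]
    return new_states
```

In this sketch, the states returned for one turn are passed back in as `states` for the next turn, so objects that keep matching the questions accumulate a stronger state over the dialog.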
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
