Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
Ye Zhu, Yu Wu, Yi Yang, and Yan Yan

TL;DR
This paper introduces a multi-modal cooperative dialog task in which one agent must describe a video it has never seen, using only two static frames and a dialog with a second agent that has watched the entire video. The focus is on knowledge transfer between the two agents.
Contribution
The paper proposes the new task together with a QA-Cooperative Network that uses a dynamic dialog history update, so that one agent (Q-BOT) can learn to describe unseen videos through cooperation with the other agent (A-BOT).
Findings
Q-BOT effectively learns to describe unseen videos.
The model achieves promising performance with full dialog history.
Cooperative learning improves video description accuracy.
Abstract
With rising concerns about AI systems that are given direct access to abundant sensitive information, researchers seek to develop more reliable AI that relies on implicit information sources. To this end, in this paper, we introduce a new task called video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames. Specifically, one of the intelligent agents, Q-BOT, is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent, which has already seen the entire video, helps Q-BOT accomplish the goal by providing answers to those questions. We propose a QA-Cooperative Network with a dynamic dialog history update…
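The interaction protocol described in the abstract can be pictured as a simple loop. The sketch below is illustrative only and is not the authors' QA-Cooperative Network: the function names (run_dialog, toy_ask, make_toy_answer, toy_describe), the string-based "frames", and the rule-based toy agents are all assumptions introduced here to show the flow of information, namely that Q-BOT sees only two frames and the dialog, while A-BOT alone has access to the video content.

```python
from typing import Callable, List, Tuple

# A dialog history is a list of (question, answer) pairs.
History = List[Tuple[str, str]]
Frames = Tuple[str, str]  # hypothetical stand-ins for the first and last frame


def run_dialog(
    ask: Callable[[Frames, History], str],
    answer: Callable[[str, History], str],
    describe: Callable[[Frames, History], str],
    frames: Frames,
    num_rounds: int,
) -> Tuple[str, History]:
    """Run a fixed number of question/answer rounds, then describe the video.

    `ask` and `describe` play the role of Q-BOT (two frames plus dialog only);
    `answer` plays the role of A-BOT (which has seen the whole video).
    """
    history: History = []
    for _ in range(num_rounds):
        question = ask(frames, history)
        reply = answer(question, history)
        history.append((question, reply))  # history grows after every round
    return describe(frames, history), history


# --- Toy agents (assumptions, purely for illustration) ---

def toy_ask(frames: Frames, history: History) -> str:
    return f"Question {len(history) + 1}: what happens next?"


def make_toy_answer(video_events: List[str]) -> Callable[[str, History], str]:
    # Only the answering agent closes over the video content.
    def toy_answer(question: str, history: History) -> str:
        idx = len(history)
        return video_events[idx] if idx < len(video_events) else "nothing else"
    return toy_answer


def toy_describe(frames: Frames, history: History) -> str:
    # The description is built from the answers alone.
    return " then ".join(reply for _, reply in history)


events = ["a person opens a door", "they sit down"]
description, history = run_dialog(
    toy_ask, make_toy_answer(events), toy_describe,
    frames=("first_frame.jpg", "last_frame.jpg"), num_rounds=2,
)
print(description)
```

In this toy setup the final description is only as informative as the answers collected during the dialog, which is the dependency the task is designed to exercise. A learned model would replace each toy function with a trained component.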
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
