Saying the Unseen: Video Descriptions via Dialog Agents
Ye Zhu, Yu Wu, Yi Yang, Yan Yan

TL;DR
This paper introduces a novel task in which two dialog agents collaboratively describe a video given incomplete visual data, using natural language questions and answers to compensate for the missing visual information. The dialog serves as a more secure and transparent information medium than full visual input.
Contribution
The paper proposes a new vision-language task involving dialog agents to describe videos with limited visual data, and introduces QA-Cooperative networks for effective knowledge transfer.
Findings
Dialog agents successfully supplement incomplete visual data with natural language.
QA-Cooperative networks enable effective knowledge transfer between agents.
The approach improves video description accuracy under visual data constraints.
Abstract
Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input. However, practical scenarios often include situations where part of the visual information is inaccessible for various reasons, e.g., a restricted view from a fixed camera or an intentional vision block for security concerns. As a step towards more practical application scenarios, we introduce a novel task that aims to describe a video using the natural language dialog between two agents as a supplementary information source given incomplete visual data. Different from most existing vision-language tasks where AI systems have full access to images or video clips, which may reveal sensitive information such as recognizable human faces or voices, we intentionally limit the visual input for AI systems and seek a more secure and transparent information medium, i.e., the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
