Saying the Unseen: Video Descriptions via Dialog Agents

Ye Zhu; Yu Wu; Yi Yang; Yan Yan

arXiv:2106.14069 · cs.CV · June 29, 2021

TL;DR

This paper introduces a task in which two dialog agents collaboratively describe a video from incomplete visual data, using natural-language questions and answers to compensate for the missing visual information. Limiting the visual input is intended to make the system more secure and transparent.

Contribution

The paper proposes a new vision-language task involving dialog agents to describe videos with limited visual data, and introduces QA-Cooperative networks for effective knowledge transfer.

Findings

01. Dialog agents successfully supplement incomplete visual data with natural language.

02. QA-Cooperative networks enable effective knowledge transfer between agents.

03. The approach improves video description accuracy under visual data constraints.

Abstract

Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input; however, practical scenarios often include situations where part of the visual information becomes inaccessible for various reasons, e.g., a restricted view from a fixed camera or an intentional vision block for security concerns. As a step towards more practical application scenarios, we introduce a novel task that aims to describe a video using the natural language dialog between two agents as a supplementary information source given incomplete visual data. Different from most existing vision-language tasks, where AI systems have full access to images or video clips that may reveal sensitive information such as recognizable human faces or voices, we intentionally limit the visual input for AI systems and seek a more secure and transparent information medium, i.e., the…

Code & Models

Repositories

L-YeZhu/Video-Description-via-Dialog-Agents-ECCV2020
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling