Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Ankit P. Shah; Shijie Geng; Peng Gao; Anoop Cherian; Takaaki Hori; Tim K. Marks; Jonathan Le Roux; Chiori Hori

arXiv:2110.06894·cs.CL·October 14, 2021

TL;DR

This paper advances audio-visual scene-aware dialog by introducing a new task requiring temporal reasoning without relying on human descriptions, along with dataset extensions, baseline models, and state-of-the-art results.

Contribution

It proposes a new AVSD task with temporal reasoning, extends the dataset, and develops models with joint student-teacher learning and multimodal fusion.

Findings

1. Achieved state-of-the-art performance on AVSD datasets.

2. Developed two temporal reasoning methods: one attention-based, the other based on a region proposal network.

3. Extended the dataset with human-generated temporal reasoning data.

Abstract

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, we proposed a third AVSD challenge, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task that includes temporal reasoning and our new extension of the AVSD dataset for DSTC10, for which…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques