Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog

Shachi H Kumar; Eda Okur; Saurav Sahay; Jonathan Huang; Lama Nachman

arXiv:1912.10132 · cs.CL · December 27, 2019 · 1 citation

TL;DR

This paper investigates how context, attention mechanisms, and audio features enhance audio-visual scene-aware dialog systems, demonstrating improved performance over baseline models using multimodal grounding and classification techniques.

Contribution

It introduces the integration of conversational topics, multimodal attention, and an audio classification ConvNet into an end-to-end scene-aware dialog system architecture.

Findings

1. Certain model variations outperform the baseline system on the AVSD dataset.

2. Incorporating topics and attention improves the dialog system's understanding.

3. Audio classification makes multimodal grounding more effective.

Abstract

We are witnessing a confluence of vision, speech and dialog system technologies that is enabling intelligent virtual assistants (IVAs) to learn audio-visual groundings of utterances and to hold conversations with users about the objects, activities and events surrounding them. Recent progress in visual grounding techniques and audio understanding is enabling machines to understand shared semantic concepts and listen to the various sensory events in the environment. With audio and visual grounding methods, end-to-end multimodal spoken dialog systems (SDS) are trained to communicate meaningfully with us in natural language about the real, dynamic audio-visual sensory world around us. In this work, we explore the role of 'topics' as the context of the conversation, along with multimodal attention, in such an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet,…
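The abstract names multimodal attention but this page does not give its formulation. As a rough illustration only, the sketch below shows one common way such attention can work: a query vector (for example, an encoding of the user's question) scores each modality's feature vector, and the softmax of those scores weights a fused representation. The function `attend`, the modality names, the scaled dot-product scoring, and all numbers are assumptions made for this example, not the authors' architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, modality_features):
    """Scaled dot-product attention over modalities (hypothetical helper).

    query: list of floats, e.g. an encoded question.
    modality_features: dict mapping a modality name to a vector with
        the same length as the query.
    Returns (weights per modality, fused feature vector).
    """
    names = list(modality_features)
    scale = math.sqrt(len(query))
    scores = [dot(query, modality_features[n]) / scale for n in names]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of modality vectors.
    fused = [
        sum(w * modality_features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Toy inputs, invented for illustration.
query = [1.0, 0.0, 1.0, 0.0]
features = {
    "audio": [1.0, 0.0, 1.0, 0.0],
    "video": [0.0, 1.0, 0.0, 1.0],
    "topic": [0.5, 0.5, 0.5, 0.5],
}
weights, fused = attend(query, features)
```

In this toy setup the audio vector aligns most closely with the query, so it receives the largest weight. In a trained system the query and features would come from learned encoders rather than hand-written vectors.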

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Multimodal Machine Learning Applications
