Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog
Shachi H Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman

TL;DR
This paper investigates how conversational context (topics), attention mechanisms, and audio features affect audio-visual scene-aware dialog systems, and reports that certain model variations outperform the baseline system on the AVSD dataset.
Contribution
It integrates conversational topics, multimodal attention, and an end-to-end audio classification ConvNet (AclNet) into an end-to-end audio-visual scene-aware dialog system architecture.
Findings
Certain model variations outperform the baseline system on the AVSD dataset
Incorporating topics and attention improves dialog system understanding
Audio classification enhances multimodal grounding effectiveness
Abstract
We are witnessing a confluence of vision, speech and dialog system technologies that are enabling intelligent virtual assistants (IVAs) to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. Recent progress in visual grounding techniques and audio understanding is enabling machines to understand shared semantic concepts and listen to the various sensory events in the environment. With audio and visual grounding methods, end-to-end multimodal spoken dialog systems (SDS) are trained to meaningfully communicate with us in natural language about the real dynamic audio-visual sensory world around us. In this work, we explore the role of 'topics' as the context of the conversation, along with multimodal attention, in such an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet,…
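The abstract mentions multimodal attention guided by conversational context. As a rough illustration of the general idea only, and not the architecture used in the paper, the sketch below weights a few modality feature vectors by their scaled dot-product similarity to a query vector and fuses them into one vector. The function name `multimodal_attention`, the modality names, and all toy vectors are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multimodal_attention(query, modality_features):
    """Weight each modality vector by its similarity to the query.

    query: list of floats, e.g. an encoding of the question and topic
        context (an assumption of this sketch).
    modality_features: dict mapping a modality name to a list of floats
        with the same length as query.
    Returns (weights, fused): per-modality attention weights and the
    attention-weighted sum of the modality vectors.
    """
    names = list(modality_features)
    scale = math.sqrt(len(query))
    scores = [dot(query, modality_features[n]) / scale for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * modality_features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Hypothetical toy vectors, not taken from the paper or the AVSD dataset.
query = [1.0, 0.0, 1.0, 0.0]
features = {
    "audio": [0.9, 0.1, 0.8, 0.0],
    "video": [0.1, 0.9, 0.0, 0.7],
    "topic": [0.5, 0.5, 0.5, 0.5],
}
weights, fused = multimodal_attention(query, features)
```

In this toy setup the modality whose features align most closely with the query receives the largest weight. A trained system would learn the query and feature encoders rather than use fixed vectors.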
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Test
