Where and When: Space-Time Attention for Audio-Visual Explanations
Yanbei Chen, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

TL;DR
This paper introduces a space-time attention network for audio-visual recognition that provides localized explanations of when and where relevant cues occur, improving understanding and performance in multi-modal video event detection.
Contribution
It presents a novel learnable explanation method using space-time attention for dynamic multi-modal data, advancing explainability in audio-visual recognition.
Findings
Outperforms existing methods on three audio-visual datasets.
Demonstrates superior accuracy in video event recognition.
Provides explainability through localization and temporal attribution.
Abstract
Explaining the decision of a multi-modal decision-maker requires determining the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to demystify the mysterious dynamics of a complex multi-modal model. In this work, we take a crucial step forward and explore learnable explanations for audio-visual recognition. Specifically, we propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time. Our model is capable of predicting the audio-visual video events, while justifying its decision by localizing where the relevant visual cues appear, and when the predicted sounds occur in videos. We benchmark our model on three audio-visual video event datasets,…
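To make the "where" and "when" idea concrete, here is a minimal NumPy sketch of attention pooling over space and then over time. It is an illustration under stated assumptions, not the authors' architecture: the function name `space_time_attention`, the additive audio-visual fusion, the single-vector scoring weights `w_space` and `w_time`, and all tensor shapes are hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def space_time_attention(visual, audio, w_space, w_time):
    """Hypothetical sketch of space-time attention pooling.

    visual:  (T, H, W, D) per-frame visual feature maps
    audio:   (T, D) per-time-step audio features
    w_space: (D,) scoring vector for spatial locations (assumed)
    w_time:  (D,) scoring vector for time steps (assumed)
    """
    T, H, W, D = visual.shape

    # "Where": score every spatial location and normalise within each frame.
    space_scores = (visual @ w_space).reshape(T, H * W)
    space_attn = softmax(space_scores, axis=1).reshape(T, H, W)

    # Pool each frame's feature map with its spatial attention map -> (T, D).
    frame_feats = (space_attn[..., None] * visual).sum(axis=(1, 2))

    # Assumed fusion: add the audio feature to the attended visual feature.
    fused = frame_feats + audio

    # "When": score each time step and normalise across time -> (T,).
    time_attn = softmax(fused @ w_time, axis=0)

    # Pool over time to obtain one clip-level feature -> (D,).
    video_feat = (time_attn[:, None] * fused).sum(axis=0)
    return video_feat, space_attn, time_attn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, H, W, D = 4, 3, 3, 8
    feat, where, when = space_time_attention(
        rng.normal(size=(T, H, W, D)),
        rng.normal(size=(T, D)),
        rng.normal(size=D),
        rng.normal(size=D),
    )
    print(feat.shape, where.shape, when.shape)
```

In this sketch the spatial maps `where` can be read as per-frame localization and the temporal weights `when` as temporal attribution; a classifier on `video_feat` would produce the event prediction.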
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception
