CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa

TL;DR
This paper introduces CAD, a novel multi-modal alignment network for AVQA that improves spatial, temporal, and semantic alignment of audio-visual data, leading to significant performance gains over existing methods.
Contribution
The paper presents a comprehensive end-to-end framework with novel alignment modules for AVQA, including a parameter-free stochastic block, a self-supervised pre-training technique, and a cross-attention mechanism.
Findings
Achieves a 9.4% improvement over existing methods on the MUSIC-AVQA dataset.
Enhances spatial, temporal, and semantic alignment in AVQA.
Can be integrated into existing methods to boost performance.
Abstract
In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities can be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network is not aligned on the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic information is often not balanced within a context. This results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on the Temporal level in a self-supervised setting; and iii) introducing a cross-attention…
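The abstract is cut off before it describes the cross-attention component, so the sketch below is only a generic illustration of cross-attention between two modalities, not the CAD design. It assumes plain scaled dot-product attention without learned projections; the function names, the feature shapes (10 segments, 64 dimensions), and the random inputs are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Generic scaled dot-product cross-attention (assumed form, not CAD's).

    queries:     (T_q, d) features of the attending modality
    keys_values: (T_kv, d) features of the attended modality
    Returns the attended features (T_q, d) and the weights (T_q, T_kv).
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights


# Hypothetical per-segment features for one clip.
rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 64))
visual = rng.normal(size=(10, 64))

# Each modality attends to the other.
audio_ctx, w_av = cross_attention(audio, visual)
visual_ctx, w_va = cross_attention(visual, audio)
```

In this sketch every audio segment receives a weighted mix of visual features and vice versa; a real model would normally add learned query, key, and value projections and multiple heads.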
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Multimodal Machine Learning Applications
