CAD -- Contextual Multi-modal Alignment for Dynamic AVQA

Asmar Nadeem; Adrian Hilton; Robert Dawes; Graham Thomas; Armin Mustafa

arXiv:2310.16754 · cs.CV · October 31, 2023 · 1 citation

TL;DR

This paper introduces CAD, a novel multi-modal alignment network for AVQA that improves spatial, temporal, and semantic alignment of audio-visual data, leading to significant performance gains over existing methods.

Contribution

The paper presents an end-to-end framework with novel alignment modules for AVQA: a parameter-free stochastic Contextual block for spatial alignment, a self-supervised pre-training technique for temporal alignment, and a cross-attention mechanism.
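
This page does not describe how the paper's cross-attention mechanism is built. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention, in which tokens from one modality query tokens from the other. It assumes no learned projections, and the function names, the audio/visual role assignment, and the toy feature values are all hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (no learned projections).

    queries: feature vectors from one modality (here: audio segments).
    keys, values: feature vectors from the other modality (here: video frames).
    Returns one attended vector per query, plus the attention weights.
    This is an assumption-laden sketch, not the CAD architecture.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        # Weighted average of the value vectors.
        attended = [
            sum(a * v[j] for a, v in zip(attn, values))
            for j in range(len(values[0]))
        ]
        outputs.append(attended)
        weights.append(attn)
    return outputs, weights


# Toy features (hypothetical values): 2 audio tokens, 3 visual tokens.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
attended, attn = cross_attention(audio, visual, visual)
```

Each row of `attn` is a probability distribution over the visual tokens, so every audio token receives a convex combination of visual features. A real model would typically add learned query, key, and value projections and multiple heads.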

Findings

01. Improves overall performance over state-of-the-art methods by an average of 9.4% on the MUSIC-AVQA dataset.

02. Enhances spatial, temporal, and semantic alignment of audio-visual data in AVQA.

03. Can be integrated into existing AVQA methods to boost their performance.

Abstract

In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Multimodal Machine Learning Applications