Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

Mahrukh Awan; Asmar Nadeem; Muhammad Junaid Awan; Armin Mustafa; Syed Sameed Husain

arXiv:2408.14441·cs.CV·August 27, 2024

Open Access

TL;DR

Attend-Fusion introduces a compact audio-visual fusion model for video classification that maintains high accuracy while significantly reducing computational complexity, enabling efficient deployment in resource-constrained environments.

Contribution

The paper presents a novel, efficient AV fusion architecture that achieves comparable performance to larger models with substantially fewer parameters.

Findings

1. Attend-Fusion achieves a 75.64% F1 score on YouTube-8M with 72M parameters.

2. It reduces model size by nearly 80% compared with the 341M-parameter Fully-Connected Late Fusion baseline.

3. The approach demonstrates effective audio-visual integration for resource-efficient video classification.
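The size and accuracy trade-off in these findings can be checked with simple arithmetic. The sketch below uses only the figures quoted in the abstract (72M vs. 341M parameters, 75.64% vs. 75.96% F1); the constant and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract (YouTube-8M). Names are illustrative only.
ATTEND_FUSION_PARAMS_M = 72    # Attend-Fusion, millions of parameters
BASELINE_PARAMS_M = 341        # Fully-Connected Late Fusion, millions of parameters
ATTEND_FUSION_F1 = 75.64       # F1 score, percent
BASELINE_F1 = 75.96            # F1 score, percent


def relative_reduction(small: float, large: float) -> float:
    """Return the fractional size reduction when going from `large` to `small`."""
    return 1.0 - small / large


param_reduction = relative_reduction(ATTEND_FUSION_PARAMS_M, BASELINE_PARAMS_M)
f1_gap = BASELINE_F1 - ATTEND_FUSION_F1

print(f"Parameter reduction: {param_reduction:.1%}")
print(f"F1 gap: {f1_gap:.2f} percentage points")
```

Under these numbers the reduction works out to roughly 79%, which is consistent with the "nearly 80%" claim, at a cost of about a third of an F1 point.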

Abstract

Exploiting both audio and visual modalities for video classification is a challenging task, as existing methods require large model architectures, leading to high computational complexity and resource requirements. Smaller architectures, on the other hand, struggle to achieve optimal performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion achieves similar performance to the larger baseline model while reducing the…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection