Efficient Audio-Visual Fusion for Video Classification

Mahrukh Awan; Asmar Nadeem; Armin Mustafa

arXiv:2411.05603 · cs.CV · November 11, 2024

Open Access

TL;DR

Attend-Fusion is a new efficient audio-visual fusion method for video classification that balances high performance with low model complexity, validated on YouTube-8M.

Contribution

The paper introduces Attend-Fusion, a fusion approach that combines the audio and visual modalities within a compact model architecture.

Findings

01. Achieves competitive accuracy on YouTube-8M
02. Reduces model complexity significantly
03. Maintains performance with fewer parameters

Abstract

We present Attend-Fusion, a novel and efficient approach for audio-visual fusion in video classification tasks. Our method addresses the challenge of exploiting both audio and visual modalities while maintaining a compact model architecture. Through extensive experiments on the YouTube-8M dataset, we demonstrate that our Attend-Fusion achieves competitive performance with significantly reduced model complexity compared to larger baseline models.
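The abstract does not spell out the architecture, so the following is only a minimal sketch of what attention-weighted audio-visual fusion in a compact model could look like, not the authors' Attend-Fusion implementation. Every name here (`init_params`, `attention_fuse`, `count_parameters`) and the hidden size are hypothetical. The feature sizes of 1024 (visual) and 128 (audio) follow the pre-extracted features distributed with YouTube-8M, and the label count of 3862 is assumed for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d_visual=1024, d_audio=128, d_hidden=256, n_classes=3862):
    """Randomly initialise a small fusion model (all sizes are assumptions)."""
    scale = 0.02
    return {
        "W_v": rng.normal(0.0, scale, (d_visual, d_hidden)),   # visual projection
        "W_a": rng.normal(0.0, scale, (d_audio, d_hidden)),    # audio projection
        "w_att": rng.normal(0.0, scale, (d_hidden,)),          # attention scorer
        "W_out": rng.normal(0.0, scale, (d_hidden, n_classes)),
        "b_out": np.zeros(n_classes),
    }


def attention_fuse(params, visual, audio):
    """Fuse per-video visual and audio features with modality-level attention.

    visual: (batch, d_visual), audio: (batch, d_audio).
    Returns per-class probabilities and the attention weights per modality.
    """
    # Project both modalities into a shared hidden space.
    h_v = np.tanh(visual @ params["W_v"])            # (batch, d_hidden)
    h_a = np.tanh(audio @ params["W_a"])             # (batch, d_hidden)
    stacked = np.stack([h_v, h_a], axis=1)           # (batch, 2, d_hidden)

    # Score each modality, then normalise the scores across the two modalities.
    scores = stacked @ params["w_att"]               # (batch, 2)
    weights = softmax(scores, axis=1)                # rows sum to 1

    # The attention-weighted sum gives one fused vector per video.
    fused = (weights[:, :, None] * stacked).sum(axis=1)   # (batch, d_hidden)

    # Multi-label classification head: independent sigmoid per class.
    logits = fused @ params["W_out"] + params["b_out"]
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, weights


def count_parameters(params):
    """Total number of scalar parameters, a simple proxy for model complexity."""
    return sum(p.size for p in params.values())


rng = np.random.default_rng(0)
params = init_params(rng)
visual = rng.normal(size=(4, 1024))   # stand-in for video-level visual features
audio = rng.normal(size=(4, 128))     # stand-in for video-level audio features
probs, weights = attention_fuse(params, visual, audio)
print(probs.shape, weights.shape, count_parameters(params))
```

In this sketch the parameter count is dominated by the classification head and the visual projection, so shrinking the assumed `d_hidden` is the most direct way to trade accuracy for model size.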

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection