Efficient Audio-Visual Fusion for Video Classification
Mahrukh Awan, Asmar Nadeem, Armin Mustafa

TL;DR
Attend-Fusion is a new efficient audio-visual fusion method for video classification that balances high performance with low model complexity, validated on YouTube-8M.
Contribution
The paper introduces Attend-Fusion, a fusion approach that combines audio and visual data in a compact model architecture.
Findings
Achieves competitive accuracy on YouTube-8M
Reduces model complexity significantly
Maintains performance with fewer parameters
Abstract
We present Attend-Fusion, a novel and efficient approach for audio-visual fusion in video classification tasks. Our method addresses the challenge of exploiting both audio and visual modalities while maintaining a compact model architecture. Through extensive experiments on the YouTube-8M dataset, we demonstrate that Attend-Fusion achieves competitive performance at significantly lower model complexity than larger baseline models.
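The abstract does not spell out the architecture, so the following is only an illustrative sketch of what a compact attention-based audio-visual fusion classifier could look like, not the authors' model. It assumes YouTube-8M-style frame-level inputs (1024-d visual and 128-d audio features); the hidden size, class count, attention-pooling design, concatenation-based fusion, and all function names (`init_params`, `attend`, `attend_fuse`, `count_params`) are hypothetical choices made for this example.

```python
import numpy as np

# Assumed dimensions: YouTube-8M ships 1024-d visual and 128-d audio
# frame-level features; HIDDEN and NUM_CLASSES are illustrative choices.
VISUAL_DIM, AUDIO_DIM, HIDDEN, NUM_CLASSES = 1024, 128, 256, 10


def init_params(rng):
    """Randomly initialise the weights of the hypothetical fusion model."""
    def dense(n_in, n_out):
        return rng.normal(0.0, 0.02, size=(n_in, n_out))
    return {
        "proj_v": dense(VISUAL_DIM, HIDDEN),    # visual projection
        "proj_a": dense(AUDIO_DIM, HIDDEN),     # audio projection
        "attn_v": dense(HIDDEN, 1),             # visual attention scorer
        "attn_a": dense(HIDDEN, 1),             # audio attention scorer
        "out": dense(2 * HIDDEN, NUM_CLASSES),  # classifier on fused vector
    }


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(frames, proj, scorer):
    """Attention-pool (batch, time, dim) frames into (batch, HIDDEN)."""
    h = np.tanh(frames @ proj)             # (batch, time, HIDDEN)
    weights = softmax(h @ scorer, axis=1)  # (batch, time, 1), sums to 1 over time
    return (weights * h).sum(axis=1)       # weighted average over time


def attend_fuse(params, visual, audio):
    """Return per-class probabilities of shape (batch, NUM_CLASSES)."""
    v = attend(visual, params["proj_v"], params["attn_v"])
    a = attend(audio, params["proj_a"], params["attn_a"])
    fused = np.concatenate([v, a], axis=-1)  # late fusion by concatenation
    logits = fused @ params["out"]
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid: multi-label output


def count_params(params):
    """Total number of weights, a simple proxy for model complexity."""
    return sum(w.size for w in params.values())
```

Calling `attend_fuse(params, visual, audio)` with arrays of shape `(batch, time, 1024)` and `(batch, time, 128)` yields one probability per class, and `count_params(params)` gives a quick way to compare the size of such a sketch against a larger baseline. Sigmoid outputs are used here because YouTube-8M is a multi-label dataset.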
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection
