An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification
Lam Pham, Dat Ngo, Phu X. Nguyen, Truong Hoang, Alexander Schindler

TL;DR
This paper introduces a new audio-visual dataset of five crowded scene types collected from in-the-wild YouTube videos and proposes deep learning frameworks for classifying them; an ensemble of audio-based and visual-based models reaches 95.7% accuracy.
Contribution
The paper provides a novel dataset of crowded scenes and develops deep learning models for audio-visual scene classification, reaching a best accuracy of 95.7% by fusing the outputs of the audio and visual models.
Findings
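Audio and visual inputs each contribute independently to classification performance.
An ensemble of models, each using either audio or visual input, reaches 95.7% accuracy.
The dataset covers five crowded scene classes drawn from in-the-wild YouTube videos, supporting scene classification research under real-world conditions.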
Abstract
This paper presents a task of audio-visual scene classification (SC) where input videos are classified into one of five real-life crowded scenes: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'. To this end, we first collect an audio-visual dataset (videos) of these five crowded contexts from YouTube (in-the-wild scenes). We then propose a wide range of deep learning frameworks, each of which uses either audio or visual input data independently. Finally, results obtained from the high-performing deep learning frameworks are fused to achieve the best accuracy score. Our experimental results indicate that audio and visual input factors independently contribute to the SC task's performance. Significantly, an ensemble of deep learning frameworks exploring either audio or visual input data can achieve the best accuracy of 95.7%.
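The fusion step mentioned in the abstract can be illustrated with a minimal late-fusion sketch. The abstract does not say which fusion rule is used, so this assumes a simple mean of per-class probabilities from one audio model and one visual model. The function names (`fuse_mean`, `predict`) and the probability values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical late-fusion sketch: average per-class probabilities from
# separate audio and visual classifiers, then pick the top-scoring class.
# The mean rule is an assumption; the paper's actual fusion rule may differ.

CLASSES = ["Riot", "Noise-Street", "Firework-Event", "Music-Event", "Sport-Atmosphere"]


def fuse_mean(prob_vectors):
    """Average several per-class probability vectors (one per model)."""
    if not prob_vectors:
        raise ValueError("need at least one probability vector")
    n = len(prob_vectors[0])
    if any(len(p) != n for p in prob_vectors):
        raise ValueError("all probability vectors must have the same length")
    return [sum(p[i] for p in prob_vectors) / len(prob_vectors) for i in range(n)]


def predict(prob_vectors, classes=CLASSES):
    """Return the predicted class label and the fused probability vector."""
    fused = fuse_mean(prob_vectors)
    best = max(range(len(fused)), key=fused.__getitem__)
    return classes[best], fused


# Made-up probabilities for one video (not results from the paper).
audio_probs = [0.10, 0.15, 0.40, 0.30, 0.05]   # audio model favours Firework-Event
visual_probs = [0.05, 0.10, 0.25, 0.55, 0.05]  # visual model favours Music-Event

label, fused = predict([audio_probs, visual_probs])
print(label)
```

In this toy case the two models disagree, and the averaged scores decide the final label. Other common fusion rules (product of probabilities, weighted mean, max) would slot into `fuse_mean` in the same way.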
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
