An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification

Lam Pham; Dat Ngo; Phu X. Nguyen; Truong Hoang; Alexander Schindler

arXiv:2112.09172·cs.CV·December 20, 2021

TL;DR

This paper introduces a new audio-visual dataset of crowded scenes and proposes deep learning frameworks for classifying these scenes, demonstrating that combining audio and visual data yields high accuracy in real-world conditions.

Contribution

The paper provides a novel dataset of crowded scenes and develops deep learning models for audio-visual scene classification, reaching a best accuracy of 95.7% by fusing the outputs of audio and visual models.

Findings

01. Audio and visual data independently improve classification performance.

02. Ensemble models combining audio and visual inputs reach 95.7% accuracy.

03. The dataset enables real-world scene classification research.

Abstract

This paper presents an audio-visual scene classification (SC) task in which input videos are classified into one of five real-life crowded scenes: 'Riot', 'Noise-Street', 'Firework-Event', 'Music-Event', and 'Sport-Atmosphere'. To this end, we first collect an audio-visual dataset (videos) of these five crowded contexts from YouTube (in-the-wild scenes). Then, a wide range of deep learning frameworks is proposed, each using either audio or visual input data independently. Finally, the results from the best-performing frameworks are fused to achieve the best accuracy score. Our experimental results indicate that the audio and visual inputs each contribute independently to SC performance. Notably, an ensemble of deep learning frameworks, each exploring either audio or visual input data, achieves the best accuracy of 95.7%.
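
The fusion step described in the abstract can be illustrated with a minimal late-fusion sketch. The abstract does not state which fusion rule the authors use, so averaging per-class probabilities is an assumption made here for illustration. The function names `mean_fusion` and `predict`, and the probability values, are hypothetical; only the five class labels come from the abstract.

```python
from typing import List, Sequence

# Class labels as listed in the abstract.
CLASSES: List[str] = [
    "Riot",
    "Noise-Street",
    "Firework-Event",
    "Music-Event",
    "Sport-Atmosphere",
]


def mean_fusion(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models.

    Assumption: mean fusion is used here only as an example rule;
    the source does not specify the rule.
    """
    if not prob_vectors:
        raise ValueError("need at least one probability vector")
    n_classes = len(prob_vectors[0])
    if any(len(p) != n_classes for p in prob_vectors):
        raise ValueError("all probability vectors must have the same length")
    n_models = len(prob_vectors)
    return [sum(p[i] for p in prob_vectors) / n_models for i in range(n_classes)]


def predict(prob_vectors: Sequence[Sequence[float]]) -> str:
    """Return the class label with the highest fused probability."""
    fused = mean_fusion(prob_vectors)
    best_index = max(range(len(fused)), key=fused.__getitem__)
    return CLASSES[best_index]


# Hypothetical softmax outputs for one video (made-up numbers, not paper results).
audio_probs = [0.10, 0.15, 0.05, 0.60, 0.10]   # audio model favours 'Music-Event'
visual_probs = [0.05, 0.10, 0.05, 0.30, 0.50]  # visual model favours 'Sport-Atmosphere'

fused_label = predict([audio_probs, visual_probs])
print(fused_label)
```

In this made-up case the two models disagree, and the fused decision follows the model that is more confident. Other common late-fusion rules, such as taking the per-class product or maximum, would slot into `mean_fusion` in the same way.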

Code & Models

Repositories

phamdanglam1986/An-application-demo-of-audio-visual-crowded-scene-classification-
none · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization