Deep Learning Frameworks Applied For Audio-Visual Scene Classification

Lam Pham; Alexander Schindler; Mina Schütz; Jasmin Lampert; Sven Schlarb; Ross King

arXiv:2106.06840·cs.SD·June 17, 2021

TL;DR

This paper develops deep learning frameworks for audio-visual scene classification and shows that combining audio and visual features improves accuracy on the DCASE Task 1B development dataset, with an ensemble of audio-based and visual-based frameworks achieving the best result (93.9%).

Contribution

Introduces deep learning frameworks for audio-visual scene classification and shows how individual audio and visual features, and their combination, affect performance on the DCASE Task 1B development dataset.

Findings

1. Audio-only accuracy: 82.2%

2. Visual-only accuracy: 91.1%

3. Audio-visual ensemble accuracy: 93.9%

Abstract

In this paper, we present deep learning frameworks for audio-visual scene classification (SC) and show how individual visual and audio features, as well as their combination, affect SC performance. Our extensive experiments, conducted on the DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B development dataset, achieve best classification accuracies of 82.2%, 91.1%, and 93.9% with audio-only, visual-only, and combined audio-visual input, respectively. The highest classification accuracy of 93.9%, obtained from an ensemble of audio-based and visual-based frameworks, is an improvement of 16.5% over the DCASE baseline.
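
The abstract reports that the best result comes from an ensemble of audio-based and visual-based frameworks but does not state the fusion rule. As an illustration only, the sketch below shows one common way to build such an ensemble: late fusion by a weighted average of each model's per-class probabilities. The function name `late_fusion`, the `audio_weight` parameter, and the toy probabilities are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def late_fusion(audio_probs, visual_probs, audio_weight=0.5):
    """Fuse two models' class probabilities by weighted averaging.

    Assumption: both inputs have shape (n_samples, n_classes) and each
    row is a probability distribution over the same scene classes.
    Returns the fused probabilities and the predicted class indices.
    """
    audio_probs = np.asarray(audio_probs, dtype=float)
    visual_probs = np.asarray(visual_probs, dtype=float)
    if audio_probs.shape != visual_probs.shape:
        raise ValueError("probability arrays must have the same shape")
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * visual_probs
    return fused, fused.argmax(axis=-1)


# Toy data (hypothetical): two clips, three scene classes.
audio = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
visual = np.array([[0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7]])

fused, labels = late_fusion(audio, visual)
print(labels)  # [1 2]
```

In this toy case the audio model alone would predict classes 0 and 1, while the fused scores follow the more confident visual predictions. Other fusion rules (product of probabilities, max, or a learned combiner) fit the same interface by changing the line that computes `fused`.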

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection