Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions

Shanshan Wang; Toni Heittola; Annamaria Mesaros; Tuomas Virtanen

arXiv:2105.13675 · eess.AS · July 21, 2021 · 5 cites

Open Access

TL;DR

This paper analyzes the submissions to the DCASE 2021 Challenge task on audio-visual scene classification and finds that the top systems rely on large pretrained models, transfer learning, data augmentation, and multi-modal audio-visual approaches.

Contribution

It provides a detailed analysis of the top-performing methods and techniques used in the challenge, emphasizing the importance of multi-modal data and transfer learning.

Findings

01 The best system achieved a logloss of 0.195 and an accuracy of 93.8%, outperforming the baseline.

02 Large pretrained models such as ResNet and EfficientNet are common among the top entries.

03 All of the top 5 teams used multi-modal methods that combine audio and video.

Abstract

This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. This task has attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems have better performance than the baseline. The common techniques among the top systems are the usage of large pretrained models such as ResNet or EfficientNet which are trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation techniques are also employed to boost the performance. More importantly, multi-modal methods using both audio and video are employed by all the top 5 teams. The best system among all achieved a logloss of 0.195 and accuracy of 93.8%, compared to…
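The two figures quoted in the abstract, logloss and accuracy, can both be computed from a system's per-class probabilities. The sketch below assumes the standard multi-class definition of log loss (the mean negative log-probability assigned to the correct class); the function names, the clipping constant `eps`, and the toy three-class data are hypothetical and are not taken from the challenge or its evaluation code.

```python
import math

def multiclass_logloss(labels, probs, eps=1e-15):
    """Mean negative log-probability of the correct class (lower is better)."""
    total = 0.0
    for label, p in zip(labels, probs):
        # Clip to avoid log(0) when a system outputs a hard 0 or 1.
        p_correct = min(max(p[label], eps), 1.0 - eps)
        total -= math.log(p_correct)
    return total / len(labels)

def accuracy(labels, probs):
    """Fraction of clips whose highest-probability class is the true class."""
    hits = 0
    for label, p in zip(labels, probs):
        predicted = max(range(len(p)), key=lambda i: p[i])
        hits += int(predicted == label)
    return hits / len(labels)

# Toy data (hypothetical): three clips, three scene classes.
labels = [0, 1, 2]
probs = [
    [0.8, 0.1, 0.1],  # confident and correct
    [0.2, 0.7, 0.1],  # correct
    [0.5, 0.3, 0.2],  # wrong: true class 2 only gets 0.2
]

print("logloss :", multiclass_logloss(labels, probs))
print("accuracy:", accuracy(labels, probs))
```

Log loss penalizes confident wrong predictions much more heavily than accuracy does, which is one reason the two numbers are worth reading together.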

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection

Methods: Pointwise Convolution · Depthwise Convolution · Residual Connection · Global Average Pooling · Kaiming Initialization · Sigmoid Activation · Depthwise Separable Convolution · Residual Block · Bottleneck Residual Block · Max Pooling