Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions
Shanshan Wang, Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

TL;DR
This paper analyzes the submissions to the audio-visual scene classification task of the DCASE 2021 Challenge and finds that the top systems rely on large pretrained models, transfer learning, data augmentation, and multi-modal (audio plus video) approaches.
Contribution
It provides a detailed analysis of the top-performing methods and techniques used in the challenge, emphasizing the importance of multi-modal data and transfer learning.
Findings
The best system achieved a logloss of 0.195 and an accuracy of 93.8%; more than half of the submitted systems outperformed the baseline.
Use of large pretrained models like ResNet and EfficientNet is common among top entries.
All top 5 teams employed multi-modal audio-visual methods.
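One simple way to combine the two modalities is late fusion, where each modality's classifier outputs class probabilities and the two distributions are averaged. The sketch below is illustrative only: the summary does not say which fusion strategy any team used, and the function name `late_fusion`, the `audio_weight` parameter, and the toy probabilities are all assumptions made for this example.

```python
def late_fusion(audio_probs, video_probs, audio_weight=0.5):
    """Weighted average of two class-probability vectors (hypothetical helper).

    audio_probs, video_probs: per-class probabilities from two separate
    classifiers, in the same class order.
    audio_weight: share given to the audio model; the video model
    receives 1 - audio_weight.
    """
    if len(audio_probs) != len(video_probs):
        raise ValueError("both modalities must score the same classes")
    w = audio_weight
    fused = [w * a + (1.0 - w) * v for a, v in zip(audio_probs, video_probs)]
    # Renormalize so the result is a valid distribution even if the
    # inputs did not sum exactly to 1.
    total = sum(fused)
    return [f / total for f in fused]


# Toy example with three scene classes (made-up numbers).
audio = [0.6, 0.3, 0.1]
video = [0.2, 0.7, 0.1]
fused = late_fusion(audio, video)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

In this toy case the audio model alone would pick class 0, while the fused distribution picks class 1, which shows how a second modality can change the decision.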
Abstract
This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. This task has attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems have better performance than the baseline. The common techniques among the top systems are the usage of large pretrained models such as ResNet or EfficientNet which are trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation techniques are also employed to boost the performance. More importantly, multi-modal methods using both audio and video are employed by all the top 5 teams. The best system among all achieved a logloss of 0.195 and accuracy of 93.8%, compared to…
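The two reported metrics can be illustrated with a short sketch. It assumes that logloss means the usual multiclass cross-entropy (the mean negative log of the probability assigned to the true class) and that accuracy is the fraction of clips whose highest-probability class matches the label. The function names and toy data are hypothetical and are not taken from the challenge's evaluation code.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) when a model assigns zero probability.
        p_true = min(max(p[label], eps), 1.0)
        total -= math.log(p_true)
    return total / len(y_true)


def accuracy(y_true, probs):
    """Fraction of samples whose argmax class equals the true label."""
    correct = 0
    for label, p in zip(y_true, probs):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(y_true)


# Toy data: three clips, three scene classes (made-up numbers).
y_true = [0, 1, 2]
probs = [
    [0.8, 0.1, 0.1],  # correct and confident
    [0.2, 0.7, 0.1],  # correct
    [0.3, 0.4, 0.3],  # wrong: true class 2 gets only 0.3
]
print(log_loss(y_true, probs), accuracy(y_true, probs))
```

Unlike accuracy, logloss penalizes low confidence in the true class even when the prediction is correct, so the two metrics can rank systems differently.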
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection
Methods: Pointwise Convolution · Depthwise Convolution · Residual Connection · Global Average Pooling · Kaiming Initialization · Sigmoid Activation · Depthwise Separable Convolution · Residual Block · Bottleneck Residual Block · Max Pooling
