Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy

Chengxin Chen; Meng Wang; Pengyuan Zhang

arXiv:2204.11420 · cs.CV · April 26, 2022 · 1 citation

TL;DR

This paper introduces a joint training framework for audio-visual scene classification that directly optimizes acoustic and visual features together, outperforming traditional pipeline methods and achieving state-of-the-art results.

Contribution

The study proposes a novel joint optimization strategy for AVSC that integrates acoustic and visual encoders during training, improving performance over existing pipeline approaches.

Findings

01 Achieves a log loss of 0.1517 on the test set.

02 Attains an accuracy of 94.59% on the test fold.

03 Outperforms previous state-of-the-art methods.
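The two metrics reported in the findings above can be illustrated with a small sketch. It assumes the usual definitions: log loss as the mean negative log-probability assigned to the true class, and accuracy as the fraction of samples whose highest-probability class is the true one. The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math

def log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability of the true class (multi-class log loss)."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip to eps so a zero probability does not produce log(0).
        total += -math.log(max(p[y], eps))
    return total / len(labels)

def accuracy(probs, labels):
    """Fraction of samples whose arg-max class equals the true label."""
    correct = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)
        correct += int(predicted == y)
    return correct / len(labels)

# Toy predictions for four samples over three classes (illustrative only).
probs = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.30, 0.30, 0.40],
    [0.60, 0.30, 0.10],
]
labels = [0, 1, 2, 1]

toy_log_loss = log_loss(probs, labels)
toy_accuracy = accuracy(probs, labels)
print(f"log loss = {toy_log_loss:.4f}, accuracy = {toy_accuracy:.2%}")
```

Log loss penalizes confident wrong predictions heavily, so a system can reach high accuracy and still score a poor log loss if its probabilities are badly calibrated. That is why the two figures are reported separately.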

Abstract

Recently, audio-visual scene classification (AVSC) has attracted increasing attention from multidisciplinary communities. Previous studies tended to adopt a pipeline training strategy, which first uses well-trained visual and acoustic encoders to extract high-level representations (embeddings), then uses those embeddings to train the audio-visual classifier. The extracted embeddings are therefore well suited to uni-modal classifiers, but not necessarily to multi-modal ones. In this paper, we propose a joint training framework that takes the acoustic features and raw images directly as inputs for the AVSC task. Specifically, we take the bottom layers of pre-trained image models as the visual encoder, and jointly optimize the scene classifier and a 1D-CNN-based acoustic encoder during training. We evaluate the approach on the development dataset of TAU Urban Audio-Visual Scenes 2021. The…
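The joint training idea described in the abstract can be sketched roughly as follows, using PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, layer sizes, and input shapes are hypothetical, the small `VisualEncoder` is only a stand-in for the bottom layers of a real pre-trained image model, and updating the visual encoder together with the other parts is an assumption, since the truncated abstract does not say whether it is frozen.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """1D-CNN over acoustic features shaped (batch, n_mels, frames)."""
    def __init__(self, n_mels=64, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=3, padding=1),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # average over the time axis
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # (batch, dim)

class VisualEncoder(nn.Module):
    """Hypothetical stand-in for the bottom layers of a pre-trained image model."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # average over spatial positions
        )

    def forward(self, x):
        return self.net(x).flatten(1)  # (batch, dim)

class JointAVSC(nn.Module):
    """Both encoders feed one classifier, so one loss reaches every part."""
    def __init__(self, n_classes=10, dim=128):
        super().__init__()
        self.audio = AcousticEncoder(dim=dim)
        self.visual = VisualEncoder(dim=dim)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, feats, images):
        fused = torch.cat([self.audio(feats), self.visual(images)], dim=1)
        return self.classifier(fused)

torch.manual_seed(0)
model = JointAVSC()
# Assumption: all parameters, including the visual encoder, are updated.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random stand-in batch: 4 clips of acoustic features, images, and labels.
feats = torch.randn(4, 64, 100)
images = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))

# One joint training step: a single cross-entropy loss on the fused logits.
logits = model(feats, images)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The contrast with a pipeline strategy is that the encoders here are not trained separately and then fixed before the classifier sees their embeddings. The gradient of the multi-modal loss flows back into the encoders, so their representations can adapt to the fused task.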

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection