Revisiting Pre-training in Audio-Visual Learning

Ruoxuan Feng; Wenke Xia; Di Hu

arXiv:2302.03533 · cs.CV · February 20, 2023 · 5 cites

TL;DR

This paper investigates the effectiveness of pre-trained models in complex audio-visual learning scenarios, identifies issues like dead channels and negative effects of strong uni-modal encoders, and proposes novel strategies to enhance their utilization.

Contribution

It introduces Adaptive Batchnorm Re-initialization (ABRi) and a two-stage Fusion Tuning strategy to improve pre-trained model performance in multi-modal audio-visual tasks.

Findings

01. ABRi mitigates dead channel problems in cross-modal initialization.
02. Fusion Tuning enhances cooperation between uni-modal encoders.
03. Proposed methods boost audio-visual learning performance.

Abstract

Pre-training has achieved tremendous success in enhancing model performance on various tasks, but it has been found to perform worse than training from scratch in some uni-modal situations. This inspires us to ask: are pre-trained models always effective in the more complex multi-modal scenario, especially for heterogeneous modalities such as audio and visual ones? We find that the answer is no. Specifically, we explore the effects of pre-trained models in two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning. When cross-modal initialization is applied, the "dead channel" phenomenon, caused by abnormal Batchnorm parameters, hinders the utilization of model capacity. Thus, we propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks. In multi-modal joint learning, we find a…
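
The abstract attributes "dead channels" to abnormal Batchnorm parameters but does not describe how ABRi itself works. The following is therefore only a minimal sketch of the symptom, not the paper's method. It assumes that each Batchnorm layer is followed by a ReLU (an assumption not stated here), so that a channel with a very small scale and a strongly negative shift outputs zero for every input. The helper names `batchnorm_relu` and `dead_channels`, the toy batch, and the parameter values are all hypothetical, and the "fix" shown is a plain reset to the usual defaults (scale 1, shift 0), which stands in for ABRi only for illustration.

```python
import math


def batchnorm_relu(x, gamma, beta, eps=1e-5):
    """Per-channel batch normalization followed by ReLU.

    x is a list of samples; each sample is a list of per-channel values.
    gamma and beta are the per-channel scale and shift.
    """
    n = len(x)
    channels = len(x[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        col = [row[c] for row in x]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i, v in enumerate(col):
            normed = (v - mean) / math.sqrt(var + eps)
            out[i][c] = max(0.0, gamma[c] * normed + beta[c])
    return out


def dead_channels(activations, tol=1e-6):
    """Indices of channels whose output is (near) zero for every sample."""
    channels = len(activations[0])
    return [
        c for c in range(channels)
        if all(abs(row[c]) <= tol for row in activations)
    ]


# Toy batch: 4 samples, 3 channels (hypothetical values).
x = [
    [0.5, 1.0, -2.0],
    [1.5, 3.0, 0.0],
    [-0.5, 2.0, 2.0],
    [2.5, 0.0, 4.0],
]

# Hypothetical inherited parameters: channel 1 has a tiny scale and a
# strongly negative shift, so the ReLU zeroes it for every sample.
gamma = [1.0, 0.01, 1.0]
beta = [0.0, -3.0, 0.0]
before = dead_channels(batchnorm_relu(x, gamma, beta))

# Plain reset to default scale/shift. This is NOT the paper's ABRi,
# only the simplest possible re-initialization for comparison.
after = dead_channels(batchnorm_relu(x, [1.0] * 3, [0.0] * 3))
```

With these toy values, `before` is `[1]` and `after` is empty: the channel carries no signal under the inherited parameters and becomes active again once its scale and shift are reset. A blanket reset like this discards whatever the pre-trained parameters encoded, which is presumably why an adaptive scheme is of interest.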

Code & Models

Repositories

gewu-lab/revisiting-pre-training-in-audio-visual-learning
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation

Methods: L1 Regularization · Adaptive Masking