Revisiting Pre-training in Audio-Visual Learning
Ruoxuan Feng, Wenke Xia, Di Hu

TL;DR
This paper investigates the effectiveness of pre-trained models in complex audio-visual learning scenarios, identifies issues like dead channels and negative effects of strong uni-modal encoders, and proposes novel strategies to enhance their utilization.
Contribution
It introduces Adaptive Batchnorm Re-initialization (ABRi) and a two-stage Fusion Tuning strategy to improve pre-trained model performance in multi-modal audio-visual tasks.
Findings
ABRi mitigates dead channel problems in cross-modal initialization.
Fusion Tuning enhances cooperation between uni-modal encoders.
Proposed methods boost audio-visual learning performance.
Abstract
Pre-training has achieved tremendous success in enhancing model performance on various tasks, but it has been found to perform worse than training from scratch in some uni-modal situations. This inspires us to ask: are pre-trained models always effective in the more complex multi-modal scenario, especially for heterogeneous modalities such as audio and visual ones? We find that the answer is no. Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning. When cross-modal initialization is applied, the phenomenon of "dead channels" caused by abnormal Batchnorm parameters hinders the utilization of model capacity. Thus, we propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks. In multi-modal joint learning, we find a…
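To make the "dead channel" idea concrete, here is a minimal pure-Python sketch. It assumes a channel counts as dead when its output after inference-mode Batchnorm and ReLU is zero for every sample, which is an illustrative definition rather than the paper's exact criterion. The function names (`dead_channels`, `reinit_from_target`) and the toy numbers are hypothetical, and the re-initialization shown is a plain reset of Batchnorm statistics from target data, used as a stand-in. It is not the adaptive procedure (ABRi) the paper proposes, whose details are not given in this summary.

```python
import math

def bn_relu(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode Batchnorm followed by ReLU for a single channel value."""
    y = gamma * (x - mean) / math.sqrt(var + eps) + beta
    return max(0.0, y)

def dead_channels(activations, bn_params):
    """Indices of channels whose post-ReLU output is zero for every sample.

    activations: list of samples, each a list of per-channel values.
    bn_params: per-channel tuples (gamma, beta, running_mean, running_var).
    """
    dead = []
    for c, (gamma, beta, mean, var) in enumerate(bn_params):
        outputs = [bn_relu(s[c], gamma, beta, mean, var) for s in activations]
        if all(o == 0.0 for o in outputs):
            dead.append(c)
    return dead

def reinit_from_target(activations):
    """Stand-in fix (assumption, not ABRi): reset gamma=1, beta=0 and
    re-estimate mean and variance from the target-modality activations."""
    n = len(activations)
    params = []
    for c in range(len(activations[0])):
        values = [s[c] for s in activations]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n  # population variance
        params.append((1.0, 0.0, mean, var))
    return params

# Toy data: 3 samples, 2 channels of target-modality activations.
acts = [[-1.0, 0.1], [0.5, 0.3], [2.0, -0.2]]
# Channel 1 carries statistics from a mismatched source modality
# (hypothetical values), so every input is pushed below zero.
pretrained = [(1.0, 0.0, 0.0, 1.0), (1.0, -1.0, 10.0, 1.0)]

print(dead_channels(acts, pretrained))                # [1]
print(dead_channels(acts, reinit_from_target(acts)))  # []
```

In the toy data, channel 1 keeps a running mean far from the target inputs, so its normalized values are always negative and ReLU zeroes them out. Re-estimating the statistics on target data restores a mix of positive and negative values.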
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
Methods: L1 Regularization · Adaptive Masking
