Multimodal Self-Supervised Learning of General Audio Representations
Luyu Wang, Pauline Luc, Adria Recasens, Jean-Baptiste Alayrac, Aaron van den Oord

TL;DR
This paper introduces a multimodal self-supervised learning framework that leverages video data to improve general audio representations, achieving state-of-the-art results on AudioSet and excelling in various audio tasks.
Contribution
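The paper shows that video information improves learned audio features even when the images are low resolution, which keeps the cost of the extra modality manageable and allows larger training batches. It also shows that augmentations which mix different samples make the contrastive proxy task harder and improve performance as the batch size grows.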
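As a rough illustration of what mixing different samples can look like, the sketch below blends each example with a randomly chosen partner from the same batch. It assumes a mixup-style convex combination; the summary does not say which mixing scheme the authors use, and the function name `mix_samples` and the parameter `alpha` are hypothetical.

```python
import random


def mix_samples(batch, alpha=0.4, rng=None):
    """Blend each example with a random partner from the same batch.

    Assumption: a mixup-style convex combination, used only to
    illustrate "mixing together different samples". This is not
    necessarily the augmentation used in the paper.
    """
    rng = rng or random.Random(0)
    partners = list(range(len(batch)))
    rng.shuffle(partners)

    mixed = []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)
        lam = max(lam, 1.0 - lam)  # keep the original example dominant
        mixed.append(
            [lam * a + (1.0 - lam) * b for a, b in zip(batch[i], batch[j])]
        )
    return mixed


# Toy "waveforms" of two samples each.
batch = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
mixed = mix_samples(batch)
```

Keeping the mixing weight at 0.5 or above means each mixed example still mostly resembles its original, so its paired video remains a sensible positive while the task gets harder.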
Findings
Achieved 42.4 mAP on AudioSet classification.
Outperformed previous self-supervised methods.
Effective across multiple non-semantic audio tasks.
Abstract
We present a multimodal framework to learn general audio representations from videos. Existing contrastive audio representation learning methods mainly focus on using the audio modality alone during training. In this work, we show that additional information contained in video can be utilized to greatly improve the learned features. First, we demonstrate that our contrastive framework does not require high resolution images to learn good audio features. This allows us to scale up the training batch size, while keeping the computational load incurred by the additional video modality to a reasonable level. Second, we use augmentations that mix together different samples. We show that this is effective to make the proxy task harder, which leads to substantial performance improvements when increasing the batch size. As a result, our audio model achieves a state-of-the-art of 42.4 mAP on the…
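The cross-modal contrastive objective described in the abstract can be sketched as follows. This is a minimal illustration that assumes a symmetric InfoNCE-style loss over paired audio and video embeddings; the abstract does not specify the exact loss, and `cross_modal_nce`, `l2_normalize`, and `temperature` are hypothetical names chosen for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_modal_nce(audio_emb, video_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    Assumption: audio_emb[i] and video_emb[i] come from the same clip
    (the positive pair); every other pairing in the batch is a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    v = [l2_normalize(x) for x in video_emb]
    n = len(a)

    # Temperature-scaled cosine similarities: sim[i][j] = <a_i, v_j> / T.
    sim = [
        [sum(p * q for p, q in zip(a[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def row_loss(logits, target):
        # Cross-entropy with a numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    audio_to_video = sum(row_loss(sim[i], i) for i in range(n))
    video_to_audio = sum(
        row_loss([sim[i][j] for i in range(n)], j) for j in range(n)
    )
    return (audio_to_video + video_to_audio) / (2 * n)


aligned = cross_modal_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = cross_modal_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Under this formulation every other clip in the batch acts as a negative, which is one reason a larger batch size makes the proxy task harder.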
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
