Sound and Visual Representation Learning with Multiple Pretraining Tasks
Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool

TL;DR
This paper introduces Multi-SSL, an approach that combines multiple self-supervised learning (SSL) tasks for binaural sound and image data, and reports better downstream task performance than single-task SSL and supervised models.
Contribution
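It proposes a Multi-SSL framework that learns SSL tasks incrementally. The resulting feature representations outperform single-task SSL and supervised models on binaural sound tasks, and recent image SSL methods on VOC07 classification and COCO detection.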
Findings
Multi-SSL with incremental learning outperforms single-task SSL and supervised models on binaural sound tasks.
Multi-SSL surpasses recent image SSL methods on VOC07 classification and COCO detection.
The approach generalizes well across different data modalities, improving downstream task results.
Abstract
Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned feature representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well across all downstream tasks. Specifically, for this study, we investigate binaural sounds and image data in isolation. For binaural sounds, we propose three SSL tasks: spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream task performance on video retrieval, spatial sound super resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single…
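To make the temporal gap prediction task above more concrete, here is a minimal sketch of how training pairs for such a pretext task could be sampled. This is an illustrative assumption, not the authors' implementation: the abstract does not specify how gaps are defined or quantized, and the function name `make_gap_pairs` and the parameters `num_segments`, `max_gap`, and `num_pairs` are hypothetical. The sketch treats a clip as a sequence of equal-length segments and uses the number of skipped segments between two sampled segments as the class label.

```python
import random


def make_gap_pairs(num_segments, max_gap, num_pairs, seed=0):
    """Sample (i, j, gap) triples for a temporal-gap pretext task.

    Hypothetical formulation: i and j index two segments of the same
    clip, and the label `gap` is the number of segments skipped
    between them (j - i - 1), in the range 0..max_gap.
    """
    if num_segments < max_gap + 2:
        raise ValueError("clip too short for the requested max_gap")
    rng = random.Random(seed)  # local RNG so sampling is reproducible
    pairs = []
    for _ in range(num_pairs):
        gap = rng.randint(0, max_gap)                 # class label (inclusive)
        i = rng.randint(0, num_segments - gap - 2)    # keep j inside the clip
        j = i + gap + 1
        pairs.append((i, j, gap))
    return pairs


# Example: 50 pairs from a 10-segment clip with gaps of 0 to 3 segments.
pairs = make_gap_pairs(num_segments=10, max_gap=3, num_pairs=50)
```

In a setup like this, a model would receive the features of segments `i` and `j` and be trained to classify `gap`, so no human labels are needed.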
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
