SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes
Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson

TL;DR
SSLAM introduces a novel self-supervised learning approach that enhances audio models' ability to handle complex polyphonic soundscapes, improving performance on both monophonic and polyphonic datasets.
Contribution
The paper proposes SSLAM, a new method for self-supervised learning from audio mixtures, specifically designed to improve polyphonic audio understanding while maintaining monophonic performance.
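To make "learning from audio mixtures" concrete, here is a minimal sketch of how two clips might be combined into one polyphonic training input. It assumes a plain waveform-domain weighted sum. This summary does not describe SSLAM's actual mixing procedure, so `mix_clips` and its `gain_a` parameter are hypothetical names used only for illustration.

```python
import numpy as np

def mix_clips(a, b, gain_a=0.5):
    """Mix two mono waveforms sampled at the same rate.

    Assumption: a simple weighted sum in the waveform domain.
    This is NOT taken from the paper; it only illustrates the idea
    of building a polyphonic input from two separate clips.
    """
    n = max(len(a), len(b))
    # Zero-pad the shorter clip so both have the same length.
    pa = np.zeros(n, dtype=np.float32)
    pb = np.zeros(n, dtype=np.float32)
    pa[:len(a)] = a
    pb[:len(b)] = b
    return gain_a * pa + (1.0 - gain_a) * pb

# Usage: a 4-sample clip mixed with a shorter 2-sample clip.
mixed = mix_clips(np.ones(4), np.ones(2))
```

With equal gains, samples where both clips overlap sum the two half-scaled signals, while the tail carries only the longer clip at half amplitude.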
Findings
SSLAM improves mAP on polyphonic datasets by up to 9.1%.
SSLAM achieves 50.2 mAP on AudioSet-2M, surpassing previous methods.
SSLAM maintains or exceeds standard benchmark performance on monophonic data.
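The figures above are reported as mAP (mean average precision). As a reminder of what the metric measures, here is a minimal sketch for multi-label classification, assuming the common definition: per-class average precision over a ranked list, averaged across classes. The function names are illustrative, and the paper's exact evaluation code may differ.

```python
import numpy as np

def average_precision(y_true, scores):
    """AP for one class: mean of precision@k taken at each positive's rank."""
    order = np.argsort(-scores, kind="stable")  # rank by descending score
    hits = y_true[order].astype(float)
    if hits.sum() == 0:
        return float("nan")  # class has no positives; AP is undefined
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(Y, S):
    """Y: (clips, classes) binary labels; S: scores with the same shape."""
    aps = [average_precision(Y[:, c], S[:, c]) for c in range(Y.shape[1])]
    return float(np.nanmean(aps))  # skip classes with undefined AP

# Usage: three clips, two classes.
Y = np.array([[1, 0], [0, 1], [1, 0]])
S = np.array([[0.9, 0.1], [0.8, 0.9], [0.7, 0.2]])
map_score = mean_average_precision(Y, S)
```

In the usage example, the first class has a negative clip ranked between its two positives, which lowers its AP below 1, while the second class is ranked perfectly.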
Abstract
Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the SSL pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio SSL methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio…
Peer Reviews
Decision: ICLR 2025 Poster
The paper addresses a relevant challenge in audio SSL, especially as polyphonic environments are common in real-world audio. The two-stage training approach combined with the Source Retention Loss appears effective, allowing the model to learn from mixed audio in a way that preserves source integrity. The evaluations include a range of polyphonic datasets, and the method generally performs well across them. The component-wise analysis of the model’s objectives is also informative, showing the im
The paper does not provide an analysis of SSLAM’s generalization to unseen audio domains, such as speech or music distribution. It would be great to also observe if this technique is applicable to other domains’ downstream applications such as multi-speaker recognition or instrument recognition, where audio types could differ significantly from pre-training data.
1. The paper tackles the under-explored issue of polyphonic sound processing in self-supervised audio learning. This is crucial because real-world audio scenes rarely consist of isolated sounds, and models trained primarily on monophonic data may struggle to generalize effectively in realistic scenarios. 2. SSLAM introduces a novel training strategy by incorporating audio mixtures and a source retention loss, both well-motivated by principles of auditory scene analysis (specifically, the Ideal
1. The core contribution of SSLAM, training with mixtures of mixtures, closely resembles the MixIT [A] approach for unsupervised sound separation. The novelty seemingly lies in applying this concept within a self-supervised representation learning framework. However, the paper does not sufficiently justify why this adaptation is novel or contributes significant new insights beyond the well-established principles of mixture invariant training. 2. A major weakness is the absence of a comparison w
1. Clarity: paper is well written and apart from some sections, quite easy to read. 2. Soundness seems sufficient.
1. Small set of evaluated tasks: not enough diversity in evaluated downstream tasks. Only speech (keyword spotting) and in-domain audio classification tasks are evaluated. Instead of evaluating KS2 and KS1, either one would've sufficed. Dataset choice is also not motivated well enough. 2. Evaluation itself: no mean/std or confidence intervals are reported, which would be even more useful for polyphonic evaluations. Are the downstream results reported from a single test run? 3. The overall objec
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
