Investigating the Sensitivity of Pre-trained Audio Embeddings to Common Effects
Victor Deng (ENS-PSL), Changhong Wang (LTCI, S2A, IDS), Gael Richard (S2A, IDS, LTCI), Brian McFee (NYU)

TL;DR
This paper examines how pre-trained audio embeddings respond to common audio effects, revealing that these embeddings do not linearly encode effects and that removing effect directions does not enhance robustness.
Contribution
The paper introduces a method for analyzing how sensitive audio embeddings are to audio effects, and shows that the effects deform the embeddings in a high-dimensional, non-linear way in embedding space.
Findings
Embeddings move monotonically with effect strength along certain directions.
The deformation subspace is high-dimensional, indicating non-linearity.
Removing estimated effect directions does not improve robustness.
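The third finding concerns removing an estimated effect direction from the embeddings. The sketch below is one simple way this kind of operation could look, not the authors' implementation: it uses synthetic data in place of real embeddings, estimates a direction by least squares (for a one-dimensional target such as effect strength, the CCA weight vector is proportional to the least-squares solution), and projects it out. The function names `effect_direction` and `remove_direction`, the array sizes, and the noise level are all assumptions made for illustration.

```python
import numpy as np

def effect_direction(embeddings, strengths):
    """Unit vector w whose projection X @ w correlates most with strengths.

    For a 1-D target, CCA reduces to ordinary least squares on centered data.
    """
    X = embeddings - embeddings.mean(axis=0)
    s = strengths - strengths.mean()
    w, *_ = np.linalg.lstsq(X, s, rcond=None)
    return w / np.linalg.norm(w)

def remove_direction(embeddings, direction):
    """Project every embedding onto the hyperplane orthogonal to direction."""
    d = direction / np.linalg.norm(direction)
    return embeddings - np.outer(embeddings @ d, d)

# Synthetic stand-in for embeddings (assumption: not real model outputs).
rng = np.random.default_rng(0)
n, dim = 200, 16
strengths = rng.uniform(0.0, 1.0, n)
true_dir = np.zeros(dim)
true_dir[0] = 1.0
embeddings = 0.1 * rng.normal(size=(n, dim)) + np.outer(strengths, true_dir)

w = effect_direction(embeddings, strengths)
corr_before = np.corrcoef(embeddings @ w, strengths)[0, 1]

cleaned = remove_direction(embeddings, w)
residual = np.abs(cleaned @ w).max()  # component left along w after projection
```

In this toy setup the effect is linear by construction, so a single projection removes it. The finding above says that this does not carry over to real embeddings, where the deformation is high-dimensional and non-linear.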
Abstract
In recent years, foundation models have significantly advanced data-driven systems across various domains. Yet their underlying properties, especially when functioning as feature extractors, remain under-explored. In this paper, we investigate the sensitivity to audio effects of audio embeddings extracted from widely used foundation models, including OpenL3, PANNs, and CLAP. We focus on audio effects as the source of sensitivity due to their prevalence in large audio datasets. By applying parameterized audio effects (gain, low-pass filtering, reverberation, and bitcrushing), we analyze the correlation between the deformation trajectories and the effect strength in the embedding space. We propose to quantify the dimensionality and linearizability of the deformation trajectories induced by audio effects using canonical correlation analysis. We find that there exists a direction…
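To make "parameterized audio effects" concrete, here is a minimal sketch of two of the four effects named in the abstract, gain and bitcrushing, applied at several strengths to a test tone. The exact parameterizations used in the paper are not given in this summary, so the formulas below are common textbook definitions and should be read as assumptions. The names `apply_gain` and `bitcrush`, the sample rate, and the 440 Hz tone are hypothetical choices. Low-pass filtering and reverberation are omitted to keep the example dependent on NumPy only.

```python
import numpy as np

def apply_gain(signal, gain_db):
    """Scale the amplitude by a gain given in decibels."""
    return signal * 10.0 ** (gain_db / 20.0)

def bitcrush(signal, bits):
    """Quantize a signal in [-1, 1] to a grid with step 2 ** -(bits - 1)."""
    levels = 2 ** (bits - 1)
    return np.round(np.clip(signal, -1.0, 1.0) * levels) / levels

# One second of a 440 Hz sine as a stand-in for a real audio clip (assumption).
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Sweep the effect strength; each variant would then be fed to an embedding model.
quieter = apply_gain(tone, -6.0)
crushed = {bits: bitcrush(tone, bits) for bits in (4, 8, 12)}
errors = {bits: np.abs(c - tone).max() for bits, c in crushed.items()}
```

Sweeping one parameter while holding the clip fixed yields a sequence of embeddings per clip, which is the kind of deformation trajectory the abstract relates to effect strength.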
