TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization
Hugo Malard, Michel Olvera, Stephane Lathuiliere, Slim Essid

TL;DR
This paper introduces a training-free method for sound-prompted image segmentation that uses non-negative matrix factorization on pre-trained models to identify shared concepts, achieving state-of-the-art results.
Contribution
The approach applies NMF to co-factorize audio and visual features from frozen pre-trained models, revealing shared concepts that guide sound-prompted segmentation without any additional training.
Findings
Achieves state-of-the-art performance in unsupervised sound-prompted segmentation.
Significantly outperforms previous unsupervised methods.
Demonstrates high generalization with frozen pre-trained models.
Abstract
Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art…
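The co-factorization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the audio and visual features are already non-negative and share one embedding dimension, and it uses plain Lee-Seung multiplicative updates on the stacked feature matrix so that both modalities are explained by a single concept dictionary. The names `co_factorize`, `relative_error`, and `n_concepts` are hypothetical, and the semantic constraints mentioned in the paper's title are not modeled here.

```python
import numpy as np


def co_factorize(audio_feats, visual_feats, n_concepts, n_iter=200, eps=1e-9, seed=0):
    """Factorize two non-negative matrices with one shared concept dictionary.

    audio_feats:  (T, d) non-negative audio features (assumed, e.g. one row per frame)
    visual_feats: (P, d) non-negative visual features (assumed, e.g. one row per patch)
    Returns (U_audio, U_visual, W) with audio_feats ~ U_audio @ W and
    visual_feats ~ U_visual @ W, where W has shape (n_concepts, d).
    """
    rng = np.random.default_rng(seed)
    # Stacking the rows forces both modalities to share the dictionary W.
    X = np.vstack([audio_feats, visual_feats])
    U = rng.random((X.shape[0], n_concepts)) + eps
    W = rng.random((n_concepts, X.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius-norm NMF objective;
        # they keep every entry of U and W non-negative.
        U *= (X @ W.T) / (U @ W @ W.T + eps)
        W *= (U.T @ X) / (U.T @ U @ W + eps)
    n_audio = audio_feats.shape[0]
    return U[:n_audio], U[n_audio:], W


def relative_error(audio_feats, visual_feats, U_audio, U_visual, W):
    """Relative Frobenius reconstruction error over both modalities."""
    X = np.vstack([audio_feats, visual_feats])
    X_hat = np.vstack([U_audio, U_visual]) @ W
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)


# Toy data: 5 audio frames and 16 image patches built from 3 shared concepts.
rng = np.random.default_rng(1)
W_true = rng.random((3, 8))
audio = rng.random((5, 3)) @ W_true
visual = rng.random((16, 3)) @ W_true

U_audio, U_visual, W = co_factorize(audio, visual, n_concepts=3)
print("relative error:", relative_error(audio, visual, U_audio, U_visual, W))
```

In this sketch, each column of `U_visual` can be reshaped to the patch grid to give a coarse spatial map for one concept, while the matching column of `U_audio` indicates how strongly that concept is active in the audio. How such concepts are then handed to an open-vocabulary segmentation model is described in the paper and is not reproduced here.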
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
Methods: ADaptive gradient method with the OPTimal convergence rate
