TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization

Hugo Malard; Michel Olvera; Stephane Lathuiliere; Slim Essid

arXiv:2412.01488·eess.AS·May 27, 2025

Open Access

TL;DR

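This paper introduces a training-free method for sound-prompted image segmentation. It applies non-negative matrix factorization (NMF) to audio and visual features from frozen pre-trained models to reveal shared concepts, which are passed to an open-vocabulary segmentation model, and reports state-of-the-art results among unsupervised methods.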

Contribution

The approach co-factorizes audio and visual features from frozen pre-trained models with NMF, enabling unsupervised sound-prompted segmentation without any additional training.

Findings

01 Achieves state-of-the-art unsupervised segmentation performance.

02 Significantly outperforms previous unsupervised methods.

03 Demonstrates high generalization with frozen pre-trained models.

Abstract

Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art…
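
To make the co-factorization idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that audio-frame features and image-patch features are non-negative (for example after a ReLU) and live in a common embedding dimension, and it uses the standard multiplicative-update NMF with a Frobenius loss on the stacked matrix so that both modalities share one concept dictionary. The function name `co_nmf`, its parameters, and the random toy inputs are all hypothetical; the semantic constraints and the hand-off to an open-vocabulary segmentation model described in the abstract are not modeled here.

```python
import numpy as np


def co_nmf(X_audio, X_visual, n_concepts=4, n_iter=200, eps=1e-9, seed=0):
    """Jointly factorize two non-negative matrices with one shared dictionary.

    X_audio:  (T, d) non-negative audio-frame features (assumed shape).
    X_visual: (P, d) non-negative image-patch features (assumed shape).
    Returns (H_audio, H_visual, W) with [X_audio; X_visual] ~= [H_audio; H_visual] @ W.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack([X_audio, X_visual])        # stack modalities: (T + P, d)
    n, d = X.shape
    H = rng.random((n, n_concepts)) + eps     # per-row concept activations
    W = rng.random((n_concepts, d)) + eps     # shared concept dictionary

    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius-norm NMF objective.
        H *= (X @ W.T) / (H @ W @ W.T + eps)
        W *= (H.T @ X) / (H.T @ H @ W + eps)

    T = X_audio.shape[0]
    return H[:T], H[T:], W


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_a = rng.random((5, 16))    # toy stand-in for audio features
    X_v = rng.random((12, 16))   # toy stand-in for image-patch features
    H_a, H_v, W = co_nmf(X_a, X_v)
    # Each column of H_v scores how strongly a patch expresses one concept;
    # concepts that are also active in H_a are candidates for "heard" objects.
    print(H_a.shape, H_v.shape, W.shape)
```

In this sketch the rows of `W` play the role of shared concepts, and the columns of `H_visual` can be reshaped to the patch grid to give coarse per-concept activation maps.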

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation

Methods: ADaptive gradient method with the OPTimal convergence rate