Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation
Ilyass Moummad, Marius Miron, Lukas Rauch, David Robinson, Alexis Joly, Olivier Pietquin, Emmanuel Chemla, Matthieu Geist

TL;DR
This paper introduces a data-efficient method for audio-to-image bird species retrieval that leverages text as an intermediary, enabling alignment without paired audio-image data and outperforming baselines on multiple benchmarks.
Contribution
The authors propose a novel text distillation approach that transfers visual semantics into audio representations, facilitating effective audio-to-image retrieval without requiring paired data.
Findings
Achieves strong audio-to-image retrieval performance on bioacoustic benchmarks.
Improves audio-text alignment while maintaining audio discriminative power.
Outperforms baselines based on zero-shot models and learned mappings.
Abstract
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple…
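The distillation and retrieval steps described in the abstract can be illustrated with a minimal sketch. It assumes a generic InfoNCE-style contrastive loss that pulls each audio embedding toward the frozen teacher text embedding of its species, followed by cosine-similarity ranking of image embeddings at query time. The summary does not give the paper's exact loss, temperature, or batching, so the function names (`info_nce`, `retrieve`), the temperature of 0.07, and the toy vectors are hypothetical, and the encoders themselves (BioLingual, BioCLIP-2) are replaced by plain lists of numbers.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes a nonzero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(a, b):
    # Pairwise cosine similarities between rows of a and rows of b.
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def info_nce(audio_embs, text_embs, temperature=0.07):
    # Generic audio-to-text contrastive loss: the i-th audio clip should
    # match the i-th (frozen teacher) text embedding and no other.
    # The temperature value is an illustrative assumption.
    sims = cosine_matrix(audio_embs, text_embs)
    loss = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(sims)


def retrieve(audio_emb, image_embs):
    # Rank image indices by cosine similarity to one audio query,
    # most similar first.
    sims = cosine_matrix([audio_emb], image_embs)[0]
    return sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)


# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
teacher_text = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[1.0, 0.0], [0.0, 1.0]]
misaligned_audio = [[0.0, 1.0], [1.0, 0.0]]
image_gallery = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned_audio, teacher_text)
loss_misaligned = info_nce(misaligned_audio, teacher_text)
ranking = retrieve([1.0, 0.1], image_gallery)
```

In this toy setup the loss is lower when audio embeddings already sit next to their teacher text embeddings, which is the signal that would drive fine-tuning of the audio encoder. Because images are embedded in the same space as the teacher text, the audio query can then be ranked against image embeddings even though no image is used in the loss.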
