Scaling Audio-Text Retrieval with Multimodal Large Language Models
Jilan Xu, Carl Thomé, Danijela Horak, Weidi Xie, Andrew Zisserman

TL;DR
AuroLA introduces a scalable, MLLM-based framework for audio-text retrieval that outperforms existing models with less training data by leveraging diverse data, novel training losses, and deep cross-modal re-ranking.
Contribution
The paper presents a novel MLLM-based retrieval framework with a scalable data pipeline, hybrid loss, and re-ranking, significantly improving performance over state-of-the-art methods.
Findings
AuroLA outperforms state-of-the-art models such as PE-AV.
It achieves comparable results with only 1% of PE-AV's training data.
Scaling dataset size and model capacity enhances retrieval performance.
Abstract
Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. Specifically, we make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a…
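For readers unfamiliar with the contrastive setup the abstract refers to, the following is a minimal sketch of a CLAP-style symmetric contrastive objective and embedding-based retrieval. It is not AuroLA's method: the hybrid loss and the re-ranking stage are not specified in the text above. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(audio_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarities; rows are audio, columns are text."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
            for ai in a]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio InfoNCE terms.

    Assumes audio_embs[i] and text_embs[i] form a matched pair.
    """
    logits = similarity_matrix(audio_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(transposed))


def retrieve_top_k(query_emb, candidate_embs, k=1):
    """Return indices of the k candidates most similar to the query."""
    q = l2_normalize(query_emb)
    scores = [sum(x * y for x, y in zip(q, l2_normalize(c)))
              for c in candidate_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D embeddings (hypothetical values, for illustration only).
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]

loss_matched = symmetric_contrastive_loss(audio, text)
loss_swapped = symmetric_contrastive_loss(audio, text[::-1])
best_audio_for_first_caption = retrieve_top_k(text[0], audio, k=1)
```

In this toy setup the correctly paired batch yields a lower loss than the batch with swapped captions, which is the behavior a contrastive objective is meant to reward. In a system like the one described, the embeddings would come from the model rather than being hand-written.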
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling
