Scaling Audio-Text Retrieval with Multimodal Large Language Models

Jilan Xu; Carl Thomé; Danijela Horak; Weidi Xie; Andrew Zisserman

arXiv:2602.18010·cs.SD·February 23, 2026

TL;DR

AuroLA introduces a scalable, MLLM-based framework for audio-text retrieval that outperforms existing models with less training data by leveraging diverse data, novel training losses, and deep cross-modal re-ranking.

Contribution

The paper presents a novel MLLM-based retrieval framework with a scalable data pipeline, hybrid loss, and re-ranking, significantly improving performance over state-of-the-art methods.

Findings

01 AuroLA outperforms state-of-the-art models like PE-AV.

02 It achieves comparable results with only 1% of PE-AV's training data.

03 Scaling dataset size and model capacity enhances retrieval performance.

Abstract

Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. Specifically, we make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a…
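
As a rough illustration of the contrastive dual-encoder setup the abstract refers to, the NumPy sketch below shows a symmetric InfoNCE-style loss of the kind commonly used in CLAP-style training, followed by cosine-similarity retrieval. It is not AuroLA's hybrid loss, and it does not reproduce how the paper extracts embeddings from the MLLM; those details are not given here. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    The temperature value is an assumption, not taken from the paper.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)


def retrieve_top_k(query_emb, candidate_embs, k=3):
    """Return indices of the k candidates most similar to the query."""
    scores = l2_normalize(candidate_embs) @ l2_normalize(query_emb)
    return np.argsort(-scores)[:k]


# Toy data: four perfectly separated "audio" and "text" embeddings.
toy = np.eye(4)
matched_loss = symmetric_contrastive_loss(toy, toy)
shuffled_loss = symmetric_contrastive_loss(toy, toy[[1, 2, 3, 0]])
top_hit = retrieve_top_k(np.array([0.0, 1.0, 0.0, 0.0]), toy, k=1)
```

On this toy batch, correctly paired embeddings yield a much lower loss than shuffled pairs, which is the signal a contrastive objective uses to pull matching audio and text together in the shared space.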

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Topic Modeling