Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo, and Du-Seong Chang

TL;DR
Omni-Embed-Audio (OEA) uses multimodal LLMs to make audio-text retrieval more robust, particularly for natural user queries and exclusion-based negative queries with acoustically similar distractors, cases that traditional caption-style benchmarks assess poorly.
Contribution
The paper introduces OEA, a retrieval encoder utilizing multimodal LLMs with native audio understanding, and proposes User-Intent Queries for more realistic robustness evaluation.
Findings
OEA achieves retrieval performance comparable to state-of-the-art models.
OEA significantly improves text-to-text retrieval accuracy (+22%).
OEA demonstrates superior hard negative discrimination (+4.3% HNSR@10).
Abstract
Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs): five formulations reflecting natural search behaviors, namely questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose discrimination metrics (HNSR, TFR) assessing models' ability to suppress acoustically similar distractors. Experiments on AudioCaps,…
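To make the evaluation setup in the abstract more concrete, here is a minimal sketch of embedding-based retrieval with a top-k hit check and a hard-negative check. It is illustrative only: the toy vectors stand in for encoder outputs, all function names are hypothetical, and `hard_negatives_suppressed` is an assumed stand-in for the idea of suppressing distractors. It is not the paper's definition of HNSR or TFR, which the truncated abstract does not give.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_items(query_vec, item_vecs):
    """Return item ids sorted by descending cosine similarity to the query."""
    return sorted(item_vecs, key=lambda i: cosine(query_vec, item_vecs[i]), reverse=True)

def hit_at_k(ranked, target, k):
    """True if the target item appears among the top-k results."""
    return target in ranked[:k]

def hard_negatives_suppressed(ranked, hard_negatives, k):
    """Assumed check (not the paper's metric): no hard negative appears in the top-k."""
    return not any(item in hard_negatives for item in ranked[:k])

# Toy embeddings (assumption): "shower" is acoustically close to "rain", "dog" is not.
audio_vecs = {"rain": [1.0, 0.0], "shower": [0.9, 0.1], "dog": [0.0, 1.0]}
query_vec = [1.0, 0.0]  # stands in for the embedded text query

ranked = rank_items(query_vec, audio_vecs)
top1_hit = hit_at_k(ranked, "rain", k=1)
suppressed_at_2 = hard_negatives_suppressed(ranked, {"shower"}, k=2)
```

In this toy case the target ranks first, but the similar-sounding distractor still lands in the top two. That is the kind of failure a discrimination metric is meant to expose and that a plain top-k hit rate does not.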
