Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

Meizhu Liu; Matthew Rowe; Amit Agarwal; Michael Avendi; Yassi Abbasi; Hitesh Laxmichand Patel; Paul Li; Kyu J. Han; Tao Sheng; Sujith Ravi; Dan Roth

arXiv:2604.23323·cs.CL·April 28, 2026

TL;DR

This paper introduces a robust audio-text retrieval framework that uses cross-modal attention and a hybrid loss to improve semantic alignment, especially for noisy and long audio.

Contribution

The paper proposes a multimodal retrieval model with a cross-modal refinement module and a hybrid loss, improving robustness to noise and long audio sequences.

Findings

01. Outperforms prior methods on benchmark datasets.

02. Handles noisy audio effectively at SNR 5 to 15.

03. Improves training stability under small-batch training through the hybrid loss.

Abstract

Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $L_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based…
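
To make the hybrid loss concrete, the following is a minimal sketch of one way cosine, $L_{1}$, and contrastive terms could be blended over a batch of paired audio and text embeddings. The abstract does not give the exact formulation, so everything specific here is an assumption for illustration: the function names, the equal default weights `w_cos`, `w_l1`, `w_con`, the temperature of 0.07, and the symmetric InfoNCE form of the contrastive term.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def l1_distance(u, v):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def contrastive(audio, text, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form): pair i is the positive,
    every other item in the batch acts as a negative."""
    n = len(audio)
    sims = [[cosine(a, t) / temperature for t in text] for a in audio]

    def row_loss(row, target):
        # Numerically stable log-sum-exp minus the positive logit.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        return log_z - row[target]

    audio_to_text = sum(row_loss(sims[i], i) for i in range(n)) / n
    text_to_audio = sum(
        row_loss([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (audio_to_text + text_to_audio)


def hybrid_loss(audio, text, w_cos=1.0, w_l1=1.0, w_con=1.0, temperature=0.07):
    """Weighted sum of the three terms; the weights are hypothetical."""
    n = len(audio)
    cos_term = sum(1.0 - cosine(a, t) for a, t in zip(audio, text)) / n
    l1_term = sum(l1_distance(a, t) for a, t in zip(audio, text)) / n
    con_term = contrastive(audio, text, temperature)
    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term
```

In this sketch the cosine and $L_{1}$ terms are computed per pair and need no in-batch negatives, so they still produce a training signal when the batch holds only one or two pairs, whereas the contrastive term collapses to zero for a single pair. That is one plausible reading of why such a blend would tolerate small batches; the paper's own argument may differ.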
