Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
Meizhu Liu, Matthew Rowe, Amit Agarwal, Michael Avendi, Yassi Abbasi, Hitesh Laxmichand Patel, Paul Li, Kyu J. Han, Tao Sheng, Sujith Ravi, Dan Roth

TL;DR
This paper introduces a robust audio-text retrieval framework that uses cross-modal attention and a hybrid loss to improve semantic alignment, especially with noisy and long audio data.
Contribution
The paper proposes a multimodal retrieval model that pairs a cross-modal embedding refinement module with a hybrid loss, improving robustness to noisy and long audio.
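To make the idea of bidirectional cross-modal attention concrete, here is a minimal NumPy sketch in which audio tokens attend over text tokens and vice versa. It is an illustration under assumptions, not the paper's module: the function names, the residual update, the token shapes, and the absence of learned projections are all hypothetical choices made for brevity.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Scaled dot-product attention: each query row attends over `context` rows.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context, weights


def bidirectional_refine(audio_tokens, text_tokens):
    """Refine both modalities with attention in each direction.

    Assumption: a plain residual sum combines each token with its attended context.
    """
    audio_ctx, audio_to_text = cross_attend(audio_tokens, text_tokens)
    text_ctx, text_to_audio = cross_attend(text_tokens, audio_tokens)
    refined_audio = audio_tokens + audio_ctx
    refined_text = text_tokens + text_ctx
    return refined_audio, refined_text, audio_to_text, text_to_audio


# Hypothetical toy input: 6 audio tokens and 4 text tokens of dimension 8.
rng = np.random.default_rng(0)
audio_tokens = rng.standard_normal((6, 8))
text_tokens = rng.standard_normal((4, 8))
refined_audio, refined_text, a2t, t2a = bidirectional_refine(audio_tokens, text_tokens)
```

Each row of `a2t` is a distribution over text tokens and each row of `t2a` is a distribution over audio tokens, so the two modalities can have different sequence lengths.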
Findings
Outperforms prior methods on benchmark datasets.
Effectively handles noisy audio with SNR 5 to 15.
Improves stability with hybrid loss under small-batch training.
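For readers who want to reproduce a noise condition like the one listed above, the following sketch mixes a noise signal into clean audio at a chosen signal-to-noise ratio. It assumes the SNR values are expressed in decibels, which the summary does not state; the function names and the synthetic sine-plus-Gaussian signals are hypothetical and are not the paper's data pipeline.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that `clean + scaled noise` has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve snr_db = 10 * log10(p_clean / (scale**2 * p_noise)) for scale.
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise


def measured_snr_db(clean, noisy):
    """Recover the SNR in dB from the clean signal and its noisy version."""
    residual = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Hypothetical example: a 1-second 440 Hz tone at 16 kHz with white Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noise = rng.standard_normal(clean.shape)
noisy_5 = mix_at_snr(clean, noise, snr_db=5.0)
noisy_15 = mix_at_snr(clean, noise, snr_db=15.0)
```

Lower SNR means more noise energy relative to the signal, so `noisy_5` is the harder condition of the two.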
Abstract
Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, , and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based…
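As a rough illustration of how a hybrid objective of this kind can be assembled, the sketch below blends a cosine-similarity term with a symmetric contrastive (InfoNCE-style) term. It is a sketch under assumptions rather than the authors' loss: the abstract as excerpted here names a third term that is missing from the text, and it is deliberately left out; the weights `w_cos` and `w_con`, the temperature, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Normalize each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(z, axis):
    """Numerically stable log-softmax along `axis`."""
    z = z - np.max(z, axis=axis, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=axis, keepdims=True))


def cosine_loss(a, t):
    """Mean of (1 - cosine similarity) over matched audio-text pairs."""
    return float(np.mean(1.0 - np.sum(a * t, axis=1)))


def contrastive_loss(a, t, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` should match row i of `t` and vice versa."""
    logits = a @ t.T / temperature
    idx = np.arange(len(a))
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx]
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx]
    return float(np.mean(0.5 * (audio_to_text + text_to_audio)))


def hybrid_loss(audio_emb, text_emb, w_cos=0.5, w_con=0.5):
    """Weighted blend of the two terms; the weights are assumed, not from the paper."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    return w_cos * cosine_loss(a, t) + w_con * contrastive_loss(a, t)


# Hypothetical small batch: 4 pairs of 16-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 16))
aligned = hybrid_loss(emb, emb)                          # perfectly matched pairs
misaligned = hybrid_loss(emb, np.roll(emb, 1, axis=0))   # every pair shifted by one
```

One reason such a blend may help with small batches is that the cosine term depends only on matched pairs, so it still provides a training signal when the batch contains few negatives for the contrastive term.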
