Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
Yifei Xin, Yuexian Zou

TL;DR
This paper proposes a hierarchical cross-modal interaction method and auxiliary captioning framework to improve audio-text retrieval by capturing fine-grained relationships and enhancing audio representations.
Contribution
It introduces a novel hierarchical interaction approach and auxiliary captioning framework, addressing fine-grained cross-modal relationships and leveraging generated captions for improved retrieval.
Findings
HCI significantly improves ATR performance.
Auxiliary captioning yields stable performance gains.
The combined approach outperforms existing methods.
Abstract
Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between whole audio clips and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., short segments and phrases or frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR by simultaneously exploring clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. Besides, we also present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner to perform feature interaction between audio and generated captions, which yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions can also form new audio-text pairs as data augmentation for training. Experiments…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
