Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

TL;DR
This paper introduces a multilingual audio-text retrieval method that enhances language support and achieves state-of-the-art results in English and multiple other languages with minimal additional data.
Contribution
It proposes a novel language enhancement technique using a multilingual encoder and a consistent ensemble distillation approach to improve cross-lingual audio-text retrieval.
Findings
State-of-the-art performance on AudioCaps and Clotho datasets.
Effective retrieval in seven languages with only 10% extra training data.
Supports variable-length audio-text retrieval efficiently.
Abstract
Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in…
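The retrieval task described in the abstract can be illustrated with a minimal sketch: given a text query embedding, rank a database of audio embeddings by similarity and check whether the matching clip appears in the top k. Everything below is an assumption for illustration. The vectors are hand-made toy values, not outputs of SONAR or a CED-trained audio encoder. The function names (`retrieve`, `recall_at_k`) are hypothetical. Cosine similarity and recall@k are common choices for this kind of task, but this summary does not state which scoring function or metric the paper uses.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(query, candidates):
    """Cosine similarity between one query vector and each candidate vector."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]


def retrieve(query, candidates, k=1):
    """Indices of the k candidates most similar to the query, best first."""
    scores = cosine_scores(query, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def recall_at_k(queries, candidates, gold, k):
    """Fraction of queries whose gold candidate index is in the top k."""
    hits = sum(1 for q, g in zip(queries, gold) if g in retrieve(q, candidates, k))
    return hits / len(queries)


# Toy stand-ins for encoder outputs (assumption: both modalities share one space).
audio_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.0, 0.6]]
gold = [0, 1, 2]  # text i is assumed to describe audio clip i

if __name__ == "__main__":
    print(retrieve(text_embeddings[0], audio_embeddings, k=1))
    print(recall_at_k(text_embeddings, audio_embeddings, gold, k=1))
    print(recall_at_k(text_embeddings, audio_embeddings, gold, k=2))
```

Swapping the roles of the two lists gives audio-to-text retrieval. In a multilingual setting, the same ranking step would apply to captions in any language, provided the text encoder maps them into the shared space.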
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Natural Language Processing Techniques
Methods: Focus · Contrastive Language-Image Pre-training
