ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors
Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou

TL;DR
This paper introduces ATRI, a novel approach for multilingual audio-text retrieval that reduces data distribution errors to improve consistency and recall across multiple languages, achieving state-of-the-art results.
Contribution
The paper presents a theoretical analysis of inconsistencies in ML-ATR and proposes a new scheme using contrastive learning to mitigate data distribution errors.
Findings
Achieves state-of-the-art recall on multilingual datasets.
Improves consistency across eight languages.
Reduces data distribution errors in ML-ATR.
Abstract
Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to retrieve audio clips or multilingual texts from databases. However, existing ML-ATR schemes suffer from inconsistencies for instance similarity matching across languages. We theoretically analyze the inconsistency in terms of both multilingual modal alignment direction error and weight error, and propose the theoretical weight error upper bound for quantifying the inconsistency. Based on the analysis of the weight error upper bound, we find that the inconsistency problem stems from the data distribution error caused by random sampling of languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive learning and audio-English co-anchor contrastive learning, aiming to mitigate the negative impact of data distribution error on recall and consistency in ML-ATR. Experimental results on the translated…
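The abstract attributes the inconsistency to random sampling of languages and names 1-to-k contrastive learning as part of the remedy. The sketch below illustrates one possible reading of that idea: every audio clip is contrasted against its paired text in all k languages in the same step, instead of against a single randomly chosen language. This is an assumption-laden illustration, not the authors' implementation. The function names (`info_nce`, `one_to_k_loss`), the symmetric InfoNCE form, the temperature value, and the averaging over languages are all hypothetical choices made for the example, and the audio-English co-anchor term is not modeled.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    # Mean cross-entropy in which queries[i] is expected to match keys[i].
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def one_to_k_loss(audio, texts_by_lang, temperature=0.07):
    # Assumed formulation: symmetric InfoNCE between the audio batch and the
    # text batch of every language, averaged over all k languages, so that no
    # language is dropped by random sampling in a given step.
    audio = [normalize(a) for a in audio]
    losses = []
    for texts in texts_by_lang.values():
        texts = [normalize(t) for t in texts]
        a2t = info_nce(audio, texts, temperature)
        t2a = info_nce(texts, audio, temperature)
        losses.append(0.5 * (a2t + t2a))
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two clips, captions in two languages.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = {"en": [[1.0, 0.0], [0.0, 1.0]], "fr": [[0.9, 0.1], [0.1, 0.9]]}
swapped = {"en": [[0.0, 1.0], [1.0, 0.0]], "fr": [[0.1, 0.9], [0.9, 0.1]]}

aligned_loss = one_to_k_loss(audio, aligned)
swapped_loss = one_to_k_loss(audio, swapped)
```

In this toy setup the loss for correctly paired captions comes out lower than the loss for swapped captions, which is the behavior a contrastive objective of this kind is meant to reward.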
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Diverse Musicological Studies
Methods: Contrastive Learning
