TL;DR
This paper introduces a noise-robust cross-lingual cross-modal retrieval method for low-resource languages, leveraging machine translation and self-distillation to improve retrieval accuracy without extra labeled data.
Contribution
It proposes a novel multi-view self-distillation approach with cross-attention and back-translation techniques to enhance noise robustness in low-resource language retrieval tasks.
Findings
Significant performance improvements on three cross-modal retrieval benchmarks.
Effective noise reduction in textual embeddings from machine translation.
Compatibility with pre-trained vision-and-language models like CLIP.
Abstract
Despite the recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, rendering textual embeddings corrupted and thereby compromising the retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets to provide direct supervision from the similarity-based view and feature-based view. Besides, inspired by the…
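The two distillation views mentioned in the abstract can be illustrated with a small sketch. It is not the paper's implementation. It assumes the soft pseudo-targets are derived directly from clean source-language text embeddings, whereas the paper generates them with a cross-attention module. Every function name, the temperature value, and the choice of KL divergence and squared distance as losses are hypothetical choices made for illustration.

```python
import math

# Illustrative sketch only. All names here are hypothetical; the teacher is
# assumed to be the clean source-language embedding, standing in for the
# paper's cross-attention module.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distribution(text_vec, image_vecs, temperature):
    """Distribution over candidate images for one text query."""
    t = normalize(text_vec)
    sims = [dot(t, normalize(img)) for img in image_vecs]
    return softmax(sims, temperature)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def similarity_view_loss(teacher_texts, student_texts, image_vecs, temperature=0.1):
    """Similarity-based view: the student's text-to-image distribution is
    pulled toward the teacher's soft pseudo-target distribution."""
    losses = []
    for t_vec, s_vec in zip(teacher_texts, student_texts):
        target = similarity_distribution(t_vec, image_vecs, temperature)
        pred = similarity_distribution(s_vec, image_vecs, temperature)
        losses.append(kl_divergence(target, pred))
    return sum(losses) / len(losses)

def feature_view_loss(teacher_texts, student_texts):
    """Feature-based view: mean squared distance between unit-normalized
    teacher and student text embeddings."""
    losses = []
    for t_vec, s_vec in zip(teacher_texts, student_texts):
        t, s = normalize(t_vec), normalize(s_vec)
        losses.append(sum((a - b) ** 2 for a, b in zip(t, s)))
    return sum(losses) / len(losses)

# Toy data: two images, clean source-language captions (teacher), and
# noisier machine-translated captions (student).
images = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.1], [0.1, 1.0]]
student = [[0.6, 0.5], [0.5, 0.6]]

total_loss = similarity_view_loss(teacher, student, images) + feature_view_loss(teacher, student)
```

Both terms are zero when the student embeddings match the teacher and grow as translation noise moves the student away, which is the intuition behind supervising the target-language branch from two views at once.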
Taxonomy
Methods: Contrastive Language-Image Pre-training · Concatenated Skip Connection · Softmax
