Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval
Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang

TL;DR
This paper introduces DASD, a dynamic adapter framework with semantics disentangling, to improve cross-lingual cross-modal retrieval by adapting to varied caption expressions without target-language labeled data.
Contribution
The paper proposes a novel dynamic adapter with semantics disentangling that adapts to input caption characteristics, enhancing cross-lingual cross-modal retrieval for low-resource languages.
Findings
Effective on multiple datasets for image-text and video-text retrieval.
Compatible with various vision-language pretraining models.
Improves retrieval performance without target-language annotations.
Abstract
Existing cross-modal retrieval methods typically rely on large-scale vision-language pair data. This makes it challenging to efficiently develop a cross-modal retrieval model for under-resourced languages of interest. Therefore, Cross-lingual Cross-modal Retrieval (CCR), which aims to align vision and the low-resource language (the target language) without using any human-labeled target-language data, has gained increasing attention. A common, parameter-efficient solution is to use adapter modules to transfer the vision-language alignment ability of Vision-Language Pretraining (VLP) models from a source language to a target language. However, these adapters are usually static once learned, which makes it difficult for them to adapt to target-language captions with varied expressions. To alleviate this, we propose Dynamic Adapter with Semantics Disentangling (DASD), whose parameters…
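To make the static-versus-dynamic distinction in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the DASD implementation: the abstract is truncated before it says how the adapter parameters are produced, so the hypernetwork-style generation below (a linear map from a per-caption conditioning vector to the adapter weights) is an assumption, and the names `static_adapter`, `dynamic_adapter`, `cond`, `gen_down`, and `gen_up` are hypothetical.

```python
import numpy as np


def static_adapter(h, w_down, w_up):
    """Standard bottleneck adapter with a residual connection.

    h: (tokens, d) hidden states; w_down: (d, r); w_up: (r, d).
    The same weights are applied to every caption once training ends.
    """
    return h + np.maximum(h @ w_down, 0.0) @ w_up


def dynamic_adapter(h, cond, gen_down, gen_up, bottleneck):
    """Adapter whose weights depend on the input caption.

    ASSUMPTION: weights are generated linearly from a conditioning
    vector `cond` of shape (c,) that summarizes the caption. The paper's
    actual generation mechanism is not given in the text above.
    gen_down: (c, d * bottleneck); gen_up: (c, bottleneck * d).
    """
    d = h.shape[-1]
    w_down = (cond @ gen_down).reshape(d, bottleneck)
    w_up = (cond @ gen_up).reshape(bottleneck, d)
    return static_adapter(h, w_down, w_up)


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, c, r = 8, 4, 2
h = rng.normal(size=(5, d))                  # 5 token embeddings
gen_down = rng.normal(size=(c, d * r)) * 0.1
gen_up = rng.normal(size=(c, r * d)) * 0.1

cond_a = rng.normal(size=c)                  # conditioning for caption A
out_a = dynamic_adapter(h, cond_a, gen_down, gen_up, r)

# A zero conditioning vector yields zero adapter weights, so the
# residual path returns the input unchanged.
out_zero = dynamic_adapter(h, np.zeros(c), gen_down, gen_up, r)
```

In this sketch two captions with different conditioning vectors pass through different effective adapter weights, whereas a static adapter would apply one fixed pair of matrices to both.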
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Adapter · ALIGN
