Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval
Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, Wei Tang

TL;DR
This paper introduces LECCR, a novel method that leverages multi-modal large language models to improve cross-lingual cross-modal retrieval by enhancing semantic alignment between visual and non-English textual data.
Contribution
LECCR uses multi-modal large language models to generate detailed visual descriptions and semantic slots, improving alignment between visual and non-English representations in CCR tasks.
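To make the idea of aggregating description content into semantic slots more concrete, here is a minimal sketch that attention-pools description features into a fixed number of slot vectors. It is an illustration under stated assumptions, not the paper's implementation: the function name `aggregate_into_slots`, the scaled dot-product attention, and the toy feature values are all hypothetical, since this summary does not specify how LECCR builds its slots.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_into_slots(slot_queries, description_feats):
    """Attention-pool description features into one vector per slot.

    slot_queries: list of query vectors, one per slot (assumed learnable
        in a real model; fixed here for illustration).
    description_feats: list of feature vectors for the MLLM-generated
        description (e.g. one per token or sentence).
    """
    dim = len(description_feats[0])
    scale = math.sqrt(dim)
    slots = []
    for query in slot_queries:
        # Scaled dot-product score between this slot and every feature.
        scores = [
            sum(q * f for q, f in zip(query, feat)) / scale
            for feat in description_feats
        ]
        weights = softmax(scores)
        # Weighted sum of description features gives the slot vector.
        slot = [
            sum(w * feat[d] for w, feat in zip(weights, description_feats))
            for d in range(dim)
        ]
        slots.append(slot)
    return slots


# Toy usage: two slots, three 2-d description features.
slots = aggregate_into_slots(
    [[1.0, 0.0], [0.0, 1.0]],
    [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]],
)
```

Each slot ends up dominated by the description features most similar to its query, which is one plausible way a small set of slots could summarize a long description before interacting with visual features.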
Findings
LECCR outperforms existing methods on four CCR benchmarks.
Semantic slot aggregation enhances visual feature semantics.
Softened matching under English guidance improves cross-modal alignment.
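The last finding, softened matching under English guidance, can be sketched as follows. The assumption here is that the English caption's similarity to each visual item is turned into a soft target distribution, and the non-English caption's matching distribution is trained toward it with a cross-entropy loss. The function name `softened_matching_loss`, the temperature value, and the choice of cross-entropy are hypothetical; this summary does not give the paper's exact formulation.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softened_matching_loss(visual, english, target, temperature=0.1):
    """Cross-entropy between English-guided soft targets and the
    target-language-to-visual matching distribution, averaged over queries.

    visual:  list of visual embeddings (assumed L2-normalized).
    english: list of English caption embeddings, one per query.
    target:  list of non-English caption embeddings, aligned with `english`.
    """
    loss = 0.0
    for en_vec, tgt_vec in zip(english, target):
        # Soft targets: how the English caption distributes over visual items.
        soft_targets = softmax([dot(en_vec, v) for v in visual], temperature)
        # Prediction: how the non-English caption distributes over them.
        predicted = softmax([dot(tgt_vec, v) for v in visual], temperature)
        loss -= sum(p * math.log(q) for p, q in zip(soft_targets, predicted))
    return loss / len(target)


# Toy usage: two visual items, two caption pairs.
visual = [[1.0, 0.0], [0.0, 1.0]]
english = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = softened_matching_loss(visual, english, aligned)
loss_misaligned = softened_matching_loss(visual, english, misaligned)
```

Compared with one-hot targets, soft targets let a non-English caption receive partial credit for visually similar items, which may be more tolerant of noisy machine-translated captions.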
Abstract
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations poses challenges due to the significant semantic gap between vision and text, as well as the lower quality of non-English representations caused by pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates the multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ MLLM to generate detailed visual content descriptions and aggregate them into…
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
