Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Yabing Wang; Le Wang; Qiang Zhou; Zhibin Wang; Hao Li; Gang Hua; Wei Tang

arXiv:2409.19961·cs.CV·October 1, 2024

TL;DR

This paper introduces LECCR, a novel method that leverages multi-modal large language models to improve cross-lingual cross-modal retrieval by enhancing semantic alignment between visual and non-English textual data.

Contribution

LECCR uses multi-modal large language models to generate detailed visual descriptions and semantic slots, improving alignment between visual and non-English representations in CCR tasks.
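One way to picture aggregating description features into semantic slots is a single cross-attention step, in which each slot acts as a query over the token features of an MLLM-generated description. The sketch below is illustrative only: the function name `aggregate_into_slots`, the use of plain dot-product attention, and the toy vectors are assumptions, not details taken from the paper or its repository.

```python
import math


def aggregate_into_slots(slot_queries, description_feats):
    """Pool description token features into one vector per slot.

    Hypothetical helper: each slot query scores every token with a scaled
    dot product, turns the scores into softmax weights, and returns the
    weighted average of the token features.
    """
    dim = len(description_feats[0])
    slots = []
    for query in slot_queries:
        # Scaled dot-product score between this slot and each token.
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in description_feats
        ]
        # Numerically stable softmax over tokens.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of token features.
        slots.append([
            sum(w * token[d] for w, token in zip(weights, description_feats))
            for d in range(dim)
        ])
    return slots


# Toy example: two slots attending over two description tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(aggregate_into_slots(queries, tokens))
```

In a trained model the slot queries would be learned parameters and the resulting slot vectors would interact with the visual features; how LECCR does this is not stated in the text above.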

Findings

1. LECCR outperforms existing methods on four CCR benchmarks.

2. Semantic slot aggregation enhances visual feature semantics.

3. Softened matching under English guidance improves cross-modal alignment.
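The idea of softened matching under English guidance can be sketched as a distillation-style loss: the similarity scores between a visual item and the English captions are turned into a soft target distribution, and the scores for the non-English captions are pulled toward it. The exact loss used by LECCR is not given here, so the KL-divergence form, the function names, and the temperature value below are assumptions chosen for illustration.

```python
import math


def log_softmax(scores, temperature):
    """Numerically stable log-softmax of temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def soft_matching_loss(english_sims, target_sims, temperature=0.1):
    """KL(p_english || p_target) over the candidates in a batch.

    english_sims: similarities between one visual item and each English caption.
    target_sims:  similarities between the same item and each non-English caption.
    The English distribution acts as a soft target instead of a one-hot label.
    """
    log_p = log_softmax(english_sims, temperature)
    log_q = log_softmax(target_sims, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy example: the non-English scores disagree with the English ones,
# so the loss is positive; identical scores give a loss of zero.
print(soft_matching_loss([0.9, 0.1, 0.2], [0.1, 0.9, 0.2]))
print(soft_matching_loss([0.9, 0.1, 0.2], [0.9, 0.1, 0.2]))
```

Compared with a hard one-hot target, a soft target of this kind lets partially relevant candidates keep some probability mass, which is one plausible reason a softened objective could tolerate noisy machine-translated captions better.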

Abstract

Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations poses challenges due to the significant semantic gap between vision and text, as well as the lower quality of non-English representations caused by pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates the multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ MLLM to generate detailed visual content descriptions and aggregate them into…

Code & Models

Repositories

lijiabei-7/leccr (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling