Unified Multimodal and Multilingual Retrieval via Multi-Task Learning with NLU Integration
Xinyuan Zhang, Lina Zhang, Lisung Chen, Guangyao Liu, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang

TL;DR
This paper introduces a multi-task learning framework that unifies multimodal and multilingual retrieval tasks with integrated natural language understanding, improving efficiency and performance across diverse retrieval scenarios.
Contribution
To the authors' knowledge, this is the first work to jointly optimize multilingual image retrieval, text retrieval, and NLU tasks within a single unified model, with the aim of improving both retrieval accuracy and efficiency.
Findings
Improved retrieval performance across multiple modalities and languages.
Reduced storage and inference overhead compared to separate models.
Enhanced intent understanding through integrated NLU features.
Abstract
Multimodal retrieval systems typically employ Vision Language Models (VLMs) that encode images and text independently into vectors within a shared embedding space. Despite incorporating text encoders, VLMs consistently underperform specialized text models on text-only retrieval tasks. Moreover, introducing additional text encoders increases storage and inference overhead and exacerbates retrieval inefficiencies, especially in multilingual settings. To address these limitations, we propose a multi-task learning framework that unifies the feature representation across images, long and short texts, and intent-rich queries. To our knowledge, this is the first work to jointly optimize multilingual image retrieval, text retrieval, and natural language understanding (NLU) tasks within a single framework. Our approach integrates image and text retrieval with a shared text encoder that is enhanced…
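As a rough illustration of what retrieval in a shared embedding space means, the sketch below ranks images and documents against one query vector by cosine similarity. It is not the authors' model: the vectors are hypothetical toy values standing in for encoder outputs, and the names `retrieve`, `index`, and the item IDs are assumptions made for this example only.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def retrieve(query_vec, index, top_k=1):
    """Rank every indexed item, whatever its modality, against the query."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical toy embeddings; a real system would obtain these from
# an image encoder and a text encoder that share one embedding space.
index = {
    "image:cat.jpg": [0.9, 0.1, 0.0],
    "image:car.jpg": [0.0, 0.2, 0.9],
    "doc:cat-care": [0.8, 0.3, 0.1],
}

# Stand-in for the text encoder's output on a user query.
query = [1.0, 0.2, 0.0]
top = retrieve(query, index, top_k=2)
print(top)
```

Because images and documents live in the same index here, a single query embedding serves both image and text retrieval, which is the storage and inference saving the abstract attributes to using one shared text encoder instead of several.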
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Image Retrieval and Classification Techniques
