Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text
Piyush Singh Pasi

TL;DR
The paper presents M2M, a lightweight linear alignment method that uses English-only text to map multilingual text embeddings into a multimodal space, enabling zero-shot image and audio retrieval across languages.
Contribution
M2M learns a few linear layers that map multilingual text embeddings into a multimodal embedding space using only English text, improving zero-shot multilingual multimodal performance.
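As a rough illustration of the idea, the sketch below fits a single linear map from one embedding space to another on paired "English" embeddings and then applies it to new inputs. It is not the paper's implementation: the embeddings are synthetic stand-ins for a multilingual text encoder and a multimodal text encoder, the dimensions and the names `fit_linear_map` and `apply_linear_map` are hypothetical, and closed-form least squares is an assumed substitute for whatever training objective M2M actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n English sentences, source (multilingual) and
# target (multimodal) embedding dimensions. None of these come from the paper.
n, d_src, d_tgt = 2000, 64, 32

# Synthetic stand-ins for embeddings of the same English sentences
# produced by the two encoders.
X_en = rng.normal(size=(n, d_src))
W_true = rng.normal(size=(d_src, d_tgt))
Y_en = X_en @ W_true + 0.01 * rng.normal(size=(n, d_tgt))


def fit_linear_map(X, Y):
    """Fit weights W and bias b minimizing ||X @ W + b - Y||^2 (ordinary least squares)."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    sol, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    return sol[:-1], sol[-1]


def apply_linear_map(X, W, b):
    """Project source-space embeddings into the target space."""
    return X @ W + b


W, b = fit_linear_map(X_en, Y_en)

# At inference time the same map would be applied to embeddings of text in
# other languages; here random vectors stand in for those embeddings.
X_new = rng.normal(size=(5, d_src))
Z_new = apply_linear_map(X_new, W, b)
```

The zero-shot transfer reported for M2M would rely on the multilingual encoder placing translations close together, so that a map fitted on English text also carries other languages into the multimodal space.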
Findings
M2M achieves 94.9% Recall@10 on English XTD text-to-image retrieval, matching baseline performance.
M2M attains 89.5% average Recall@10 across 11 languages, 10 of them unseen.
M2M demonstrates robustness in audio-text retrieval and image generation tasks.
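For readers unfamiliar with the metric, the following minimal sketch shows one common way to compute Recall@K for text-to-image retrieval. It assumes that row i of the text embeddings is paired with row i of the image embeddings and that ranking uses cosine similarity; the function name `recall_at_k` and the synthetic data are illustrative, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def recall_at_k(text_emb, image_emb, k=10):
    """Fraction of text queries whose paired image (same row index)
    ranks among the top k images by cosine similarity."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ v.T  # sims[i, j] = cosine similarity of text i and image j
    correct = np.diag(sims)[:, None]
    # Rank of the paired image = number of images scoring strictly higher.
    ranks = (sims > correct).sum(axis=1)
    return float((ranks < k).mean())


# Synthetic example: text embeddings are noisy copies of image embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(100, 16))
text_emb = image_emb + 0.1 * rng.normal(size=(100, 16))
score = recall_at_k(text_emb, image_emb, k=10)
```

A per-language score would be obtained by calling the function once per language with that language's caption embeddings, and the reported average is then the mean over languages.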
Abstract
Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers, using English text alone, to map multilingual text embeddings into multimodal space. Despite its simplicity, M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text-to-Image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
