Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text

Piyush Singh Pasi

arXiv:2601.10096 · cs.LG · January 22, 2026

TL;DR

The paper presents M2M, a lightweight linear alignment method that uses English text alone to map multilingual text embeddings into a multimodal embedding space, enabling zero-shot transfer to other languages on tasks such as text-to-image and audio-text retrieval.

Contribution

M2M introduces a lightweight alignment method that learns only a few linear layers, trained on English text alone, to map multilingual text embeddings into the multimodal space, improving zero-shot multilingual multimodal performance.
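As a rough illustration of what a linear alignment between two embedding spaces can look like, the sketch below fits a single affine map by ordinary least squares. Everything here is an assumption made for illustration: the summary does not state the training objective, the number of layers, or the encoders, so the least-squares fit, the synthetic data, and the names `fit_linear_alignment` and `apply_alignment` are hypothetical stand-ins rather than the paper's procedure.

```python
import numpy as np


def fit_linear_alignment(src, tgt):
    """Least-squares fit of W, b such that src @ W + b approximates tgt.

    src: (n, d_src) embeddings of English sentences from a multilingual text encoder.
    tgt: (n, d_tgt) embeddings of the same sentences from a multimodal text encoder.
    ASSUMPTION: a single affine layer fit in closed form, not the paper's setup.
    """
    # Append a column of ones so the bias is learned jointly with W.
    src_aug = np.hstack([src, np.ones((src.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(src_aug, tgt, rcond=None)
    return solution[:-1], solution[-1]


def apply_alignment(x, W, b):
    """Map source-space embeddings into the target space."""
    return x @ W + b


# Synthetic stand-in for paired English sentence embeddings (hypothetical data).
rng = np.random.default_rng(0)
n, d_src, d_tgt = 500, 32, 16
src_en = rng.normal(size=(n, d_src))
true_W = rng.normal(size=(d_src, d_tgt))
true_b = rng.normal(size=d_tgt)
tgt_en = src_en @ true_W + true_b  # noiseless, so the fit should be near exact

W, b = fit_linear_alignment(src_en, tgt_en)
mapped = apply_alignment(src_en, W, b)
max_err = float(np.abs(mapped - tgt_en).max())
print(W.shape, b.shape, max_err)
```

At inference time, the same `W` and `b` would be applied to embeddings of non-English text from the multilingual encoder. The premise that this transfers across languages rests on the multilingual encoder placing translations close together, which is the property the paper's zero-shot results probe.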

Findings

01

M2M achieves 94.9% Recall@10 on English XTD text-to-image retrieval, matching baseline performance.

02

M2M attains 89.5% average Recall@10 across 11 languages, 10 of them unseen during alignment.

03

M2M demonstrates robustness in audio-text retrieval and image generation tasks.
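For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@K can be computed for text-to-image retrieval. It assumes one caption per image, with the correct image for query `i` stored at row `i`, and cosine similarity as the scoring function; the function name `recall_at_k` and the toy inputs are hypothetical, and the paper's evaluation code may differ.

```python
import numpy as np


def recall_at_k(text_emb, image_emb, k=10):
    """Fraction of text queries whose paired image ranks in the top k.

    ASSUMPTION: row i of text_emb is paired with row i of image_emb.
    """
    # L2-normalise so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ v.T  # (n_queries, n_images)
    # Zero-based rank of the correct image: how many images score strictly higher.
    # Ties are therefore resolved in favour of the correct image.
    correct = np.diag(sims)[:, None]
    ranks = (sims > correct).sum(axis=1)
    return float((ranks < k).mean())


# Toy check: identical embeddings retrieve every pair at rank 0.
eye = np.eye(20)
perfect = recall_at_k(eye, eye, k=1)

# Shifting the images by one row puts exactly one wrong image ahead of each
# correct one, so every query misses at k=1 and hits at k=2.
shifted = np.roll(eye, 1, axis=0)
miss_at_1 = recall_at_k(eye, shifted, k=1)
hit_at_2 = recall_at_k(eye, shifted, k=2)
print(perfect, miss_at_1, hit_at_2)
```

A per-language average such as the one reported above would then be the mean of this quantity computed separately on each language's captions.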

Abstract

Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers, using English text alone, to map multilingual text embeddings into multimodal space. Despite its simplicity, M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text-to-Image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling