Multilingual Diversity Improves Vision-Language Representations

Thao Nguyen; Matthew Wallingford; Sebastin Santy; Wei-Chiu Ma; Sewoong Oh; Ludwig Schmidt; Pang Wei Koh; Ranjay Krishna

arXiv:2405.16915·cs.CV·September 16, 2025·2 cites

TL;DR

This paper demonstrates that incorporating multilingual data into vision-language models enhances their performance across diverse tasks and regions, highlighting the value of cultural and linguistic diversity in training datasets.

Contribution

The study systematically shows that multilingual data improves vision-language representations, and that models trained with it outperform those trained on English-only datasets on multiple benchmarks.

Findings

01. Multilingual data improves ImageNet and distribution shift performance.

02. Including non-English data benefits geographically diverse tasks like GeoDE.

03. English and non-English data are significantly different in image and text space.

Abstract

Massive web-crawled image-text datasets lay the foundation for recent progress in multimodal learning. These datasets are designed with the goal of training a model to do well on standard computer vision benchmarks, many of which, however, have been shown to be English-centric (e.g., ImageNet). Consequently, existing data curation techniques gravitate towards using predominantly English image-text pairs and discard many potentially useful non-English samples. Our work questions this practice. Multilingual data is inherently enriching not only because it provides a gateway to learn about culturally salient concepts, but also because it depicts common concepts differently from monolingual data. We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks. By translating all multilingual image-text…

Taxonomy

Topics: Media, Religion, Digital Communication