Multilingual Training and Evaluation Resources for Vision-Language Models

Daniela Baiamonte; Elena Fano; Matteo Gabburo; Stefano Simonazzi; Leonardo Rigutini; Andrea Zugarini

arXiv:2604.18347·cs.CL·April 21, 2026

TL;DR

This paper introduces multilingual training and evaluation resources for Vision-Language Models across five European languages, enhancing training datasets and benchmarks to improve model performance in multilingual settings.

Contribution

The work provides a comprehensive multilingual dataset and benchmarks for VLMs, created through regeneration-translation, and demonstrates the benefits of multilingual training.

Findings

01 Multilingual data improves VLM performance on non-English benchmarks.

02 Using multilingual, multimodal examples benefits VLM training across languages.

03 Positive transfer observed from multilingual training to English benchmarks.

Abstract

Vision-Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development remains heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and…
