MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching

Fabian David Schmidt; Florian Schneider; Chris Biemann; Goran Glavaš

arXiv:2502.12852 · cs.CL · February 19, 2025

TL;DR

MVL-SIB is a comprehensive multilingual vision-language benchmark covering 205 languages, revealing that current LVLMs struggle with low-resource languages and multi-image tasks, highlighting areas for future improvement.

Contribution

Introduces MVL-SIB, the largest multilingual VL benchmark to date, and evaluates LVLMs, exposing their limitations in low-resource language understanding and multi-image processing.

Findings

01

LVLMs perform at chance level on low-resource languages like N'Koo.

02

VL support in LVLMs is disproportionately lower than text support for low-resource languages.

03

Open-weight LVLMs do not benefit from multi-image topic representations.

Abstract

Existing multilingual vision-language (VL) benchmarks often cover only a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages -- over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for…
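
To make "no better than chance" concrete, here is a minimal sketch of how chance-level accuracy could be checked for a multiple-choice matching task. It is not the authors' evaluation code. The number of candidate options (4), the counts in the usage lines, and all function names are hypothetical and chosen only for illustration.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing among num_options candidates."""
    if num_options < 2:
        raise ValueError("need at least two candidate options")
    return 1.0 / num_options


def binomial_tail(correct: int, total: int, p: float) -> float:
    """P(X >= correct) for X ~ Binomial(total, p), computed exactly."""
    return sum(
        comb(total, k) * p**k * (1.0 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


def above_chance(correct: int, total: int, num_options: int, alpha: float = 0.05) -> bool:
    """One-sided exact binomial test: is the observed accuracy above random guessing?"""
    p_value = binomial_tail(correct, total, chance_accuracy(num_options))
    return p_value < alpha


# Hypothetical counts, assuming 4 candidate options per example.
near_chance = above_chance(correct=26, total=100, num_options=4)
well_above = above_chance(correct=60, total=100, num_options=4)
```

Under these assumed numbers, 26 correct out of 100 is statistically indistinguishable from guessing, while 60 out of 100 is clearly above it. The actual number of candidates per example should be taken from the paper or the dataset card.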

Code & Models

Datasets

WueNLP/mvl-sib
dataset · 205 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques