MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš

TL;DR
MVL-SIB is a comprehensive multilingual vision-language benchmark covering 205 languages, revealing that current LVLMs struggle with low-resource languages and multi-image tasks, highlighting areas for future improvement.
Contribution
Introduces MVL-SIB, the largest multilingual VL benchmark to date, and evaluates LVLMs, exposing their limitations in low-resource language understanding and multi-image processing.
Findings
LVLMs perform at chance level on low-resource languages like N'Koo.
VL support in LVLMs is disproportionately lower than text support for low-resource languages.
Open-weight LVLMs do not benefit from multi-image topic representations.
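The "chance level" finding above can be made concrete with a small sketch. It assumes a multiple-choice setup with a fixed number of candidates per example, so that random guessing scores 1/k; the value of four candidates used in the usage note, the function names, and the choice of a one-sided binomial test are illustrative assumptions, not details taken from the paper.

```python
from math import comb


def chance_accuracy(num_candidates: int) -> float:
    """Expected accuracy of uniform random guessing over num_candidates options."""
    return 1.0 / num_candidates


def binomial_p_value(correct: int, total: int, p_chance: float) -> float:
    """One-sided P(X >= correct) for X ~ Binomial(total, p_chance)."""
    return sum(
        comb(total, k) * p_chance**k * (1.0 - p_chance) ** (total - k)
        for k in range(correct, total + 1)
    )


def above_chance(correct: int, total: int, num_candidates: int,
                 alpha: float = 0.05) -> bool:
    """True if the observed accuracy is significantly better than guessing.

    num_candidates is an assumption about the task format (hypothetical here).
    """
    p_value = binomial_p_value(correct, total, chance_accuracy(num_candidates))
    return p_value < alpha
```

Under these assumptions, a model that answers 26 of 100 four-way questions correctly is not distinguishable from guessing (`above_chance(26, 100, 4)` is false), whereas 60 of 100 clearly is.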
Abstract
Existing multilingual vision-language (VL) benchmarks often only cover a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages -- over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
