Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages
Swastik R

TL;DR
This paper evaluates the cross-lingual visual reasoning of eight vision-language models on six Indian languages, finding accuracy drops of 9.8-25 percentage points relative to English and evidence of English-centric reasoning.
Contribution
It introduces what the author presents as the first cross-lingual visual reasoning benchmark for Indian languages and analyzes performance across seven languages (English plus six Indian languages) and eight models.
Findings
Accuracy drops by 9.8-25 percentage points (pp) when switching from English to an Indian language.
Dravidian languages suffer up to 13.2 pp more accuracy loss than Indo-Aryan languages.
Chain-of-thought prompting degrades performance on Bengali and Kannada, indicating English-centric reasoning chains.
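The findings above are stated in percentage points, which is easy to confuse with relative percent change. The following minimal sketch shows one way such a drop and a Dravidian versus Indo-Aryan gap could be computed. The family grouping of the six languages is standard; the function names, the use of a per-family mean, and the accuracy values are illustrative assumptions and are not taken from the paper, which may aggregate differently (for example, per model).

```python
from statistics import mean

# Language-family grouping of the six Indian languages named in the abstract.
FAMILY = {
    "Hindi": "Indo-Aryan",
    "Bengali": "Indo-Aryan",
    "Marathi": "Indo-Aryan",
    "Tamil": "Dravidian",
    "Telugu": "Dravidian",
    "Kannada": "Dravidian",
}

def drop_pp(english_acc, target_acc):
    """Accuracy drop in percentage points; both inputs are accuracies in percent."""
    return round(english_acc - target_acc, 1)

def family_gap_pp(acc_by_lang, english_acc):
    """Mean Dravidian drop minus mean Indo-Aryan drop, in percentage points.

    Averaging within each family is an assumption made for this sketch.
    """
    by_family = {}
    for lang, acc in acc_by_lang.items():
        by_family.setdefault(FAMILY[lang], []).append(drop_pp(english_acc, acc))
    return round(mean(by_family["Dravidian"]) - mean(by_family["Indo-Aryan"]), 1)

# Placeholder accuracies (percent) for one hypothetical model; NOT results from the paper.
example = {"Hindi": 60.0, "Bengali": 58.0, "Marathi": 59.0,
           "Tamil": 52.0, "Telugu": 50.0, "Kannada": 51.0}

print(drop_pp(70.0, example["Hindi"]))           # 10.0
print(family_gap_pp(example, english_acc=70.0))  # 8.0
```

A drop from 70% to 60% is 10 pp here, even though it is a relative decline of about 14%; the paper's figures are reported in the former unit.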
Abstract
Vision-language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross-lingual visual reasoning audit for Indian languages. 980 questions from MathVista, ScienceQA, and MMMU are translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross-verification on 50 samples per language (inter-translator agreement 0.79-0.84). Eight VLMs, from 7B open-source models to GPT-4o, are evaluated across all seven languages, yielding 68,600 inference records that include text-only and chain-of-thought ablations. I find accuracy drops of 9.8-25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo-Aryan. Chain-of-thought prompting degrades Bengali (-14.4 pp) and Kannada…
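As a quick sanity check on the scale reported in the abstract, the record count can be decomposed arithmetically. This sketch assumes that every model answers every question once per language in the main condition; under that assumption the remainder would correspond to the text-only and chain-of-thought ablation runs. How those runs are distributed across models and languages is not stated in the text above, so the code does not guess at it.

```python
QUESTIONS = 980
LANGUAGES = 7            # English plus six Indian languages
MODELS = 8
TOTAL_RECORDS = 68_600   # figure stated in the abstract

# Assumed main condition: full question x language x model grid.
main_grid = QUESTIONS * LANGUAGES * MODELS

# Whatever is left over is attributed to ablation runs (assumption).
remainder = TOTAL_RECORDS - main_grid

print(main_grid)               # 54880
print(remainder)               # 13720
print(remainder // QUESTIONS)  # 14 additional full passes over the question set
```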
