IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal

TL;DR
IndicVisionBench introduces a comprehensive benchmark for evaluating vision-language models on culturally diverse and multilingual tasks in the Indian context, revealing significant performance gaps and promoting inclusive multimodal research.
Contribution
The paper presents the first large-scale culturally and linguistically diverse benchmark for VLMs focused on India, including new datasets and evaluation of multiple models.
Findings
Current VLMs show substantial performance gaps in Indian languages and cultural contexts.
IndicVisionBench provides a new resource for analyzing biases in multimodal models.
Benchmark results highlight the need for more inclusive and culturally aware VLM development.
Abstract
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We…
Peer Reviews
Decision·ICLR 2026 Poster
- The authors' motivation for this work is intuitive. They address a well-known but often-neglected limitation in our field: the Western-centric nature of most vision-language evaluation benchmarks. The paper makes a very convincing argument for the necessity of developing resources that capture greater cultural and linguistic diversity, a direction of research that is becoming increasingly critical.
- A major strength of this paper is the comprehensive design of IndicVisionBench. It is not limit
- There is a potential for bias in the data generation process. The VQA pairs were first generated using Gemini models and then corrected by human annotators. While this is a practical and common approach, it introduces a risk of "self-enhancement" bias, where the models used for generation might produce content that is easier for them to evaluate later. The paper would be strengthened by acknowledging and discussing this potential limitation.
- The evaluation of open-ended questions relies on
This work addresses an underrepresented language and provides a valuable resource for the multilingual AI community, particularly for the Indic community. In the current era of LLM-generated and synthetic data, I appreciate the thorough effort to involve humans in data creation. The resulting dataset covers multiple tasks and languages. The annotation guidelines and interface are also clearly described, further demonstrating the authors’ commitment to transparency.
Annotator demographics are missing; this would be useful information to include for this type of dataset.
- **Comprehensive Multi-task Framework**: The benchmark's tri-modal evaluation approach (VQA, OCR, MMT) provides a holistic assessment of VLM capabilities beyond simple question-answering. The inclusion of adversarial questions represents a good approach to probing deeper cultural knowledge beyond surface-level recognition.
- **Linguistic Coverage**: The benchmark covers 10 Indic languages with diverse scripts.
- **Limited Methodological Novelty**: While the cultural focus is valuable, the benchmark construction methodology largely follows established paradigms without introducing novel evaluation frameworks. The reliance on synthetic question generation using commercial models (Gemini-1.5-Flash, Gemini-2.5-Flash) raises concerns about potential biases inherited from these models.
- **Self-Preferential Bias and Evaluation Circularity**: The most severe methodological flaw is the systematic use
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
