GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models
Ashutosh Bandooni, Brindha Subburaj

TL;DR
GanitBench is a bilingual benchmark with 1527 math questions in English and Hindi, designed to evaluate vision-language models' reasoning abilities across languages and topics, highlighting current model limitations.
Contribution
Introduces GanitBench, a bilingual math reasoning benchmark in English and Hindi with diverse question formats, and evaluates state-of-the-art models on it for both performance and language bias.
Findings
GPT-4o mini outperforms other models with 38.15% accuracy.
Two-shot Chain-of-Thought improves model performance.
Model accuracy decreases when answering questions in Hindi.
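The findings above compare accuracy across languages and prompting settings. As a minimal sketch of how such a breakdown could be tallied, the snippet below groups per-question results by language and setting. The record fields (`language`, `setting`, `predicted`, `answer`), the function name `accuracy_by_group`, and the exact-match scoring rule are all assumptions made for illustration; the paper's actual scoring procedure is not described here, and the toy records are invented, not results from the benchmark.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(language, setting): accuracy in percent} using exact-match scoring."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["language"], r["setting"])
        total[key] += 1
        # Assumed scoring rule: case-insensitive exact match after trimming whitespace.
        if r["predicted"].strip().lower() == r["answer"].strip().lower():
            correct[key] += 1
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Toy records, invented purely for illustration (not data from the paper).
toy_records = [
    {"language": "en", "setting": "zero-shot", "predicted": "B", "answer": "B"},
    {"language": "en", "setting": "zero-shot", "predicted": "A", "answer": "C"},
    {"language": "hi", "setting": "zero-shot", "predicted": "D", "answer": "C"},
    {"language": "hi", "setting": "zero-shot", "predicted": "c ", "answer": "C"},
    {"language": "en", "setting": "two-shot", "predicted": "4", "answer": "4"},
    {"language": "en", "setting": "two-shot", "predicted": "7", "answer": "7"},
]

scores = accuracy_by_group(toy_records)
for (language, setting), acc in sorted(scores.items()):
    print(f"{language:>2} {setting:<9} {acc:.2f}%")
```

Keying on the (language, setting) pair keeps the English-versus-Hindi comparison and the zero-shot-versus-two-shot comparison in a single table, so either gap can be read off by fixing one field and varying the other.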
Abstract
Benchmarks for evaluating the reasoning of Vision Language Models (VLMs) across fields and domains have been curated more frequently over the last few years. However, these are often monolingual and mostly available in English. Additionally, there is a lack of Hindi datasets for tasks other than comprehension and translation. We introduce GanitBench, a tough benchmark consisting of 1527 vision-only questions covering several topics in Mathematics, available in English and Hindi. Collected from two major examinations in India, the JEE Advanced and the CBSE Boards examinations, this benchmark presents questions as images comprising the question text as well as any figures essential to the question. We evaluate two closed-source models on it, in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on…
