mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models

Arka Mukherjee; Shreya Ghosh

arXiv:2511.09339·cs.CL·November 13, 2025

TL;DR

mmJEE-Eval is a bilingual (English and Hindi) multimodal benchmark built from JEE Advanced questions for evaluating scientific reasoning in vision-language models. It exposes a large accuracy gap between frontier and open-source models that existing benchmarks do not show.

Contribution

The paper introduces mmJEE-Eval, a bilingual benchmark of 1,460 JEE Advanced questions (2019-2025) in Physics, Chemistry, and Mathematics, intended to distinguish scientific reasoning from pattern-matching more clearly than existing benchmarks do.

Findings

01

Frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions.

02

Open-source models plateau at 37-45% accuracy on held-out 2025 questions, despite scaling to 400B parameters.

03

Closed models like GPT-5 show limited error correction under increased reasoning load.

Abstract

Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontiers from Google and OpenAI show high problem-solving accuracies…

Code & Models

Datasets

ArkaMukherjee/mmJEE-Eval
dataset · 20 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Machine Learning in Materials Science · Topic Modeling