mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models
Arka Mukherjee, Shreya Ghosh

TL;DR
mmJEE-Eval is a bilingual (English and Hindi) multimodal benchmark built from JEE Advanced questions to evaluate scientific reasoning in vision-language models. It exposes a gap between frontier and open-source models that existing benchmarks do not show.
Contribution
The paper introduces mmJEE-Eval, a bilingual benchmark of 1,460 JEE Advanced questions (2019-2025) in Physics, Chemistry, and Mathematics, intended to separate scientific reasoning from pattern-matching more clearly than existing benchmarks do.
Findings
Frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions.
Open-source models plateau at 37-45% accuracy on the same questions, despite scaling to 400B parameters.
Closed models like GPT-5 show limited error correction under increased reasoning load.
Abstract
Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontiers from Google and OpenAI show high problem-solving accuracies…
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning in Materials Science · Topic Modeling
