Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams
Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang, Guoping Hu

TL;DR
This paper evaluates the reasoning capabilities of 40 multimodal large language models on chemistry Olympiad questions, highlighting their limitations in modality fusion and the benefits of chain-of-thought prompting.
Contribution
It introduces a new benchmark for multimodal scientific reasoning in chemistry and systematically assesses model performance, revealing key challenges and strategies for improvement.
Findings
Models struggle with modality fusion, sometimes performing better when the image is removed.
Chain-of-Thought prompting improves accuracy and visual grounding.
Current models have significant limitations in scientific reasoning in multimodal contexts.
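The image-ablation comparison behind the first finding can be illustrated with a minimal sketch. It assumes multiple-choice questions with a single gold option and compares accuracy when the model sees the image against accuracy on the text alone. The `Result` record, the function names, and the toy predictions are hypothetical and are not taken from the paper's code or data.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One model answer per condition for a single question (hypothetical schema)."""
    question_id: str
    gold: str             # correct option label, e.g. "A".."D"
    pred_with_image: str  # model answer when the image is provided
    pred_text_only: str   # model answer when the image is removed


def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold labels."""
    if not golds:
        raise ValueError("need at least one question")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def image_ablation(results):
    """Compare accuracy with and without the image.

    A negative delta means the model did better once the image was removed.
    """
    golds = [r.gold for r in results]
    acc_img = accuracy([r.pred_with_image for r in results], golds)
    acc_txt = accuracy([r.pred_text_only for r in results], golds)
    return {"with_image": acc_img, "text_only": acc_txt, "delta": acc_img - acc_txt}


# Toy, made-up predictions purely for illustration (not results from the paper).
toy = [
    Result("q1", "A", "A", "A"),
    Result("q2", "C", "B", "C"),
    Result("q3", "D", "D", "D"),
    Result("q4", "B", "A", "A"),
]
report = image_ablation(toy)
print(report)
```

On this toy data the delta is negative, which is the pattern the finding describes: the text-only condition scores higher than the condition with the image.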
Abstract
Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion: in some cases, removing the image actually improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning in Materials Science · Topic Modeling
