Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison

Aymeric de Chillaz; Anna Sotnikova; Patrick Jermann; Antoine Bosselut

arXiv:2507.03013·cs.CY·July 8, 2025

TL;DR

This study compares human and AI performance on multimodal STEM questions, finding that AI struggles with questions involving visual components, and offers insights for designing assessments that maintain academic integrity.

Contribution

Introduces a manually annotated dataset of 201 university-level STEM questions and analyzes how multimodal features affect AI versus student performance, providing guidance for assessment design.

Findings

01

The best model answers 58.5% of questions correctly on average, using majority vote aggregation

02

Humans consistently outperform AI on questions involving visual components

03

AI performance varies with subject and question features

Abstract

Generative AI systems have rapidly advanced, with multimodal input capabilities enabling reasoning beyond text-based tasks. In education, these advancements could influence assessment design and question answering, presenting both opportunities and challenges. To investigate these effects, we introduce a high-quality dataset of 201 university-level STEM questions, manually annotated with features such as image type, role, problem complexity, and question format. Our study analyzes how these features affect generative AI performance compared to students. We evaluate four model families with five prompting strategies, comparing results to the average of 546 student responses per question. Although the best model correctly answers on average 58.5% of the questions using majority vote aggregation, human participants consistently outperform AI on questions involving visual components.…
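The abstract reports accuracy under majority vote aggregation but, in the portion shown here, does not say exactly which outputs are aggregated. The sketch below assumes one answer per prompting strategy for each question, takes the most frequent answer as the model's final answer, and breaks ties in favor of the answer seen first. The function names, the tie-breaking rule, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first.

    The tie-breaking rule is an assumption for this sketch, not the paper's.
    """
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical toy data: three multiple-choice questions, each answered
# once under five prompting strategies (values are made up).
per_question_answers = [
    ["B", "B", "C", "B", "A"],
    ["A", "D", "D", "A", "D"],
    ["C", "A", "B", "D", "C"],
]
gold = ["B", "A", "C"]

voted = [majority_vote(answers) for answers in per_question_answers]
print(voted, accuracy(voted, gold))
```

Under this reading, a question counts as correct only when the voted answer matches the reference, so a model that is right under a minority of prompting strategies still gets the question wrong.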
