Simulating Validity: Modal Decoupling in MLLM Generated Feedback on Science Drawings
Arne Bewersdorff, Nejla Yuruk, Xiaoming Zhai

TL;DR
This study examines grounding failures in the feedback that multimodal large language models (MLLMs) generate on students' science drawings, and finds that such errors are prevalent and only partly reduced by the mitigation tested.
Contribution
It uncovers grounding failures consistent with modal decoupling in off-the-shelf MLLMs and evaluates error types and mitigation strategies using 150 middle school student drawings.
Findings
41.3% of feedback instances contained at least one grounding error.
An inventory-list-first workflow reduced some error categories but did not eliminate false absence errors.
Whether feedback appears visually grounded offers limited diagnostic value for identifying invalid instances.
Abstract
In science education, students frequently construct hand-drawn visual models of scientific phenomena. These drawings rely on a visual structure where information is encoded through visual objects, their attributes, and relationships. Multimodal large language models (MLLMs) are increasingly used to generate feedback on students' hand-drawn scientific models. However, the validity of such feedback depends on whether model claims are grounded in the specific visual evidence of the student drawing. This study uncovers grounding failures, consistent with modal decoupling, in off-the-shelf MLLM feedback, where outputs remain pedagogically plausible in form while contradicting the drawing or treating depicted elements as missing. Using N = 150 middle school drawings from a kinetic molecular theory unit spanning five modeling tasks and three competence levels, we generated N = 300 feedback…
