Performance and failure modes of AI chatbots on a novel concept inventory on relativity in classical mechanics
Eugenio Tufino, Caterina Giovanzana, Andrea Zamboni, Pasquale Onorato, Stefano Oss

TL;DR
This study evaluates three frontier AI chatbots on a new concept inventory on Galilean relativity. The models reach high accuracy but show specific failure modes, mainly caused by visual misinterpretation, which raises concerns about their reliability as study tools.
Contribution
It assesses state-of-the-art LLMs on the Classical Relativity Concept Inventory (CRCI), a recently developed instrument that was not publicly available at the time of testing, and identifies their strengths and failure patterns on conceptual physics tasks.
Findings
Gemini 3 Flash achieved 97% accuracy
Models fail on some items due to visual misinterpretation
Errors are more consistent and item-dependent than student errors
Abstract
AI chatbots are increasingly used by students as study tools in physics, raising practical questions about their reliability on conceptual tasks. Existing evaluations of large language models (LLMs) on physics concept inventories rely almost exclusively on instruments that have been publicly available for years and likely appear in model training data, making it difficult to disentangle physics competence from familiarity with the test items themselves. We address this issue by evaluating three frontier LLMs (GPT-5.2, Gemini 3 Pro, Gemini 3 Flash) on the Classical Relativity Concept Inventory (CRCI), a recently developed and validated 21-item instrument on Galilean relativity that was not publicly available at the time of testing. Each item was administered 30 times per model, and all 1890 responses were qualitatively coded along three dimensions: visual interpretation, physics…
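The response count quoted in the abstract follows directly from the stated design: three models, 21 items, and 30 administrations of each item per model. A minimal sketch of that arithmetic is shown here; the variable names are illustrative and are not taken from the paper.

```python
# Check the total number of coded responses implied by the abstract.
# Variable names are hypothetical; only the figures come from the abstract.
models = ["GPT-5.2", "Gemini 3 Pro", "Gemini 3 Flash"]
n_items = 21      # items in the CRCI
n_repeats = 30    # administrations of each item per model

responses_per_model = n_items * n_repeats
total_responses = len(models) * responses_per_model

print(responses_per_model)  # 630
print(total_responses)      # 1890
```

Each model therefore contributes 630 responses, which together give the 1890 responses reported.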
