Performance and failure modes of AI chatbots on a novel concept inventory on relativity in classical mechanics

Eugenio Tufino; Caterina Giovanzana; Andrea Zamboni; Pasquale Onorato; Stefano Oss

arXiv:2605.09602·physics.ed-ph·May 12, 2026

TL;DR

This study evaluates frontier AI chatbots on a new physics concept inventory on Galilean relativity. The models reach high accuracy but show specific failure modes, mainly visual misinterpretation, that limit their reliability on conceptual tasks.

Contribution

The paper introduces a novel relativity concept inventory that was not publicly available at the time of testing and assesses state-of-the-art LLMs on it, identifying their strengths and failure patterns in conceptual physics.

Findings

01

Gemini 3 Flash achieved 97% accuracy

02

Models fail on some items due to visual misinterpretation

03

Errors are more consistent and item-dependent than student errors

Abstract

AI chatbots are increasingly used by students as study tools in physics, raising practical questions about their reliability on conceptual tasks. Existing evaluations of large language models (LLMs) on physics concept inventories rely almost exclusively on instruments that have been publicly available for years and likely appear in model training data, making it difficult to disentangle physics competence from familiarity with the test items themselves. We address this issue by evaluating three frontier LLMs (GPT-5.2, Gemini 3 Pro, Gemini 3 Flash) on the Classical Relativity Concept Inventory (CRCI), a recently developed and validated 21-item instrument on Galilean relativity that was not publicly available at the time of testing. Each item was administered 30 times per model, and all 1890 responses were qualitatively coded along three dimensions: visual interpretation, physics…
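The response count in the abstract follows directly from the study design: three models, 21 items, and 30 administrations of each item per model. The following is a minimal sketch of that arithmetic only; the function and constant names are hypothetical and are not taken from the paper.

```python
# Design parameters as stated in the abstract.
# All identifiers here are illustrative assumptions, not the authors' code.
MODELS = ["GPT-5.2", "Gemini 3 Pro", "Gemini 3 Flash"]
N_ITEMS = 21      # items in the CRCI
N_REPEATS = 30    # administrations of each item per model


def per_model_responses(n_items: int, n_repeats: int) -> int:
    """Number of responses collected from a single model."""
    return n_items * n_repeats


def total_responses(models: list[str], n_items: int, n_repeats: int) -> int:
    """Number of responses collected across all models."""
    return len(models) * per_model_responses(n_items, n_repeats)


if __name__ == "__main__":
    print(per_model_responses(N_ITEMS, N_REPEATS))      # 630
    print(total_responses(MODELS, N_ITEMS, N_REPEATS))  # 1890
```

Each model therefore contributes 630 of the 1890 coded responses.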
