Evaluating vision-capable chatbots in interpreting kinematics graphs: a comparative study of free and subscription-based models
Giulia Polverini, Bor Gregorcic

TL;DR
This study evaluates how well eight large multimodal model (LMM)-based chatbots interpret kinematics graphs, finding that OpenAI's models outperform the others and that task type influences chatbot performance.
Contribution
It provides a comparative analysis of free and subscription-based multimodal chatbots' performance on graph interpretation tasks in STEM, an area with limited prior research.
Findings
OpenAI's chatbots outperform others in graph interpretation.
ChatGPT-4o achieves the best overall performance.
Tasks with linguistic input are easier than visual interpretation tasks.
Abstract
This study investigates the performance of eight large multimodal model (LMM)-based chatbots on the Test of Understanding Graphs in Kinematics (TUG-K), a research-based concept inventory. Graphs are a widely used representation in STEM and medical fields, making them a relevant topic for exploring LMM-based chatbots' visual interpretation abilities. We evaluated both freely available chatbots (Gemini 1.0 Pro, Claude 3 Sonnet, Microsoft Copilot, and ChatGPT-4o) and subscription-based ones (Gemini 1.0 Ultra, Gemini 1.5 Pro API, Claude 3 Opus, and ChatGPT-4). We found that OpenAI's chatbots outperform all the others, with ChatGPT-4o showing the overall best performance. Contrary to expectations, we found no notable differences in the overall performance between freely available and subscription-based versions of Gemini and Claude 3 chatbots, with the exception of Gemini 1.5 Pro, available…
Taxonomy
Topics: AI in Service Interactions
