An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)
Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu

TL;DR
This paper independently evaluates ChatGPT's performance on mathematical word problems, revealing significant variability based on whether it shows its work, and analyzes factors influencing its success, while providing datasets and baseline models for future research.
Contribution
To the authors' knowledge, the first independent assessment of ChatGPT on math word problems (drawn from the DRAW-1K dataset), including an analysis of factors affecting performance and the release of a response dataset and baseline models for future work.
Findings
ChatGPT fails 20% of the time when it shows its work, compared with 84% when it does not.
Across all experiments, the probability of failure increases linearly with the number of addition and subtraction operations.
Released dataset of ChatGPT responses and baseline predictive models.
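The linear-trend finding above can be checked with a short calculation: group problems by their number of addition/subtraction operations, compute the failure rate in each group, and fit a straight line. The following is a minimal sketch only. The helper names `failure_rate_by_ops` and `fit_line` are hypothetical, the record format is an assumption about how a response dataset might be laid out, and the toy records are made up for illustration (they are not the paper's data or results).

```python
from collections import defaultdict


def failure_rate_by_ops(records):
    """Map each operation count to its observed failure rate.

    `records` is an iterable of (num_add_sub_ops, failed) pairs; this
    layout is an assumption, not the released dataset's actual schema.
    """
    counts = defaultdict(lambda: [0, 0])  # ops -> [failures, total]
    for ops, failed in records:
        counts[ops][0] += int(failed)
        counts[ops][1] += 1
    return {ops: f / n for ops, (f, n) in sorted(counts.items())}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up toy records, NOT the paper's data:
# (number of addition/subtraction operations, did the model fail?)
toy_records = [
    (0, False), (0, False), (0, False), (0, True),
    (1, False), (1, False), (1, True), (1, True),
    (2, False), (2, True), (2, True), (2, True),
]

rates = failure_rate_by_ops(toy_records)
slope, intercept = fit_line(list(rates.keys()), list(rates.values()))
print(rates)             # failure rate per operation count
print(slope, intercept)  # fitted trend
```

A positive slope with small residuals would be consistent with a linear increase; on real responses, the per-group sample sizes should be checked before reading much into the fit.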
Abstract
We study the performance of a commercially available large language model (LLM) known as ChatGPT on math word problems (MWPs) from the dataset DRAW-1K. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work, failing 20% of the time when it provides work compared with 84% when it does not. Further, several factors of MWPs relating to the number of unknowns and the number of operations lead to a higher probability of failure when compared with the prior; specifically, we note (across all experiments) that the probability of failure increases linearly with the number of addition and subtraction operations. We have also released the dataset of ChatGPT's responses to the MWPs to support further work on the characterization of LLM performance and present baseline machine…
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
