Analysing Mathematical Reasoning Abilities of Neural Models
David Saxton, Edward Grefenstette, Felix Hill, Pushmeet Kohli

TL;DR
This paper introduces a new evaluation framework for neural models' mathematical reasoning abilities, focusing on structured problem-solving across various math domains to assess capabilities and failure modes.
Contribution
It develops a comprehensive task suite for evaluating neural architectures on mathematical reasoning, enabling detailed analysis of their problem-solving and generalization skills.
Findings
Model performance differs significantly across architectures.
Models show varying abilities to generalize mathematical knowledge.
The task suite reveals specific failure modes in neural reasoning.
Abstract
Mathematical reasoning, a core ability within human intelligence, presents some unique challenges as a domain: we do not come to understand and solve mathematical problems primarily on the back of experience and evidence, but on the basis of inferring, learning, and exploiting laws, axioms, and symbol manipulation rules. In this paper, we present a new challenge for the evaluation (and eventually the design) of neural architectures and similar systems, developing a task suite of mathematics problems involving sequential questions and answers in a free-form textual input/output format. The structured nature of the mathematics domain, covering arithmetic, algebra, probability and calculus, enables the construction of training and test splits designed to clearly illuminate the capabilities and failure-modes of different architectures, as well as evaluate their ability to compose and…
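The free-form textual question/answer format and the idea of purpose-built train and test splits can be illustrated with a minimal sketch. This is not the authors' generator: the function names, the question template, the number ranges, and the exact-match scoring rule are all assumptions chosen for illustration. It produces addition questions as plain strings, builds a training split from small operands and a separate test split from larger operands, and scores a stand-in solver by string equality.

```python
import random
import re


def make_addition_example(rng, max_value):
    """Return one (question, answer) pair, both as plain strings."""
    a = rng.randint(0, max_value)
    b = rng.randint(0, max_value)
    return f"What is {a} + {b}?", str(a + b)


def make_split(seed, size, max_value):
    """Build a reproducible list of examples with operands in [0, max_value]."""
    rng = random.Random(seed)
    return [make_addition_example(rng, max_value) for _ in range(size)]


def exact_match_accuracy(predict, examples):
    """Fraction of examples where the predicted string equals the answer."""
    correct = sum(predict(question) == answer for question, answer in examples)
    return correct / len(examples)


def oracle(question):
    """Stand-in for a model: parses the hypothetical template and adds."""
    match = re.fullmatch(r"What is (\d+) \+ (\d+)\?", question)
    return str(int(match.group(1)) + int(match.group(2)))


# Hypothetical splits: train on small operands, test on larger ones.
train = make_split(seed=0, size=1000, max_value=99)
test_larger = make_split(seed=1, size=100, max_value=9999)

accuracy = exact_match_accuracy(oracle, test_larger)
```

A learned model would replace `oracle`; comparing its accuracy on held-out examples drawn from the training range with its accuracy on `test_larger` is one way such splits can expose generalization failures.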
Videos
DeepMind Made a Math Test For Neural Networks · YouTube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · AI-based Problem Solving and Planning
