Numerical reasoning in machine reading comprehension tasks: are we there yet?
Hadeel Al-Negheimish, Pranava Madhyastha, Alessandra Russo

TL;DR
This paper examines whether current NLP models genuinely perform numerical reasoning in machine reading comprehension, and finds that standard metrics may not measure that ability.
Contribution
The study provides a controlled analysis of top-performing model architectures on DROP, highlighting the limits of existing metrics in assessing genuine numerical reasoning.
Findings
Top-performing models score near human level on standard metrics, yet the controlled study suggests they have not learned to reason.
Standard benchmarks may overestimate models' understanding of numerical reasoning.
Metrics do not effectively differentiate between superficial pattern matching and genuine reasoning.
Abstract
Numerical reasoning based machine reading comprehension is a task that involves reading comprehension along with using arithmetic operations such as addition, subtraction, sorting, and counting. The DROP benchmark (Dua et al., 2019) is a recent dataset that has inspired the design of NLP models aimed at solving this task. The current standings of these models in the DROP leaderboard, over standard metrics, suggest that the models have achieved near-human performance. However, does this mean that these models have learned to reason? In this paper, we present a controlled study on some of the top-performing model architectures for the task of numerical reasoning. Our observations suggest that the standard metrics are incapable of measuring progress towards such tasks.
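To make the concern about standard metrics concrete, here is a minimal sketch of answer-level scoring in the style used by reading comprehension leaderboards: exact match and token-overlap F1. The function names and the normalization rules are assumptions for illustration, and this is a simplification rather than the official DROP evaluation script. It shows that such scores depend only on the final answer string, not on how the answer was reached.

```python
import string
from collections import Counter

ARTICLES = {"a", "an", "the"}


def normalize(text):
    """Lowercase, trim punctuation at token edges, drop English articles."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in ARTICLES]


def exact_match(prediction, gold):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# A correct final answer gets full credit whether it came from
# arithmetic over the passage or from a lucky surface-level guess.
print(exact_match("The 7", "7"))
print(token_f1("7 yards", "7"))
```

Because both functions see only the predicted and gold strings, they cannot distinguish a model that computed the answer from one that matched a pattern, which is the gap the paper's controlled study probes.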
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
