TL;DR
This paper introduces a causal framework for analyzing and quantifying the robustness of language models in mathematical reasoning. It finds that robustness does not improve steadily with model size, although GPT-3 Davinci (175B) significantly outperforms smaller variants in both robustness and sensitivity.
Contribution
It proposes a novel causal analysis framework grounded in behavioral testing to evaluate the robustness of language models in mathematical reasoning tasks.
Findings
GPT-3 Davinci (175B) is significantly more robust and more sensitive than the smaller variants.
Robustness does not improve steadily with model size.
Behavioral analysis reveals reliance on shallow patterns in problem descriptions.
Abstract
We have recently witnessed a number of impressive results on hard mathematical reasoning problems with language models. At the same time, the robustness of these models has also been called into question; recent works have shown that models can rely on shallow patterns in the problem description when generating a solution. Building on the idea of behavioral testing, we propose a novel framework, which pins down the causal effect of various factors in the input, e.g., the surface form of the problem text, the operands, and math operators on the output solution. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework on a test bed of math word problems. Our analysis shows that robustness does not appear…
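The distinction the abstract draws between robustness and sensitivity can be illustrated with a small sketch. It is not the paper's method or metric: the templates, the two toy "models", and the `change_rate` helper are all hypothetical, and the change rate is a simplified stand-in for a causal effect measure. The idea is that an intervention on the surface form (same operands, same answer) should leave a good model's prediction unchanged, while an intervention on an operand that alters the answer should change it.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical type alias: a "model" maps problem text to a predicted integer.
Model = Callable[[str], int]

# Two hypothetical surface forms of the same addition problem.
T_A = "Ann has {n1} apples and buys {n2} more. How many apples does she have in total?"
T_B = "Ann has {n1} apples and buys {n2} more. How many apples does Ann have now?"


def numbers(text: str) -> List[int]:
    """Extract all integers that appear in the problem text."""
    return [int(x) for x in re.findall(r"\d+", text)]


def ideal_model(text: str) -> int:
    """Toy stand-in for a model that actually reasons: adds the operands."""
    return sum(numbers(text))


def shallow_model(text: str) -> int:
    """Toy stand-in for a model keyed to a surface cue: adds only if 'total' appears."""
    nums = numbers(text)
    return sum(nums) if "total" in text else nums[0]


def change_rate(model: Model, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (original, intervened) pairs on which the prediction changes."""
    return sum(model(a) != model(b) for a, b in pairs) / len(pairs)


operands = [(3, 4), (10, 2), (7, 7)]

# Intervention on surface form only: the correct answer stays the same.
surface_pairs = [(T_A.format(n1=a, n2=b), T_B.format(n1=a, n2=b)) for a, b in operands]

# Intervention on an operand: the correct answer changes.
operand_pairs = [(T_B.format(n1=a, n2=b), T_B.format(n1=a, n2=b + 1)) for a, b in operands]

for name, model in [("ideal", ideal_model), ("shallow", shallow_model)]:
    print(name,
          "surface change rate:", change_rate(model, surface_pairs),
          "operand change rate:", change_rate(model, operand_pairs))
```

In this toy setup, a low change rate under surface-form interventions corresponds to robustness, and a high change rate under answer-altering operand interventions corresponds to sensitivity. The shallow model shows the opposite profile on both counts.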
Taxonomy
Methods
Multi-Head Attention · Attention Is All You Need · GPT-3 · Test · Linear Layer · Cosine Annealing · Dense Connections
