Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Xiaoyuan Li; Wenjie Wang; Moxin Li; Junrong Guo; Yang Zhang; Fuli Feng

arXiv:2406.00755·cs.CL·June 5, 2024·1 cites

TL;DR

This paper introduces a new evaluation framework for mathematical reasoning in large language models, focusing on error identification and correction from the examiner's perspective, and provides a dataset and insights for improving LLM performance.

Contribution

The paper defines four new evaluation tasks for error identification and correction, creates a dataset annotated with error types and error steps, and assesses eleven LLMs, highlighting the impact of prompting strategies.

Findings

1. GPT-4 outperforms all other evaluated models.
2. LLaMA-2-7B performs comparably to GPT-3.5 and Gemini Pro.
3. Calculation errors are the most challenging error type.

Abstract

The rapid advancement of Large Language Models (LLMs) in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving from the examinee perspective, overlooking the dual perspective of the examiner regarding error identification and correction. From the examiner perspective, we define four evaluation tasks for error identification and correction along with a new dataset with annotated error types and steps. We also design diverse prompts to thoroughly evaluate eleven representative LLMs. Our principal findings indicate that GPT-4 outperforms all other models, while the open-source model LLaMA-2-7B demonstrates comparable abilities to the closed-source models GPT-3.5 and Gemini Pro. Notably, calculation error proves to be the most challenging error type. Moreover, prompting LLMs with the…

Code & Models

Repositories

littlecirc1e/eic
PyTorch · Official

Videos

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction · underline

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Cosine Annealing · Softmax · Focus · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer