MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit

Boning Zhang; Chengxi Li; Kai Fan

arXiv:2404.13925 · cs.CL · April 23, 2024

TL;DR

This paper introduces a comprehensive evaluation toolkit for mathematical language models that combines a computer algebra system with an optional LLM, enabling more consistent and robust assessments across different datasets.

Contribution

The authors present a unified, generalizable evaluation toolkit for math LLMs that integrates a computer algebra system (CAS) with an optional LLM, improving evaluation consistency and robustness across datasets.

Findings

1. The toolkit provides more robust evaluation results than prior methods.
2. Incorporating an LLM enhances evaluation accuracy.
3. The toolkit is validated on two distinct datasets.

Abstract

Large language models (LLMs) have been explored in a variety of reasoning tasks, including solving mathematical problems. Each math dataset typically includes its own specially designed evaluation script, which, while suitable for its intended use, lacks generalizability across different datasets. Consequently, updates and adaptations to these evaluation tools tend to occur without being systematically reported, leading to inconsistencies and obstacles to fair comparison across studies. To bridge this gap, we introduce a comprehensive mathematical evaluation toolkit that not only utilizes a Python computer algebra system (CAS) for its numerical accuracy, but also integrates an optional LLM, known for its considerable natural language processing capabilities. To validate the effectiveness of our toolkit, we manually annotated two distinct datasets. Our experiments demonstrate that the…
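
The two-stage idea described in the abstract (a symbolic or numeric check first, with an optional LLM as a fallback) can be illustrated with a minimal sketch. This is not the toolkit's actual API: the function names `parse_number` and `is_equivalent` and the `llm_judge` callback are hypothetical, and Python's standard-library `fractions` module stands in for a real computer algebra system so that the example stays self-contained.

```python
from fractions import Fraction
from typing import Callable, Optional


def parse_number(text: str) -> Optional[Fraction]:
    """Parse answers such as '0.5', '1/2', or '1,000' into an exact Fraction.

    Returns None when the text is not a plain number. A real CAS would
    handle symbolic expressions here; Fraction is only a stand-in.
    """
    cleaned = text.strip().replace(",", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def is_equivalent(
    pred: str,
    gold: str,
    llm_judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Decide whether a predicted answer matches the reference answer.

    Stage 1: exact string match after trimming whitespace.
    Stage 2: exact numeric comparison when both sides parse as numbers.
    Stage 3: defer to the optional (hypothetical) LLM judge, if provided.
    """
    if pred.strip() == gold.strip():
        return True

    pred_value = parse_number(pred)
    gold_value = parse_number(gold)
    if pred_value is not None and gold_value is not None:
        return pred_value == gold_value

    if llm_judge is not None:
        return llm_judge(pred, gold)
    return False
```

With this sketch, `is_equivalent("0.5", "1/2")` succeeds at the numeric stage even though the strings differ, while a pair such as `"x+1"` and `"1+x"` is rejected unless a judge callback is supplied. That second case is the kind of gap a genuine CAS or an LLM is meant to close.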

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Open Education and E-Learning