INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems

Bintao Tang; Xin Yang; Yuhao Wang; Zixuan Qiu; Zimo Ji; Wenyuan Jiang

arXiv:2507.21130·cs.AI·July 30, 2025

TL;DR

INTEGRALBENCH is a focused benchmark for assessing how well large language models solve definite integral problems. It pairs each problem with symbolic and numerical ground-truth solutions and a manual difficulty annotation, so that performance gaps between models can be measured.

Contribution

The paper introduces a benchmark with manually annotated difficulty levels for evaluating LLMs on definite integrals, with the aim of advancing automated mathematical reasoning.

Findings

01 Significant performance gaps among state-of-the-art LLMs.

02 Strong correlation between problem difficulty and model accuracy.

03 Established baseline metrics for integral problem-solving by LLMs.

Abstract

We present INTEGRALBENCH, a focused benchmark designed to evaluate Large Language Model (LLM) performance on definite integral problems. INTEGRALBENCH provides both symbolic and numerical ground truth solutions with manual difficulty annotations. Our evaluation of nine state-of-the-art LLMs reveals significant performance gaps and strong correlations between problem difficulty and model accuracy, establishing baseline metrics for this challenging domain. INTEGRALBENCH aims to advance automated mathematical reasoning by providing a rigorous evaluation framework specifically tailored for definite integral computation.
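The abstract mentions numerical ground truth alongside symbolic solutions. As a rough illustration of how a numerical reference could be used to grade an answer, the sketch below integrates a toy problem with composite Simpson's rule and compares a candidate value against it under a relative tolerance. The function names `simpson` and `is_correct`, the tolerance, and the toy integral are all assumptions made for illustration; the abstract does not describe the benchmark's actual grading procedure.

```python
import math


def simpson(f, a, b, n=1000):
    """Approximate the integral of f over [a, b] with composite Simpson's rule."""
    if n % 2:  # Simpson's rule needs an even number of subintervals
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd) and 2 (even).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def is_correct(model_value, reference_value, rel_tol=1e-6):
    """Hypothetical grading rule: accept answers within a relative tolerance."""
    return math.isclose(model_value, reference_value, rel_tol=rel_tol, abs_tol=1e-12)


# Toy problem (not from the benchmark): the integral of sin(x) over [0, pi],
# whose symbolic value is exactly 2.
reference = simpson(math.sin, 0.0, math.pi)

print(is_correct(2.0, reference))   # a candidate that matches the symbolic answer
print(is_correct(1.99, reference))  # a candidate that is off by 0.5%
```

A tolerance-based comparison like this only checks the final value; a symbolic ground truth would additionally allow checking closed-form answers for equivalence, which this sketch does not attempt.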
