DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

Xiaoyang Chen; Xinan Dai; Yu Du; Qian Feng; Naixu Guo; Tingshuo Gu; Yuting Gao; Yingyi Gao; Xudong Han; Xiang Jiang; Yilin Jin; Hongyi Lin; Shisheng Lin; Xiangnan Li; Yuante Li; Yixing Li; Zhentao Lai; Zilu Ma; Yingrong Peng; Jiacheng Qian; Hao-Yu Sun; Jianbo Sun; Zirui Wang; Siwei Wu; Zian Wang; Bin Xu; Jianghao Xu; Yiyang Yu; Zichuan Yang; Hongji Zha; Ruichong Zhang

arXiv:2505.08744·cs.AI·May 14, 2025


TL;DR

This paper introduces DeepMath-Creative, a benchmark for assessing the mathematical creativity of large language models, revealing current models' limited creative problem-solving abilities across various mathematical domains.

Contribution

It presents a new benchmark dataset for evaluating mathematical creativity in LLMs and systematically assesses mainstream models' performance on constructive problems.

Findings

01. Models achieve up to 70% accuracy on basic tasks.

02. Performance drops significantly on complex problems.

03. Current models rely on pattern recombination rather than true creativity.

Abstract

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs'…


Code & Models

Repositories

deepmathllm/deepmath (official)
