How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling
Yuhang Liu, Heyan Huang, Yizhe Yang, Hongyan Zhao, Zhizhuo Zeng, and Yang Gao

TL;DR
This paper evaluates how well large language models solve complex mathematical modeling problems end-to-end. It finds that they are strong at understanding a problem (identification and formulation) but persistently weak in the execution stages (solving, coding, and analysis), even at larger model scales.
Contribution
It introduces a stage-wise evaluation framework aligned with expert judgment, highlighting specific performance gaps in LLMs for real-world problem solving.
Findings
LLMs perform well in problem identification and formulation.
Persistent deficiencies exist in solving, coding, and analysis stages.
These execution-stage gaps remain even as model scale increases.
Abstract
Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they…
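To make the idea of stage-wise scoring and alignment with expert judgment concrete, here is a minimal sketch. It is an illustration only, not the authors' implementation: the stage names are taken from the findings above, the Spearman rank correlation is an assumed alignment measure (the abstract does not say which statistic the paper uses), and all function names are hypothetical.

```python
from statistics import mean

# Stage names follow the summary above; the exact stage list used in the
# paper's framework is an assumption here.
STAGES = ["identification", "formulation", "solving", "coding", "analysis"]


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(auto_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumed alignment measure between automatic and expert scores.
    """
    return pearson(ranks(auto_scores), ranks(human_scores))


def stage_means(records):
    """Average score per stage over a list of {stage: score} records."""
    return {stage: mean(r[stage] for r in records) for stage in STAGES}
```

Under these assumptions, `stage_means` would expose a comprehension-execution gap as higher averages for `identification` and `formulation` than for `solving`, `coding`, and `analysis`, while `spearman` applied to the automatic and expert scores for the same set of solutions gives one number summarizing how well the automatic scorer tracks expert ranking.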
