OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
Xinyu Zhang, Boxuan Zhang, Yuchen Wan, Lingling Zhang, YiXing Yao, Bifan Wei, Yaqiang Wu, Jun Liu

TL;DR
OptiVerse is a benchmark of 1,000 curated optimization problems, spanning neglected domains such as Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, for evaluating the reasoning and problem-solving abilities of Large Language Models.
Contribution
The paper introduces OptiVerse, a benchmark covering multiple optimization domains at three difficulty levels (Easy, Medium, and Hard), and proposes a Dual-View Auditor Agent to enhance LLM performance.
Findings
Across 22 LLMs of different sizes, performance degrades sharply on hard problems.
Even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy on hard problems.
Error analysis identifies modeling and logic errors as the primary bottleneck.
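The per-difficulty accuracy figures behind these findings could be tallied roughly as follows. This is a minimal sketch, not the paper's evaluation code: the record layout, the `is_correct` helper, and the relative-tolerance matching rule are all assumptions made for illustration, and the sample records are made up.

```python
import math
from collections import defaultdict


def is_correct(predicted, reference, rel_tol=1e-4):
    """Hypothetical scoring rule: the predicted objective value matches the
    reference within a relative tolerance. The benchmark's real criterion
    is not stated on this page."""
    if predicted is None:  # the model produced no usable answer
        return False
    return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=1e-9)


def accuracy_by_difficulty(records):
    """Return {difficulty level: fraction of problems answered correctly}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        level = record["difficulty"]
        totals[level] += 1
        if is_correct(record["predicted"], record["reference"]):
            hits[level] += 1
    return {level: hits[level] / totals[level] for level in totals}


# Made-up illustrative records, not results from the paper.
records = [
    {"difficulty": "Easy", "predicted": 12.0, "reference": 12.0},
    {"difficulty": "Easy", "predicted": 7.5, "reference": 7.5},
    {"difficulty": "Hard", "predicted": 3.1, "reference": 4.0},
    {"difficulty": "Hard", "predicted": None, "reference": 9.0},
    {"difficulty": "Hard", "predicted": 2.0, "reference": 2.0},
    {"difficulty": "Hard", "predicted": 5.0, "reference": 6.0},
]

print(accuracy_by_difficulty(records))
```

Reporting accuracy separately per level, rather than as one pooled number, is what makes a drop on hard problems visible.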
Abstract
While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling & logic errors remain the primary bottleneck. Consequently, we propose a…
