OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

Xinyu Zhang; Boxuan Zhang; Yuchen Wan; Lingling Zhang; YiXing Yao; Bifan Wei; Yaqiang Wu; Jun Liu

arXiv:2604.21510·cs.CL·April 24, 2026

TL;DR

OptiVerse is a benchmark of 1,000 curated optimization problems for evaluating Large Language Models on domains that existing benchmarks neglect, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control.

Contribution

The paper introduces OptiVerse, a broad benchmark covering multiple optimization domains and difficulty levels, and proposes a Dual-View Auditor Agent to enhance LLM performance.

Findings

01 LLM performance degrades sharply on hard problems.

02 Even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy on hard problems.

03 Error analysis identifies modeling and logic errors as the primary bottleneck.

Abstract

While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. Experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling & logic errors remain the primary bottleneck. Consequently, we propose a…
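To make the per-difficulty accuracy figures concrete, here is a minimal Python sketch of how accuracy could be tallied separately for Easy, Medium, and Hard problems. The record fields (`difficulty`, `predicted`, `reference`), the relative-tolerance rule in `is_correct`, and the toy records are all assumptions made for illustration; they are not taken from the paper or from the OptiVerse dataset schema, and the numbers are not benchmark results.

```python
from collections import defaultdict


def is_correct(predicted, reference, rel_tol=1e-4):
    """Hypothetical scoring rule: the predicted optimum must match the
    reference within a relative tolerance. The paper's rule may differ."""
    if predicted is None:  # the model produced no usable answer
        return False
    scale = max(abs(reference), 1.0)
    return abs(predicted - reference) <= rel_tol * scale


def accuracy_by_difficulty(records):
    """Return {difficulty level: fraction of problems answered correctly}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        level = record["difficulty"]
        totals[level] += 1
        if is_correct(record["predicted"], record["reference"]):
            hits[level] += 1
    return {level: hits[level] / totals[level] for level in totals}


# Toy records, invented purely to exercise the functions above.
toy_records = [
    {"difficulty": "Easy", "predicted": 12.0, "reference": 12.0},
    {"difficulty": "Easy", "predicted": 7.5, "reference": 7.5},
    {"difficulty": "Medium", "predicted": 3.0, "reference": 3.0},
    {"difficulty": "Medium", "predicted": 4.0, "reference": 5.0},
    {"difficulty": "Hard", "predicted": None, "reference": 42.0},
    {"difficulty": "Hard", "predicted": 10.0, "reference": 10.0},
]

scores = accuracy_by_difficulty(toy_records)
print(scores)
```

Reporting accuracy per level rather than as a single aggregate is what makes a drop on hard problems visible, since an overall average can hide it behind strong results on easier problems.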

Code & Models

Datasets

Waicheng/OptiVerse
dataset · 58 downloads
