OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling
Zhicheng Yang, Yiwei Wang, Yinya Huang, Zhijiang Guo, Wei Shi, Xiongwei Han, Liang Feng, Linqi Song, Xiaodan Liang, Jing Tang

TL;DR
This paper introduces OptiBench, a comprehensive benchmark for evaluating LLMs on complex optimization problems, and proposes ReSocratic, a data synthesis method that enhances open-source models' performance through fine-tuning.
Contribution
The paper presents OptiBench for realistic optimization problem evaluation and ReSocratic, a novel data synthesis approach for improving open-source LLMs in optimization tasks.
Findings
The ReSocratic-29k dataset significantly boosts open-source LLM performance.
OptiBench covers diverse linear and nonlinear optimization problems.
Fine-tuning with ReSocratic-29k enhances LLM problem-solving abilities.
Abstract
Large language models (LLMs) have exhibited their problem-solving abilities in mathematical reasoning. Solving realistic optimization (OPT) problems in application scenarios requires advanced and applied mathematics ability. However, current OPT benchmarks, which cover only linear programming, are far from complex realistic situations. In this work, we propose OptiBench, a benchmark for end-to-end optimization problem-solving with human-readable inputs and outputs. OptiBench contains rich optimization problems, including linear and nonlinear programming with or without tabular data, which can comprehensively evaluate LLMs' solving ability. In our benchmark, LLMs are required to call a code solver to provide precise numerical answers. Furthermore, to alleviate the data scarcity for optimization problems, and to bridge the gap between open-source LLMs on a small scale (e.g., Llama-3-8b)…
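To make the "call a code solver" requirement concrete, here is a minimal sketch of the kind of program a model might emit for a small word problem. The workshop problem, its coefficients, and the function name `solve_workshop` are all hypothetical and do not come from OptiBench. Exhaustive enumeration over a small integer grid stands in for a real solver call, so the example stays dependency-free.

```python
from itertools import product


def solve_workshop():
    """Hypothetical problem: a workshop builds chairs and tables.

    Profit: 30 per chair, 50 per table.
    Labor:  2 h per chair, 4 h per table, 40 h available.
    Wood:   3 units per chair, 2 units per table, 36 units available.
    Goal:   maximize profit with non-negative integer quantities.
    """
    best_profit, best_plan = -1, None
    # Upper bounds follow from each constraint taken alone:
    # 3 * chairs <= 36 gives chairs <= 12, 4 * tables <= 40 gives tables <= 10.
    for chairs, tables in product(range(13), range(11)):
        labor_ok = 2 * chairs + 4 * tables <= 40
        wood_ok = 3 * chairs + 2 * tables <= 36
        if labor_ok and wood_ok:
            profit = 30 * chairs + 50 * tables
            if profit > best_profit:
                best_profit, best_plan = profit, (chairs, tables)
    return best_plan, best_profit


if __name__ == "__main__":
    plan, profit = solve_workshop()
    # Report the answer in a human-readable form.
    print(f"chairs={plan[0]}, tables={plan[1]}, profit={profit}")
```

Enumeration only works for toy instances like this one. Problems with continuous variables, nonlinear terms, or more than a handful of variables would need a dedicated optimization solver in place of the loop.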
Peer Reviews
Decision: ICLR 2025 Poster
OPTIBENCH is a comprehensive benchmark that effectively evaluates the optimization problem-solving abilities of LLMs. It includes nonlinear programming problems, along with tabular data, reflecting realistic scenarios and making the benchmark sufficiently challenging. By requiring LLMs to understand the problem, perform sound reasoning, and generate code to invoke a solver, OPTIBENCH provides a holistic assessment of LLMs' reasoning and coding skills. This general evaluation approach is valuable for meas
The paper lacks a fine-grained error analysis, which could provide valuable insights into the specific challenges faced by LLMs in optimization tasks. A breakdown of error types, such as errors in understanding the problem, formulating the optimization model, or transferring the mathematical model to code, or summarizing the execution into required formatted outputs, would be helpful to guide future improvements in LLM design and training. The authors didn't provide an analysis (maybe just a fe
- This paper proposes a benchmark containing various problems, including linear and non-linear programming with or without tabular data, which can better evaluate the ability of LLMs.
- The reverse data synthesis approach is novel and reasonable.
- The experimental results show that ReSocratic outperforms the forward data synthesis method, and the fine-tuning results are promising.
- The authors may want to generate instances with more constraints and variables, as few instances in the paper have more than 7 variables. This raises my concern about LLMs' ability to model problems with large instance sizes.
- Given that a single optimization problem can have multiple valid formulations, it would be beneficial for the authors to verify the accuracy and equivalence of these formulations with ground-truth ones.
- There are questions regarding the solving efficiency of the
Originality: This work introduces the OPTIBENCH benchmark, a comprehensive tool designed to rigorously evaluate large language models (LLMs) on complex optimization problems, significantly advancing beyond existing benchmarks like MAMO and NLP4LP. Unlike prior efforts that primarily focus on simplistic, linear problems or abstract formulations, OPTIBENCH incorporates nonlinear and tabular data, simulating real-world scenarios more accurately. Additionally, the introduction of the ReSocratic data
## Benchmark Contribution
The present paper introduces OPTIBENCH for the evaluation of large language models with respect to solving optimization problems. However, such benchmarks already exist, for example MAMO, ComplexOR, and NLP4LP, which test LLMs on their ability to interpret natural language descriptions of optimization problems and create corresponding mathematical models. The extension in OPTIBENCH relates to nonlinear elements and tabular data. This feels like an incremental ext
Taxonomy
Topics: Fuzzy Logic and Control Systems · Software Engineering Research · Intelligent Tutoring Systems and Adaptive Learning
Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections
