ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling

Zhuohan Wang; Ziwei Zhu; Ziniu Li; Congliang Chen; Yizhou Han; Yufeng Lin; Zhihang Lin; Angyang Gu; Xinglin Hu; Ruoyu Sun; Tian Ding

arXiv:2510.27610·cs.LG·November 3, 2025

TL;DR

ORGEval introduces a graph-theoretic framework to evaluate LLMs in optimization modeling, offering a robust, efficient alternative to solver-based methods by detecting model equivalence through graph isomorphism tests.

Contribution

This work presents ORGEval, a novel graph-based evaluation method for LLMs in optimization modeling, including a theoretical guarantee for symmetric decomposable (SD) graphs and a new benchmark dataset.

Findings

01

ORGEval achieves 100% consistency in model equivalence detection.

02

It significantly outperforms solver-based methods in runtime.

03

DeepSeek-V3 and Claude-Opus-4 excel in optimization modeling accuracy.

Abstract

Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs' capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition, when the tested graphs are symmetric decomposable (SD), under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism. Building on this, ORGEval…
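The reduction described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the bipartite encoding (one node per constraint and per variable, edges weighted by coefficients), the node features, and the function names `lp_to_graph` and `wl_indistinguishable` are all assumptions made for this example. It runs 1-dimensional WL color refinement on two graphs with a shared color palette and compares the resulting color histograms. It does not check the SD condition, so a `True` result only means the WL test could not tell the two models apart.

```python
from collections import Counter

def lp_to_graph(c, A, b, senses, is_int):
    """Encode an LP/MILP as a weighted bipartite graph (hypothetical encoding).

    Nodes 0..m-1 are constraints, nodes m..m+n-1 are variables.
    Returns (node features, adjacency lists of (neighbor, coefficient)).
    """
    n_con, n_var = len(b), len(c)
    feats = [("con", b[i], senses[i]) for i in range(n_con)]
    feats += [("var", c[j], is_int[j]) for j in range(n_var)]
    nbrs = [[] for _ in feats]
    for i in range(n_con):
        for j in range(n_var):
            if A[i][j] != 0:  # zero coefficients produce no edge
                nbrs[i].append((n_con + j, A[i][j]))
                nbrs[n_con + j].append((i, A[i][j]))
    return feats, nbrs

def wl_indistinguishable(g1, g2):
    """Return False if 1-WL refinement separates the graphs, else True."""
    (feats1, nbrs1), (feats2, nbrs2) = g1, g2
    if len(feats1) != len(feats2):
        return False
    palette = {}  # shared signature -> integer color, so colors are comparable

    def color_of(signature):
        return palette.setdefault(signature, len(palette))

    def refine(nbrs, cols):
        # New color = own color plus the sorted multiset of (neighbor color, weight).
        return [
            color_of((cols[v], tuple(sorted((cols[u], w) for u, w in nbrs[v]))))
            for v in range(len(cols))
        ]

    cols1 = [color_of(f) for f in feats1]
    cols2 = [color_of(f) for f in feats2]
    for _ in range(len(feats1)):  # at most one round per node is needed
        cols1, cols2 = refine(nbrs1, cols1), refine(nbrs2, cols2)
        if Counter(cols1) != Counter(cols2):
            return False
    return True

# The same model written with variables and constraints in a different order.
g_ref = lp_to_graph([1, 2], [[1, 1], [1, -1]], [4, 1], ["<=", ">="], [False, True])
g_perm = lp_to_graph([2, 1], [[-1, 1], [1, 1]], [1, 4], [">=", "<="], [True, False])
# A model with one coefficient changed.
g_bad = lp_to_graph([1, 2], [[1, 1], [1, -2]], [4, 1], ["<=", ">="], [False, True])

print(wl_indistinguishable(g_ref, g_perm))
print(wl_indistinguishable(g_ref, g_bad))
```

Sharing one palette across both graphs keeps the integer colors comparable between them, and mapping each signature to a small integer avoids the exponential growth that nested signatures would otherwise cause.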

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* I do agree that there are flaws in checking optimal values to determine the correctness of problem formulation. To my knowledge, this paper is the first to address this important issue.
* The authors attempt to rigorously define the model equivalence and various related concepts.
* Some definitions need clarification, though. Please see my comments in "Weaknesses".
* The authors constructed a new benchmark dataset to check formulation correctness based on the proposed model equivalence metri

Weaknesses

**Soundness and Clarity of Various Definitions.** The authors made several definitions in terms of model $\mathcal{M}$ (`Definition 2`), model isomorphism (`Definition 5`), and graph (`Definition 7`). It is important to make these definitions consistent and coherent. Below are some comments and questions.
* It is not clear what the exact definition of *formulation correctness* is, namely the equivalence between the ground-truth formulation and the formulation given by an LLM. If I understand corre

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Authors made a fair case that current solvers are not efficient or consistent in identifying equivalence in optimization problems, due to difficulty in handling different parameter configurations, computational costs, and inability to handle infeasible instances
- Experiments demonstrated that their methodology works for this subclass of problems, with 100% consistency across random parameter configurations
- The theoretical characterization of SD graphs and proof that WL-test correctly determ

Weaknesses

- Limited contribution. Essentially, the authors added a check for symmetric decomposability on top of a Weisfeiler–Lehman (WL) test to check whether two optimization models are equivalent. The core pipeline uses the graph representation and WL test from previous work, with the addition of SD detection being the only novel algorithmic component.
- Limited testing on other datasets; the method was only evaluated on its own benchmark.
- The benchmark dataset is also relatively small (394), and selection / dataset constr

Reviewer 03 · Rating 2 · Confidence 4

Strengths

Interesting and relevant topic: using LLMs for the translation task in optimization problems. Introduction of a new dataset is always a great contribution to the community. There is rigor to the new SD condition.

Weaknesses

I liked reading this paper but fail to see its significance. At a high level, the paper moves the assumption around from knowing the optimality to knowing the model. Still, this requires knowing a ground-truth model, which is arguably the most daunting part of verification. The difficult part in model translation is having the ground truth, either in the form of a model OR the optimal value that it produces. Only after either of them is known can one check the correctness. IF the grou

Reviewer 04 · Rating 4 · Confidence 4

Strengths

This paper innovatively uses graphs to determine whether the optimization problem modeling is correct.

Weaknesses

- Is it reliable to judge the correctness of a model by examining its graph structure? Are there complex optimization problems where different modeling methods exist, but the final result is always correct? For example, some problems in combinatorial optimization, or problems with duality.
- For complex problems, correct modeling and correct solution are not equivalent. For example, some complex problems may have time complexity issues when solving them, and simple modeling methods may prevent t

Reviewer 05 · Rating 4 · Confidence 3

Strengths

The authors introduced a new evaluation paradigm that shifts from solver-based numerical checks to structural evaluation via graph isomorphism. This shows better runtime efficiency and consistency compared to solver-based baselines on their benchmark. The Bench4Opt benchmark is the first model-data separated benchmark for optimization modeling, and provides a valuable resource for evaluating LLMs' modeling capabilities.

Weaknesses

1. The WL-based decision is guaranteed correct only when the symmetric decomposable (SD) condition holds, and all Bench4Opt problems happen to satisfy SD. This risks overestimating real-world coverage if SD is less common.
2. Equivalence over model–data separation is approximated by testing five random parameter draws, but this is under-justified.
3. The evaluation framework relies on strict graph isomorphism, and can unfairly penalize mathematically correct but structurally different formulatio
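A small illustrative pair, not taken from the paper or the review, shows the kind of case the third point describes. The slack variable $s$ and both formulations are assumptions chosen for the example, and whether ORGEval normalizes such cases before comparison is not stated in the text available here.

```latex
\begin{align*}
\text{(A)}\quad & \max\; x_1 + x_2 \quad \text{s.t.}\quad x_1 + x_2 \le 1, \quad x_1, x_2 \in \{0,1\} \\
\text{(B)}\quad & \max\; x_1 + x_2 \quad \text{s.t.}\quad x_1 + x_2 + s = 1, \quad x_1, x_2 \in \{0,1\}, \; s \ge 0
\end{align*}
```

Both formulations admit the same values of $(x_1, x_2)$ and the same optimal value of 1, but (B) has three variables instead of two. Under any encoding that assigns one node per variable, the two graphs have different node counts and therefore cannot be isomorphic.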

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications