HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

Hongzheng Chen; Yingheng Wang; Yaohui Cai; Hins Hu; Jiajie Li; Shirley Huang; Chenhui Deng; Rongjian Liang; Shufeng Kong; Haoxing Ren; Samitha Samaranayake; Carla P. Gomes; Zhiru Zhang

arXiv:2506.07972·cs.LG·January 29, 2026

TL;DR

HeuriGym is a new benchmark framework for evaluating LLM-generated heuristics in combinatorial optimization, highlighting current limitations and providing a metric to assess solution quality and effectiveness.

Contribution

The paper introduces HeuriGym, an agentic benchmark in which LLMs generate, evaluate, and refine heuristics for combinatorial optimization problems, together with a new performance metric, the Quality-Yield Index (QYI).
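To make the QYI idea concrete, here is a minimal sketch under stated assumptions. This page does not give the formula, so the sketch assumes QYI combines yield (the fraction of instances with a feasible solution) and quality (the mean ratio of expert cost to LLM cost, capped at 1) through an F-score-style weighted harmonic mean. The function name, argument layout, and `beta` parameter are illustrative; the paper holds the authoritative definition.

```python
from typing import Optional, Sequence


def quality_yield_index(
    costs: Sequence[Optional[float]],
    expert_costs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Hypothetical QYI: weighted harmonic mean of quality and yield.

    costs[i] is the objective value (lower is better) of the LLM heuristic
    on instance i, or None when no feasible solution was produced.
    expert_costs[i] is the expert baseline on the same instance.
    """
    if not costs or len(costs) != len(expert_costs):
        raise ValueError("costs and expert_costs must be non-empty and aligned")

    feasible = [(c, e) for c, e in zip(costs, expert_costs) if c is not None]
    if not feasible:
        return 0.0

    # Yield: share of instances that produced a feasible solution.
    yield_rate = len(feasible) / len(costs)

    # Quality: expert-to-LLM cost ratio, capped at 1 (assumed definition).
    ratios = [min(1.0, e / c) if c > 0 else 1.0 for c, e in feasible]
    quality = sum(ratios) / len(ratios)

    denom = beta**2 * quality + yield_rate
    if denom == 0:
        return 0.0
    return (1 + beta**2) * quality * yield_rate / denom
```

Under this assumed definition, a heuristic that matches the expert on every instance scores 1, which fits the page's framing of model scores as falling below an expert baseline.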

Findings

01

Even the top models achieve QYI scores of only 0.6, well below the expert baseline of 1.

02

Persistent limitations in tool use, planning, and adaptive reasoning are observed.

03

HeuriGym is open-source and aims to improve LLM problem-solving capabilities.

Abstract

While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use,…
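The propose, execute, and refine cycle described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not HeuriGym's actual implementation: `run_candidate`, `refine_loop`, and the `propose` and `score` callbacks are hypothetical names, and the candidate program is assumed to print its result to standard output.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional, Tuple


def run_candidate(source: str, timeout_s: float = 10.0) -> dict:
    """Execute candidate Python source in a subprocess and capture its output."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(source)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


def refine_loop(
    propose: Callable[[str], str],   # hypothetical: feedback text -> new source
    score: Callable[[str], float],   # hypothetical: program stdout -> cost
    max_iters: int = 3,
) -> Optional[Tuple[float, str]]:
    """Return the (cost, source) pair with the lowest cost, or None."""
    feedback = ""
    best: Optional[Tuple[float, str]] = None
    for i in range(max_iters):
        source = propose(feedback)
        result = run_candidate(source)
        if result["ok"]:
            value = score(result["stdout"])
            if best is None or value < best[0]:
                best = (value, source)
            feedback = f"iteration {i}: score={value}"
        else:
            # Pass only the tail of the error trace back to the proposer.
            feedback = f"iteration {i}: error: {result['stderr'][-500:]}"
    return best
```

In a real setting, `propose` would wrap an LLM call and `score` would wrap a problem-specific verifier and evaluator; both are stubs here.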

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. I personally appreciate that this benchmark supports not only Python but also C++. Most other heuristic-generation benchmarks are limited to Python, so this cross-language support is valuable for real-world deployment scenarios and makes the benchmark more practical for diverse research settings. The released codebase is also clearly structured and of good quality, which will be appreciated by the community.
2. The problem selection process is interesting and reasonable. I was interested by

Weaknesses

1. One concern is about the token usage. In the feedback loop: “After each iteration, we log the LLM-generated solution, execution trace, verification result, and evaluation score.” I feel that it appears token-inefficient and may not scale well to larger problems or longer iterations. Appendix E.9 shows multimillion-token runs, confirming substantial computational overhead. A more sustainable design might summarize or structure feedback (e.g., key errors, constraint metrics) rather than concate

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- It is a well-motivated benchmark that challenges the models in their critical agentic capabilities such as tool-augmented reasoning, multi-step planning, and instruction following.
- The benchmark involves well-defined continuous objectives, large solution spaces, and agentic settings. They are suitable for benchmarking current fast-evolving LLMs.
- The paper conducts an extensive evaluation and verifies that the benchmark is challenging for SOTA models.
- The paper reveals current limitations

Weaknesses

- In Table 4, you use the metric SOLVE@10. Does this mean only 10 generations are allowed for LLM+EA frameworks? This may be an unreasonable budget for EA frameworks, which typically require more iterations to achieve performance gains. Also, is it possible to incorporate feedback from your benchmark into these EA frameworks for a fairer comparison?
- Is it possible to include black-box problems, as emphasized in ReEvo, to test LLMs’ generalizable reasoning capabilities without relying on inter

Reviewer 03 · Rating 6 · Confidence 5

Strengths

LLM-driven heuristic design benchmarking is vital for both algorithm development and LLM communities, serving as an open-ended, challenging, and evaluative benchmark. Three stages with new criteria are designed for this benchmark.

Weaknesses

Further clarification is needed regarding the methodology for designing the benchmark, specifically concerning the tasks, prompts, and evaluation procedures. As a benchmark paper, more comprehensive results, including a greater number of iterations and diverse prompt strategies, are expected.

Code & Models

Repositories

cornell-zhang/heurigym
Official

Taxonomy

Topics: Constraint Satisfaction and Optimization · Machine Learning in Materials Science · Multimodal Machine Learning Applications