HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Hongzheng Chen, Yingheng Wang, Yaohui Cai, Hins Hu, Jiajie Li, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang

TL;DR
HeuriGym is a new benchmark framework for evaluating LLM-generated heuristics in combinatorial optimization, highlighting current limitations and providing a metric to assess solution quality and effectiveness.
Contribution
The paper introduces HeuriGym, an agentic benchmark for LLMs to generate, evaluate, and refine heuristics in optimization problems, with a novel performance metric QYI.
Findings
Even the top models achieve QYI scores of only 0.6, below the expert baseline.
Persistent limitations in tool use, planning, and adaptive reasoning are observed.
HeuriGym is open-source and aims to improve LLM problem-solving capabilities.
Abstract
While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use,…
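The propose, execute, and refine cycle described in the abstract can be illustrated with a minimal sketch. This is not HeuriGym's code: the toy knapsack instance, the `evaluate` and `refine_loop` functions, and the `make_stub_proposer` stand-in for an LLM are all hypothetical names introduced here for illustration. The stub returns three canned programs (one that crashes, one that violates the capacity constraint, one greedy heuristic) so the loop can be run without any model.

```python
import traceback
from typing import Callable, List, Tuple

# Toy instance (an assumption for illustration): (value, weight) pairs.
ITEMS: List[Tuple[int, int]] = [(60, 10), (100, 20), (120, 30)]
CAPACITY = 50


def evaluate(source: str) -> Tuple[float, str]:
    """Execute a candidate program and return (score, feedback text)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # run the candidate program
        chosen = list(namespace["solve"](ITEMS, CAPACITY))
    except Exception:
        # Execution stage failed: report the final traceback line.
        last = traceback.format_exc().strip().splitlines()[-1]
        return 0.0, "execution error: " + last
    # Verification stage: indices must be valid, unique, and within capacity.
    if any(not isinstance(i, int) or not 0 <= i < len(ITEMS) for i in chosen):
        return 0.0, "verification failed: invalid indices"
    if len(set(chosen)) != len(chosen):
        return 0.0, "verification failed: duplicate indices"
    weight = sum(ITEMS[i][1] for i in chosen)
    if weight > CAPACITY:
        return 0.0, f"verification failed: weight {weight} exceeds capacity {CAPACITY}"
    # Evaluation stage: score the feasible solution.
    value = sum(ITEMS[i][0] for i in chosen)
    return float(value), f"valid solution with value {value}"


def refine_loop(propose: Callable[[str], str], iterations: int) -> Tuple[float, List[str]]:
    """Ask for a program, run it, and pass the feedback into the next request."""
    best, feedback, log = 0.0, "no feedback yet", []
    for _ in range(iterations):
        source = propose(feedback)
        score, feedback = evaluate(source)
        log.append(feedback)
        best = max(best, score)
    return best, log


def make_stub_proposer() -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays three canned programs."""
    candidates = iter([
        # 1) crashes at run time
        "def solve(items, cap):\n    return undefined_name\n",
        # 2) runs, but takes every item and exceeds the capacity
        "def solve(items, cap):\n    return list(range(len(items)))\n",
        # 3) greedy by value/weight ratio
        "def solve(items, cap):\n"
        "    order = sorted(range(len(items)),\n"
        "                   key=lambda i: items[i][0] / items[i][1], reverse=True)\n"
        "    chosen, used = [], 0\n"
        "    for i in order:\n"
        "        if used + items[i][1] <= cap:\n"
        "            chosen.append(i)\n"
        "            used += items[i][1]\n"
        "    return chosen\n",
    ])
    return lambda feedback: next(candidates)


if __name__ == "__main__":
    best, log = refine_loop(make_stub_proposer(), iterations=3)
    print(best)
    for entry in log:
        print(entry)
```

In this sketch the greedy program is feasible but not optimal for the toy instance (taking the second and third items would be worth more), which is the kind of gap a quality score is meant to expose. A real setup would replace the stub with a model call and sandbox the execution step rather than using a bare `exec`.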
Peer Reviews
Decision: ICLR 2026 Poster
1. I personally appreciate that this benchmark supports not only Python but also C++. Most other heuristic-generation benchmarks are limited to Python, so this cross-language support is valuable for real-world deployment scenarios and makes the benchmark more practical for diverse research settings. The released codebase is also clearly structured and of good quality, which will be appreciated by the community.
2. The problem selection process is interesting and reasonable. I was interested by
1. One concern is about the token usage. In the feedback loop: “After each iteration, we log the LLM-generated solution, execution trace, verification result, and evaluation score.” I feel that it appears token-inefficient and may not scale well to larger problems or longer iterations. Appendix E.9 shows multimillion-token runs, confirming substantial computational overhead. A more sustainable design might summarize or structure feedback (e.g., key errors, constraint metrics) rather than concate
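The reviewer's suggestion of structured feedback can be made concrete with a small sketch. This is an illustration of the reviewer's idea only, not a description of how HeuriGym builds its prompts: the `IterationRecord` fields and the `summarize` function are hypothetical, and the caps on trace lines and violations are arbitrary defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IterationRecord:
    """Hypothetical per-iteration log entry."""
    solution: str          # generated program text
    trace: str             # raw execution output
    violations: List[str]  # constraint violations found during verification
    score: float           # evaluation score


def summarize(record: IterationRecord, max_trace_lines: int = 3,
              max_violations: int = 5) -> str:
    """Build a compact feedback message instead of echoing the full log."""
    nonblank = [line for line in record.trace.splitlines() if line.strip()]
    # Keep only the tail of the trace, where errors usually surface.
    trace_tail = nonblank[-max_trace_lines:] if max_trace_lines > 0 else []
    shown = record.violations[:max_violations]
    hidden = len(record.violations) - len(shown)

    lines = [f"score: {record.score:.3f}",
             f"violations: {len(record.violations)}"]
    lines += [f"  - {v}" for v in shown]
    if hidden > 0:
        lines.append(f"  ... {hidden} more not shown")
    if trace_tail:
        lines.append("trace (last lines):")
        lines += [f"  {line}" for line in trace_tail]
    return "\n".join(lines)
```

The generated program itself is deliberately left out of the summary, on the assumption that the model already has it in context. Whether such truncation loses information the model needs is an empirical question this sketch does not answer.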
- It is a well-motivated benchmark that challenges the models in their critical agentic capabilities such as tool-augmented reasoning, multi-step planning, and instruction following.
- The benchmark involves well-defined continuous objectives, large solution spaces, and agentic settings. They are suitable for benchmarking current fast-evolving LLMs.
- The paper conducts an extensive evaluation and verifies that the benchmark is challenging for SOTA models.
- The paper reveals current limitations
- In Table 4, you use the metric SOLVE@10. Does this mean only 10 generations are allowed for LLM+EA frameworks? This may be an unreasonable budget for EA frameworks, which typically require more iterations to achieve performance gains. Also, is it possible to incorporate feedback from your benchmark into these EA frameworks for a fairer comparison?
- Is it possible to include black-box problems, as emphasized in ReEvo, to test LLMs’ generalizable reasoning capabilities without relying on inter
LLM-driven heuristic design benchmarking is vital for both algorithm development and LLM communities, serving as an open-ended, challenging, and evaluative benchmark. Three stages with new criteria are designed for this benchmark.
Further clarification is needed regarding the methodology for designing the benchmark, specifically concerning the tasks, prompts, and evaluation procedures. As a benchmark paper, more comprehensive results, including a greater number of iterations and diverse prompt strategies, are expected.
Taxonomy
Topics: Constraint Satisfaction and Optimization · Machine Learning in Materials Science · Multimodal Machine Learning Applications
