OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems
Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, Kai Chen

TL;DR
This paper introduces OPT-BENCH, a comprehensive benchmark for evaluating large language models on large-scale search space optimization problems, emphasizing iterative reasoning and solution refinement.
Contribution
It presents a new benchmark and an end-to-end optimization framework, OPT-Agent, for assessing LLMs' capabilities in complex, real-world optimization tasks.
Findings
Historical context improves solution quality.
Model performance varies with iterations and temperature.
Open-sourced datasets and tools facilitate further research.
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions through learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT-BENCH, a comprehensive benchmark designed to evaluate LLM agents on large-scale search space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT-Agent, an end-to-end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions through leveraging historical feedback. Through extensive experiments on…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper introduces OPT-Agent, an evaluation pipeline that mimics how modern LLM agents actually operate. Compared with the single-pass inference setups used by many other benchmarks, OPT-Agent enables multi-round refinement by feeding historical feedback and prior attempts back into the context. This design captures agentic problem solving and yields a more informative evaluation.
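The multi-round refinement described above can be sketched as a small loop. This is a minimal illustration, not OPT-Agent's actual implementation: `Attempt`, `refine`, `toy_propose`, and `toy_validate` are hypothetical names, the proposer is a deterministic stand-in for an LLM call, and the toy objective (minimize `(x - 7) ** 2` over integers 0..10) is an assumption chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    solution: int
    score: Optional[float]  # None when the solution failed validation
    feedback: str


def refine(
    propose: Callable[[List[Attempt]], int],
    validate: Callable[[int], Tuple[bool, Optional[float], str]],
    steps: int,
) -> Tuple[Optional[Attempt], List[Attempt]]:
    """Run `steps` rounds, passing the full attempt history to the proposer."""
    history: List[Attempt] = []
    best: Optional[Attempt] = None
    for _ in range(steps):
        solution = propose(history)  # draft, improve, or debug
        ok, score, feedback = validate(solution)
        attempt = Attempt(solution, score if ok else None, feedback)
        history.append(attempt)
        # Keep the valid attempt with the lowest score seen so far.
        if ok and (best is None or score < best.score):
            best = attempt
    return best, history


def toy_validate(x: int) -> Tuple[bool, Optional[float], str]:
    """Hypothetical validator: rejects out-of-range values, else scores them."""
    if not 0 <= x <= 10:
        return False, None, "out of range"
    if x < 7:
        return True, float((x - 7) ** 2), "too low"
    if x > 7:
        return True, float((x - 7) ** 2), "too high"
    return True, 0.0, "optimal"


def toy_propose(history: List[Attempt]) -> int:
    """Hypothetical stand-in for an LLM that reads the last attempt's feedback."""
    if not history:
        return -1  # deliberately invalid first draft
    last = history[-1]
    if last.score is None:
        return 0  # "debug": repair the invalid attempt
    if last.feedback == "too low":
        return last.solution + 2  # "improve" upward
    if last.feedback == "too high":
        return last.solution - 1  # "improve" downward
    return last.solution


best, history = refine(toy_propose, toy_validate, steps=8)
```

In a single-pass setup only the first draft would be scored. Here the invalid first draft is recorded in the history and the later rounds use its feedback to repair and improve the solution.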
1. Limited contributions of the benchmark. For the ML tasks, OPT-BENCH-ML looks similar to MLE-bench, where the metrics focus on comparison to human experts and performance on the test set. It does not provide intrinsic explanations for LLM performance on ML tasks. For example, if OPT-BENCH-ML aims to provide more explainability, it could focus on the diversity of proposed ML solutions, the success rate of correctly implementing an ML algorithm, etc. These low-level evaluations are more
1. The paper attempts to address an underexplored area: the ability of LLMs to iteratively refine solutions based on historical feedback, moving beyond static, single-shot evaluations. This focus on learning from both successes and failures over time is a relevant research direction, as it aims to evaluate a more complex aspect of reasoning that many current benchmarks overlook. 2. The authors have curated a new benchmark, OPT-BENCH, which organizes 20 ML tasks and 10 NP problems into a struct
1. The benchmark's scale is too small for statistical significance. With only 20 ML tasks and 10 NP problems (each with just 5 instances), the benchmark lacks the scale to draw reliable conclusions. This problem is exacerbated by the high heterogeneity of the ML tasks, which use disparate metrics (e.g., RMSE, MAE, ROC AUC). As the authors rightly admit in the appendix, averaging these metrics "may introduce scale inconsistencies". 2. A Flawed Evaluation Paradigm for NP Problems: The paper task
1. The paper addresses an important problem. Most current LLM benchmarks focus on single-turn, static QA or reasoning. Evaluating LLMs on complex, long-horizon optimization tasks that require iterative refinement and learning from historical feedback is a critical next step for LLM agent research. 2. OPT-BENCH's distinguishing feature lies in its combination of two distinct but challenging domains: real-world ML problems (requiring code generation, hyperparameter tuning, and data understanding) and classical N
1. One of the paper's core contributions, OPT-BENCH, is essentially a collection and reformatting of existing problems (Kaggle, classical NP problems), rather than the creation of new, specifically designed evaluation tasks. 2. Compared to benchmarks like MLE-bench (75 tasks), the number of ML tasks in OPT-BENCH (20) seems small. More importantly, the paper does not adequately justify the representativeness of these 20 tasks. Are they primarily biased towards tabular data? Do they cover other ML
- The benchmark covers both ML and NP-hard problems, with clear task definitions, evaluation metrics, and human expert baselines. - OPT-Agent’s workflow (draft, improve, debug) closely mirrors human iterative problem-solving. - The experiments are extensive, covering a wide range of LLMs (proprietary and open-source, 3B–72B parameters), and include ablation studies on temperature and optimization steps. - Results are reported with multiple metrics (Win Count, Buggy Rate, Average Ratio, Improveme
- While OPT-Agent is well-implemented, its core workflow (draft, improve, debug) is conceptually similar to existing agent frameworks. The main novelty lies in the benchmark and evaluation protocol. - The paper notes that historical feedback is less effective for NP problems, as LLMs often fail to incrementally refine solutions and instead generate new ones. More analysis or proposed solutions for this limitation would strengthen the work. - As acknowledged, averaging performance across diverse
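The reviewers' concern about averaging heterogeneous metrics (RMSE, MAE, ROC AUC) can be illustrated with a short sketch. The numbers below are invented for illustration, and `relative_score` is a hypothetical direction-aware ratio against a baseline. It is not claimed to be the paper's "Average Ratio" definition, which the excerpts above do not spell out.

```python
from statistics import mean

# Metrics where a smaller value is better (assumed set for this sketch).
LOWER_IS_BETTER = {"rmse", "mae"}


def relative_score(metric: str, model_value: float, baseline_value: float) -> float:
    """Return a unitless ratio where values above 1.0 mean 'better than baseline'."""
    if model_value <= 0 or baseline_value <= 0:
        raise ValueError("this sketch assumes strictly positive metric values")
    if metric in LOWER_IS_BETTER:
        return baseline_value / model_value
    return model_value / baseline_value


# (metric, model value, baseline value); made-up numbers, not results from the paper.
results = [
    ("rmse", 12000.0, 15000.0),
    ("mae", 0.40, 0.50),
    ("roc_auc", 0.90, 0.80),
]

# Averaging raw values mixes units and directions; the RMSE task dominates.
raw_average = mean(model for _, model, _ in results)

# Averaging direction-aware ratios puts every task on a comparable scale.
normalized = [relative_score(m, model, base) for m, model, base in results]
normalized_average = mean(normalized)
```

With these made-up values the raw average is driven almost entirely by the RMSE task, while each normalized ratio sits in a comparable range near 1. Ratios are only one option, and they still treat a fixed relative gain as equally meaningful on every task, which may not hold in practice.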
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification
