InnoGym: Benchmarking the Innovation Potential of AI Agents
Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang

TL;DR
InnoGym is a novel benchmark framework that evaluates AI agents not only on correctness but also on their innovation potential through metrics of performance gain and novelty across diverse scientific and engineering tasks.
Contribution
This work introduces InnoGym, the first benchmark to systematically assess the originality and improvement of AI methods, emphasizing the importance of innovation in scientific AI progress.
Findings
Some agents generate novel approaches but lack robustness.
A gap exists between creativity and effectiveness in AI agents.
The benchmark highlights the need to evaluate both innovation and performance.
Abstract
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a…
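The two metrics named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. The formulas below are assumptions chosen for illustration: performance gain is taken as the relative improvement over the best-known score, and novelty as the minimum distance from a candidate solution to any prior solution. The function names, the vector representation of a solution, and the use of cosine distance are all hypothetical stand-ins for whatever distance function the benchmark actually uses.

```python
from math import sqrt
from typing import Callable, Sequence

Vector = Sequence[float]


def performance_gain(candidate: float, best_known: float,
                     higher_is_better: bool = True) -> float:
    """Relative improvement over the best-known score (assumed normalization)."""
    if best_known == 0:
        raise ValueError("best_known must be non-zero for a relative gain")
    diff = candidate - best_known
    if not higher_is_better:
        diff = -diff  # for loss-like metrics, a lower score is an improvement
    return diff / abs(best_known)


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; a placeholder distance between solution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def novelty(candidate: Vector, priors: Sequence[Vector],
            distance: Callable[[Vector, Vector], float] = cosine_distance) -> float:
    """Distance to the closest prior solution (assumed aggregation: minimum)."""
    if not priors:
        raise ValueError("at least one prior solution is required")
    return min(distance(candidate, p) for p in priors)


if __name__ == "__main__":
    # Hypothetical scores and solution vectors, for illustration only.
    print(performance_gain(110.0, 100.0))
    print(novelty([0.0, 1.0], [[1.0, 0.0], [1.0, 1.0]]))
```

Keeping the two quantities separate mirrors the distinction the abstract draws: a solution can score high on novelty while showing a negative performance gain, and the reverse.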
Peer Reviews
Decision: ICLR 2026 Poster
- Addresses an important and underexplored aspect: evaluating the novelty of agent-generated solutions, going beyond mere correctness.
- Constructs a benchmark with 18 real-world "improvable" tasks.
- Provides iGym, a unified and reproducible execution environment that supports long-horizon agent workflows and could benefit the broader research community (if open-sourced).
- Conducts systematic evaluation of three popular agent frameworks (MLAB, CodeAct, AIDE), offering concrete insights into cu
- **Novelty metric reliability**: The embedding distances of text are highly sensitive to presentation style—identical solutions with different implementations could yield large distances. Conversely, highly novel ideas (e.g., dropout, residual connections) may appear minor in textual or architectural distance but represent significant conceptual leaps. These are known challenges in novelty estimation; it is not reasonable to expect that this paper solves them completely, but the paper should di
The creativity evaluation framework, suite of tasks, and gym could be a helpful resource for the AI community. The approach of evaluating the creativity of LLMs and agents through their generated solutions and performance is sound and intuitive.
The creativity evaluation framework appears to involve a heavy amount of computation, since each candidate solution is an individual AI model that must be trained and evaluated separately. Several key details are missing from the main text of the paper, which weakens its clarity and reproducibility: for example, what exactly are the distance functions used for each task (seemingly B.2 COMPARISON PROMPT, but not mentioned in the main text), how are the tasks processed to be used for creativit
1. Evaluating the model performance on novelty, besides performance gain, is interesting and important. In the context of the benchmark, since the solution space can be grounded with known solutions, the evaluation of novelty can be quantified.
2. This paper is well-written and easy to follow, with clear figurative illustrations.
3. The authors show good comparison with existing benchmarks in Table 1 to consolidate the motivation.
4. The analysis, e.g., prior solution, is well-executed.
1. The definitions in Section 2.3 are very strong and not well-grounded. For example, SWE-Bench is defined as a “solved problem”. However, the optimal solutions to the bugs are clearly not defined or estimated. It is also an unsolved/improvable task for agents. On the other hand, for improvable tasks, how is achieving a new state-of-the-art performance defined as novelty? For example, this might be achieved by simply extending the training or using more in-domain data.
2. The evaluation of nove
Taxonomy
Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Machine Learning and Data Classification
