Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Yizhe Chi; Deyao Hong; Dapeng Jiang; Tianwei Luo; Kaisen Yang; Boshi Zhang; Zhe Cao; Xiaoyan Fan; Bingxiang He; Han Hao; Weiyang Jin; Dianqiao Lei; Qingle Liu; Houde Qian; Bowen Wang; Situ Wang; Youjie Zheng; Yifan Zhou; Calvin Xiao; Eren Cai; Qinhuai Na

arXiv:2604.12290·cs.AI·April 28, 2026

TL;DR

Frontier-Eng is a human-verified benchmark of 47 engineering tasks that evaluates AI agents on generative optimization: iteratively proposing candidate designs, running them through simulators and verifiers, and revising them under hard feasibility constraints and a fixed interaction budget.

Contribution

The paper presents a new benchmark built on industrial-grade simulators and verifiers for iterative generative optimization, and reports the challenges and insights it reveals about AI agent performance in engineering.

Findings

01

GPT 5.4 performs most robustly among evaluated models.

02

Improvement frequency and magnitude follow a dual power-law decay.

03

Depth in models is crucial for achieving improvements under fixed budgets.

Abstract

Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, tend to neglect the value of real-world engineering, which is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization -- an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget -- spanning $47$ tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search…
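The propose-execute-evaluate loop described in the abstract can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the names `Feedback`, `generative_optimization`, `toy_verifier`, and `toy_proposer` are hypothetical, the "artifact" is a single float, and the verifier is a toy stand-in for a simulator. It only shows the shape of the loop: a continuous score, a hard feasibility check, and a fixed interaction budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Hypothetical verifier output: hard feasibility flag plus continuous score."""
    feasible: bool
    score: float  # higher is better; only trusted when feasible is True


History = List[Tuple[float, Feedback]]


def generative_optimization(
    propose: Callable[[Optional[float], History], float],
    verify: Callable[[float], Feedback],
    budget: int,
) -> Tuple[Optional[float], Optional[float], History]:
    """Run a propose-execute-evaluate loop for exactly `budget` interactions."""
    best: Optional[float] = None
    best_score: Optional[float] = None
    history: History = []
    for _ in range(budget):
        candidate = propose(best, history)   # agent generates a candidate artifact
        feedback = verify(candidate)         # executable verifier evaluates it
        history.append((candidate, feedback))
        # Infeasible candidates never replace the incumbent, whatever their score.
        if feedback.feasible and (best_score is None or feedback.score > best_score):
            best, best_score = candidate, feedback.score
    return best, best_score, history


def toy_verifier(x: float) -> Feedback:
    """Toy stand-in for a simulator: feasible on [0, 4], optimum at x = 3."""
    return Feedback(feasible=0.0 <= x <= 4.0, score=-((x - 3.0) ** 2))


def toy_proposer(best: Optional[float], history: History) -> float:
    """Toy stand-in for an agent: start at 0, then step right from the incumbent."""
    return 0.0 if best is None else best + 0.5


if __name__ == "__main__":
    best, best_score, history = generative_optimization(toy_proposer, toy_verifier, budget=10)
    print(best, best_score, len(history))
```

In this toy run the incumbent improves on each of the first seven interactions and then stalls, since stepping past the optimum lowers the score. A real agent would condition its proposals on the full feedback history rather than on a fixed step rule.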