Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench
Qingyun Zou, Feng Yu, Hongshi Tan, Bingsheng He, WengFai Wong

TL;DR
This paper evaluates whether agentic AI systems built for software engineering transfer to hardware engineering, using a new benchmark, Phoenix-bench. It finds that hardware tasks differ fundamentally from software tasks and that targeted test feedback is important.
Contribution
Introduces Phoenix-bench, a comprehensive hardware engineering benchmark, and provides a systematic evaluation of various agentic AI models highlighting key challenges.
Findings
Hardware bugs propagate differently than software bugs, affecting agent performance.
Failures are concentrated in design control-flow, FSM bugs, and cross-hierarchy signal tracking.
Test case feedback significantly improves bug localization and fixing accuracy.
Abstract
We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce Phoenix-bench, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment so resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one…
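The abstract mentions fail-to-pass and pass-to-pass testbenches and a resolved rate, but the excerpt does not spell out the scoring rule. The sketch below shows one plausible reading, assuming the convention common in repository-level repair benchmarks: an instance counts as resolved only if every fail-to-pass testbench passes after the patch and no pass-to-pass testbench regresses. The `InstanceResult` schema, the function names, and the testbench names are hypothetical and are not taken from Phoenix-bench.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceResult:
    """Post-patch testbench outcomes for one instance (hypothetical schema)."""
    # Testbench name -> True if it passed after the agent's patch was applied.
    fail_to_pass: Dict[str, bool] = field(default_factory=dict)
    pass_to_pass: Dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """Assumed criterion: all fail-to-pass pass, and nothing in pass-to-pass breaks."""
    # Require at least one fail-to-pass testbench so an empty record never counts.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: List[InstanceResult]) -> float:
    """Fraction of instances that satisfy the assumed resolution criterion."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Illustrative, made-up outcomes (not data from the paper).
example_results = [
    InstanceResult({"tb_fsm": True}, {"tb_reset": True}),      # fixed, no regression
    InstanceResult({"tb_fifo": False}, {"tb_reset": True}),    # bug not fixed
    InstanceResult({"tb_alu": True}, {"tb_regfile": False}),   # fixed, but regressed
    InstanceResult({"tb_uart": True}, {}),                     # fixed, nothing to regress
]
print(resolved_rate(example_results))
```

Under this assumed rule, a patch that fixes the target bug but breaks a previously passing testbench is not counted as resolved, which is why the pass-to-pass set matters as much as the fail-to-pass set.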
