Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

Qingyun Zou; Feng Yu; Hongshi Tan; Bingsheng He; WengFai Wong

arXiv:2605.15226·cs.AR·May 18, 2026

TL;DR

This paper uses a new benchmark, Phoenix-bench, to test whether agentic AI systems built for software engineering transfer to hardware engineering. It finds that hardware bugs propagate differently from software bugs and that test case feedback improves bug localization and fixing.

Contribution

Introduces Phoenix-bench, a hardware engineering benchmark of 511 verified Verilator instances drawn from 114 GitHub repositories, and uses it to run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones.

Findings

01

Hardware bugs propagate differently than software bugs, affecting agent performance.

02

Failures are concentrated in design control-flow, FSM bugs, and cross-hierarchy signal tracking.

03

Test case feedback significantly improves bug localization and fixing accuracy.

Abstract

We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce Phoenix-bench, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment so resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one…
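The abstract mentions fail-to-pass and pass-to-pass testbenches and a resolved rate, but this page does not spell out the scoring rule. The sketch below assumes the convention common in software repair benchmarks: an instance counts as resolved only if every fail-to-pass testbench passes after the agent's patch and no pass-to-pass testbench regresses. The names InstanceResult, is_resolved, and resolved_rate are hypothetical and are not taken from Phoenix-bench.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Post-patch testbench outcomes for one instance (hypothetical layout)."""
    instance_id: str
    # Testbench name -> passed after the agent's patch was applied.
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    # Assumed rule, not confirmed by the paper excerpt: the bug-revealing
    # testbenches must all pass, and previously passing ones must not regress.
    # An instance with no fail-to-pass testbench is never counted as resolved.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Illustrative data only; these are not results from the paper.
example = [
    InstanceResult("a", {"tb_fix": True}, {"tb_old": True}),   # resolved
    InstanceResult("b", {"tb_fix": True}, {"tb_old": False}),  # regression
    InstanceResult("c", {"tb_fix": False}, {"tb_old": True}),  # bug not fixed
    InstanceResult("d", {"tb_fix": True}, {}),                 # resolved
]
print(resolved_rate(example))
```

Under this assumed rule, a patch that fixes the target bug but breaks an existing testbench is scored the same as no fix at all, which is why the pass-to-pass set matters as much as the fail-to-pass set.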
