HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
Fan Cui, Hongyuan Hou, Zizhang Luo, Chenyun Yin, Yun Liang

TL;DR
HWE-Bench is a repository-level benchmark of 417 real-world hardware bug repair tasks drawn from six open-source projects. It evaluates LLM agents on these tasks and exposes where current agents still fall short.
Contribution
This work introduces HWE-Bench, the first large-scale, repository-level benchmark for hardware bug repair, built with a largely automated pipeline and accompanied by a detailed failure analysis.
Findings
The best agent resolves 70.7% of tasks overall.
Performance exceeds 90% on smaller cores but drops below 65% on complex SoCs.
Failures stem mainly from errors in fault localization, reasoning, and cross-artifact coordination.
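The overall and per-category figures above are resolve rates: the share of task instances an agent fixes. As a minimal sketch of how such rates could be tallied, assuming each evaluated task is recorded as a (category, resolved) pair, the Python below computes both. The function name `resolve_rates`, the category labels, and the toy data are hypothetical illustrations, not the paper's evaluation harness or its results.

```python
from collections import defaultdict


def resolve_rates(results):
    """Return (overall_pct, per_category_pct) for (category, resolved) pairs.

    Hypothetical helper for illustration; not part of HWE-Bench itself.
    """
    totals = defaultdict(int)   # tasks attempted per category
    solved = defaultdict(int)   # tasks resolved per category
    for category, resolved in results:
        totals[category] += 1
        solved[category] += int(resolved)

    per_category = {c: 100.0 * solved[c] / totals[c] for c in totals}
    n_tasks = sum(totals.values())
    overall = 100.0 * sum(solved.values()) / n_tasks if n_tasks else 0.0
    return overall, per_category


# Toy data only (assumed labels "core" and "soc"); not numbers from the paper.
toy_results = [("core", True), ("core", True), ("soc", True), ("soc", False)]
overall, by_category = resolve_rates(toy_results)
print(f"overall: {overall:.1f}%")
for category, pct in sorted(by_category.items()):
    print(f"{category}: {pct:.1f}%")
```

Reporting the per-category breakdown alongside the overall rate matters here because a single aggregate can hide a gap between small cores and complex SoCs.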
Abstract
Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new…
