HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

Fan Cui; Hongyuan Hou; Zizhang Luo; Chenyun Yin; Yun Liang

arXiv:2604.14709 · cs.AI · May 6, 2026


TL;DR

HWE-Bench is a repository-level benchmark of 417 real-world hardware bug repair tasks drawn from six open-source projects. The best agent resolves 70.7% of the tasks, and success rates are markedly lower on complex SoCs than on smaller cores.

Contribution

This work introduces HWE-Bench, the first large-scale, repository-level benchmark for hardware bug repair, with an automated pipeline and detailed failure analysis.

Findings

1. Best agent resolves 70.7% of tasks overall.
2. Performance exceeds 90% on smaller cores but drops below 65% on complex SoCs.
3. Failures are mainly due to fault localization, reasoning, and cross-artifact coordination.
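
The resolution rates in these findings are fractions of tasks resolved, reported overall and per project category. A minimal sketch of that arithmetic follows, assuming each task outcome is recorded as a (category, resolved) pair. The record layout, the category names, and the `resolve_rates` helper are hypothetical, and the toy outcomes are made up for illustration; they are not results from the paper.

```python
from collections import defaultdict

# Hypothetical per-task outcomes: (project category, resolved?).
# Toy data for illustration only, not results from the paper.
outcomes = [
    ("core", True),
    ("core", True),
    ("soc", True),
    ("soc", False),
    ("soc", False),
]


def resolve_rates(records):
    """Return (overall_rate, {category: rate}) as fractions in [0, 1]."""
    totals = defaultdict(int)   # tasks attempted per category
    solved = defaultdict(int)   # tasks resolved per category
    for category, resolved in records:
        totals[category] += 1
        solved[category] += int(resolved)
    per_category = {c: solved[c] / totals[c] for c in totals}
    overall = sum(solved.values()) / sum(totals.values())
    return overall, per_category


overall, per_category = resolve_rates(outcomes)
print(f"overall: {overall:.1%}")
for category, rate in sorted(per_category.items()):
    print(f"{category}: {rate:.1%}")
```

Reporting the per-category rates alongside the overall rate matters here because an aggregate figure can hide a large gap between easy and hard categories, which is the pattern the second finding describes.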

Abstract

Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new…

Code & Models

Datasets

henryen/hwe-bench
dataset · 596 downloads
