RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs

Pengwei Jin; Di Huang; Chongxiao Li; Shuyao Cheng; Yang Zhao; Xinyao Zheng; Jiaguo Zhu; Shuyi Xing; Bohan Dou; Rui Zhang; Zidong Du; Qi Guo; Xing Hu

arXiv:2507.16200 · cs.LG · July 23, 2025

TL;DR

RealBench is a new benchmark for evaluating large language models' ability to generate complex, real-world IP-level Verilog code with rigorous verification, revealing current models' limited performance and highlighting the need for improvement.

Contribution

It introduces the first comprehensive benchmark with real-world IP designs, multi-modal specifications, and rigorous verification for Verilog generation tasks.

Findings

01. The best LLMs achieve only 13.3% pass@1 on module-level tasks.

02. The success rate on system-level tasks is zero.

03. The results highlight the gap between current LLM capabilities and real-world hardware design requirements.
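
The pass@1 figure in the findings is a sampling-based metric. A common way to compute it is the unbiased pass@k estimator used in code-generation benchmarks: for each task, draw n samples, count the c that pass verification, and estimate 1 - C(n-c, k) / C(n, k). This page does not say whether RealBench uses exactly this estimator or how many samples it draws per task, so the following is only a minimal sketch under that assumption. The function names and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples that passed verification,
    k: attempt budget. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three tasks, 20 samples each (not from the paper).
example = [(20, 0), (20, 5), (20, 20)]
score = benchmark_pass_at_k(example, k=1)
```

For k = 1 the estimator reduces to the fraction of passing samples per task, averaged over tasks, so a benchmark-wide pass@1 of 13.3% means roughly one in seven or eight single attempts passes verification.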

Abstract

The automatic generation of Verilog code using Large Language Models (LLMs) has garnered significant interest in hardware design automation. However, existing benchmarks for evaluating LLMs in Verilog generation fall short in replicating real-world design workflows due to their designs' simplicity, inadequate design specifications, and less rigorous verification environments. To address these limitations, we present RealBench, the first benchmark aiming at real-world IP-level Verilog generation tasks. RealBench features complex, structured, real-world open-source IP designs, multi-modal and formatted design specifications, and rigorous verification environments, including 100% line coverage testbenches and a formal checker. It supports both module-level and system-level tasks, enabling comprehensive assessments of LLM capabilities. Evaluations on various LLMs and agents reveal that even…

Code & Models

Datasets

Pengwei-Jin/RealBench
dataset · 367 downloads

Taxonomy

Topics: Formal Methods in Verification · Software Testing and Debugging Techniques · VLSI and Analog Circuit Testing