Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

Nathaniel Pinckney; Chenhui Deng; Chia-Tung Ho; Yun-Da Tsai; Mingjie Liu; Wenfei Zhou; Brucek Khailany; Haoxing Ren

arXiv:2506.14074 · cs.LG · June 18, 2025

TL;DR

The CVDP benchmark provides a comprehensive, challenging dataset of Verilog design problems to evaluate and advance large language models and agents in hardware design and verification, highlighting current limitations.

Contribution

The paper introduces a large, realistic benchmark dataset (783 problems across 13 task categories) with diverse tasks and evaluation methods, enabling systematic assessment of LLMs and agents in RTL design and verification.

Findings

01

State-of-the-art models achieve no more than 34% pass@1 on code generation.

02

Agentic tasks, especially RTL reuse and verification, are particularly challenging.

03

CVDP exposes significant gaps in current model capabilities for hardware design automation.

Abstract

We present the Comprehensive Verilog Design Problems (CVDP) benchmark, a new dataset and infrastructure to advance LLM and agent research in hardware design and verification. CVDP includes 783 problems across 13 task categories, covering RTL generation, verification, debugging, specification alignment, and technical Q&A authored by experienced hardware engineers. Problems are offered in both non-agentic and agentic formats. The benchmark introduces more realistic and challenging contexts than prior work, with state-of-the-art models achieving no more than 34% pass@1 on code generation. Agentic tasks – especially those involving RTL reuse and verification – are particularly difficult. Evaluation uses open-source tools and model scoring infrastructure, with comprehension tasks assessed via BLEU and LLM-based judging. CVDP reveals substantial gaps in current…
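For readers unfamiliar with the pass@1 metric quoted above, the following is a minimal sketch of the widely used unbiased pass@k estimator, assuming n samples are generated per problem and c of them pass the tests. The abstract does not state how CVDP computes the metric, so the function names `pass_at_k` and `mean_pass_at_k` and the sample counts are illustrative assumptions, not the benchmark's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    Hypothetical helper for illustration; not taken from the CVDP code.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, any draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to c / n per problem, so a benchmark-level pass@1 of 34% means that, averaged over problems, roughly one in three sampled solutions passes.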

Code & Models

Repositories

nvlabs/cvdp_benchmark
Official

Datasets

AbiralArch/hardware-cvdp-complete
dataset · 66 downloads

Taxonomy

Topics: Natural Language Processing Techniques