VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation
Paul E. Calzada, Zahin Ibnat, Tanvir Rahman, Kamal Kandula, Danyu Lu, Sujan Kumar Saha, Farimah Farahmandi, Mark Tehranipoor

TL;DR
This paper introduces VerilogDB, a large, high-quality Verilog dataset created through an automated preprocessing pipeline, aimed at improving LLM-based RTL code generation for hardware design automation.
Contribution
The paper presents the construction of the largest high-quality Verilog dataset with an efficient preprocessing framework for LLM fine-tuning in RTL generation.
Findings
VerilogDB contains 20,392 samples and 751 MB of data.
The dataset is verified for syntax and synthesized for quality.
It supports future research in LLM-based hardware design automation.
Abstract
Large Language Models (LLMs) are gaining popularity for hardware design automation, particularly through Register Transfer Level (RTL) code generation. In this work, we examine the current literature on RTL generation using LLMs and identify key requirements for training and fine-tuning datasets. We construct a robust Verilog dataset through an automated three-pronged process involving database (DB) creation and management with PostgreSQL, data collection from code hosting sites like OpenCores and GitHub, and data preprocessing to verify the codes' syntax, run logic synthesis, and extract relevant module metadata. We implement a scalable and efficient DB infrastructure to support analysis and detail our preprocessing pipeline to enforce high-quality data before DB insertion. The resulting dataset comprises 20,392 Verilog samples, 751 MB of Verilog code data, which is the largest…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMachine Learning in Materials Science · Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques
