Free and Fair Hardware: A Pathway to Copyright Infringement-Free Verilog Generation using LLMs
Sam Bush, Matthew DeLorenzo, Phat Tieu, Jeyavijayan Rajendran

TL;DR
This paper introduces FreeSet, an open-source Verilog dataset, and an LLM fine-tuning framework that together reduce the risk of copyright infringement while improving the quality of generated hardware code.
Contribution
It presents FreeSet, a large fair-use Verilog dataset, and a fine-tuning method that minimizes copyright violations in LLM-generated hardware code.
Findings
The FreeV model has a copyright violation rate of only 3%.
FreeV improves Verilog code generation performance.
The dataset and framework enhance fair use in hardware LLM applications.
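A violation rate like the one reported above can be read as the fraction of generated samples that come too close to a copyright-protected file. The sketch below illustrates that idea only. The similarity measure (difflib's character-level ratio), the 0.8 threshold, and the function names are assumptions made for illustration, not the benchmark defined in the paper.

```python
from difflib import SequenceMatcher
from typing import Sequence


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a, b).ratio()


def violation_rate(
    generated: Sequence[str],
    protected: Sequence[str],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> float:
    """Fraction of generated samples whose best match against any
    protected file reaches the threshold."""
    if not generated:
        return 0.0
    flagged = sum(
        1
        for g in generated
        if any(similarity(g, p) >= threshold for p in protected)
    )
    return flagged / len(generated)
```

Under this reading, a 3% rate would mean that roughly 3 of every 100 generated samples cross the chosen threshold. The number depends heavily on the metric and cutoff, so results are only comparable across models when both are held fixed.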
Abstract
Limitations in Large Language Model (LLM) capabilities for hardware design tasks, such as generating functional Verilog codes, have motivated various fine-tuning optimizations utilizing curated hardware datasets from open-source repositories. However, these datasets remain limited in size and contain minimal checks on licensing for reuse, resulting in potential copyright violations by fine-tuned LLMs. Therefore, we propose an evaluation benchmark to estimate the risk of Verilog-trained LLMs to generate copyright-protected codes. To minimize this risk, we present an open-source Verilog dataset, FreeSet, containing over 220k files, along with the automated dataset curation framework utilized to provide additional guarantees of fair-use Verilog data. We then execute an LLM fine-tuning framework consisting of continual pre-training, resulting in a fine-tuned Llama model for Verilog, FreeV.…
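The abstract does not spell out how the curation framework checks licensing, so the following is only a minimal sketch of what such a filter might look like. It assumes each file arrives with the license of its source repository, drops files whose header carries a restrictive notice, and removes exact duplicates. The license allow-list, the header patterns, the 30-line header window, and all function names are hypothetical.

```python
import hashlib
import re
from typing import Iterable, List, Tuple

# Hypothetical allow-list of repository licenses treated as reusable.
ALLOWED_REPO_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}

# Hypothetical header phrases that signal restricted reuse.
RESTRICTED = re.compile(
    r"all rights reserved|proprietary|confidential", re.IGNORECASE
)


def keep_file(source: str, repo_license: str, header_lines: int = 30) -> bool:
    """Keep a file only if its repository license is on the allow-list
    and its first few lines carry no restrictive notice."""
    header = "\n".join(source.splitlines()[:header_lines])
    if RESTRICTED.search(header):
        return False
    return repo_license.lower() in ALLOWED_REPO_LICENSES


def curate(files: Iterable[Tuple[str, str]]) -> List[str]:
    """Filter (source, repo_license) pairs and drop exact duplicates."""
    seen = set()
    kept = []
    for source, repo_license in files:
        digest = hashlib.sha256(source.strip().encode("utf-8")).hexdigest()
        if digest in seen or not keep_file(source, repo_license):
            continue
        seen.add(digest)
        kept.append(source)
    return kept
```

Checking both the repository license and the per-file header matters because a permissively licensed repository can still contain individual files copied in under a proprietary notice.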
Taxonomy
Topics: Physical Unclonable Functions (PUFs) and Hardware Security · Machine Learning in Materials Science · Scientific Computing and Data Management
Methods: LLaMA
