VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation

Yiting Wang; Guoheng Sun; Wanghao Ye; Gang Qu; Ang Li

arXiv:2505.11849·cs.AI·May 20, 2025

TL;DR

VeriReason is a novel reinforcement learning framework that enhances Verilog code generation by integrating testbench feedback and self-checking, significantly improving correctness and generalization over existing models.

Contribution

It introduces VeriReason, the first system combining explicit reasoning with reinforcement learning for RTL Verilog generation, achieving state-of-the-art results.

Findings

01 · Achieves 83.1% functional correctness (pass@5) on the VerilogEval-Machine benchmark

02 · Up to 2.8X improvement in first-attempt correctness

03 · Outperforms larger models like GPT-4 Turbo
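
The correctness figures above are pass@k-style metrics (the reviews on this page cite the headline number as pass@5). A minimal sketch of how such a score is commonly computed, assuming the standard unbiased pass@k estimator; the page does not state which estimator the authors used, and the function name `pass_at_k` is illustrative only:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for a problem
    c: number of those samples that pass the testbench
    k: sampling budget being evaluated (k <= n)
    Assumption: the usual unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 this reduces to the fraction of samples that pass, which is what "first-attempt correctness" measures.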

Abstract

Automating Register Transfer Level (RTL) code generation using Large Language Models (LLMs) offers substantial promise for streamlining digital circuit design and reducing human effort. However, current LLM-based approaches face significant challenges with training data scarcity, poor specification-code alignment, lack of verification mechanisms, and balancing generalization with specialization. Inspired by DeepSeek-R1, we introduce VeriReason, a framework integrating supervised fine-tuning with Guided Reward Proximal Optimization (GRPO) reinforcement learning for RTL generation. Using curated training examples and a feedback-driven reward model, VeriReason combines testbench evaluations with structural heuristics while embedding self-checking capabilities for autonomous error correction. On the VerilogEval Benchmark, VeriReason delivers significant improvements: achieving 83.1%…
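
As a rough illustration of the reward idea in the abstract, the sketch below combines a syntax check, a testbench result, and a structural-similarity heuristic into one score, then normalizes scores within a group of sampled completions. Everything here is an assumption made for illustration: the weights, the function names, the token-level similarity (a stand-in for whatever structural comparison the authors use), and the group-relative normalization (the scheme used by GRPO in DeepSeek-R1, which the abstract cites as inspiration). It is not the authors' implementation.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def structural_similarity(candidate: str, reference: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in for a real structural comparison."""
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()


def reward(compiles: bool, passes_testbench: bool, candidate: str, reference: str,
           w_syntax: float = 0.2, w_struct: float = 0.3, w_func: float = 0.5) -> float:
    """Hypothetical tiered reward; the weights are illustrative, not the paper's."""
    if not compiles:
        return 0.0  # code that does not compile earns nothing
    score = w_syntax + w_struct * structural_similarity(candidate, reference)
    if passes_testbench:
        score += w_func  # functional correctness carries the largest weight
    return score


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch a completion that fails the testbench can still earn partial credit for compiling and resembling the reference, which gives the policy a learning signal even when no sample in a group is fully correct.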

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The authors apply GRPO to a novel problem in data-scarce hardware front-end design, with a well-designed multi-level reward system.
- The paper showcases impressive results: 83.1% pass@5 on VerilogEval-Machine, outperforming GPT-4 Turbo (83.0%) with much smaller models. The improvements are particularly impressive for smaller models (1.5B: +19.1 points). However, this is also a weakness, as VerilogEval-Human numbers lag significantly behind.
- The adaptive data filtration strategy (retaining samp

Weaknesses

- The paper mentions using VerilogEval but doesn't specify which version (v1 or v2); it also fails to explain the differences in model performance between VerilogEval-Machine and VerilogEval-Human. These points matter because the impressive results on the former could be a result of evaluation data contamination, since it was scraped from problems online.
- Where are the evaluation results for RTLLM and similar benchmarks?
- No ablation on GRPO vs. other RL algorithms (PPO, DPO).
- Using GPT4 to regenerate and

Reviewer 02 · Rating 4 · Confidence 4

Strengths

(1) A filtering algorithm for Verilog corpora. A major contribution of this paper is the two-stage adaptive filtration process to collect Verilog modules. These complex steps ensure the stability of GRPO training, especially in combination with the reward function.
(2) A reward model with reinforcement learning testbench feedback. The reward score includes three measures: syntactic correctness, functional correctness, and structural similarity. Selected hyperparameters ensure the balance of

Weaknesses

(1) Limited novelty and insights. Regarding the methodology, the authors only perform SFT and GRPO on Qwen2.5, without proposing new training paradigms specifically for Verilog code generation or addressing challenges unique to it. The results (in Fig. 2) only show increasing reward scores, but an increasing reward may come from reward hacking by the model rather than a true improvement in coding ability. It would be better to show the actual benchmark scores at different training steps. In additio

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The integration of explicit testbench-driven feedback within a reinforcement learning loop (GRPO) for Verilog code generation is carefully engineered and directly tailored to the domain;
2. The work includes comprehensive ablations that disentangle and validate the individual and combined effects of supervised fine-tuning and GRPO, making the contribution measurable and transparent;
3. This paper open-sources its codebase and curated dataset to advance reproducibility and benchmarking practic

Weaknesses

1. In Table 1, RTL-generation baselines such as CodeV [1] and CraftRTL [2] are missing, even though they demonstrate stronger performance on the VerilogEval benchmark;
2. There are other concurrent works on reinforcement learning-based RTL generation, such as [3] and [4]. It would be better if the authors gave a brief review of these works and clarified how this work differs from them;
3. The paper includes several manually selected hyperparameters. For instance, on lines 299–300, could the

Code & Models

Repositories

NellyW8/VeriReason
pytorch · Official

Taxonomy

Topics: Physical Unclonable Functions (PUFs) and Hardware Security · Adversarial Robustness in Machine Learning · Formal Methods in Verification

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax