VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation
Yiting Wang, Guoheng Sun, Wanghao Ye, Gang Qu, Ang Li

TL;DR
VeriReason is a novel reinforcement learning framework that enhances Verilog code generation by integrating testbench feedback and self-checking, significantly improving correctness and generalization over existing models.
Contribution
It introduces VeriReason, the first system combining explicit reasoning with reinforcement learning for RTL Verilog generation, achieving state-of-the-art results.
Findings
Achieves 83.1% pass@5 on the VerilogEval-Machine benchmark
Up to 2.8X improvement in first-attempt correctness
Outperforms larger models like GPT-4 Turbo
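The 83.1% figure is quoted as pass@5 in the peer reviews. As a hedged illustration of what such a number measures, the sketch below implements the unbiased pass@k estimator commonly used for code-generation benchmarks: for a problem with n sampled completions of which c pass the testbench, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper computes its scores exactly this way is an assumption, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: samples that pass the testbench,
    k: attempt budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, any draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random subset of k samples contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is the mean of the per-problem estimates, so a few problems the model never solves pull the average down no matter how many samples are drawn.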
Abstract
Automating Register Transfer Level (RTL) code generation using Large Language Models (LLMs) offers substantial promise for streamlining digital circuit design and reducing human effort. However, current LLM-based approaches face significant challenges with training data scarcity, poor specification-code alignment, lack of verification mechanisms, and balancing generalization with specialization. Inspired by DeepSeek-R1, we introduce VeriReason, a framework integrating supervised fine-tuning with Guided Reward Proximal Optimization (GRPO) reinforcement learning for RTL generation. Using curated training examples and a feedback-driven reward model, VeriReason combines testbench evaluations with structural heuristics while embedding self-checking capabilities for autonomous error correction. On the VerilogEval Benchmark, VeriReason delivers significant improvements: achieving 83.1%…
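The abstract describes a reward that combines testbench evaluations with structural heuristics, and the reviews summarize it as three measures: syntactic correctness, functional correctness, and structural similarity. The sketch below shows one way such a multi-level reward could be combined. It is not the paper's formula: the `Feedback` fields, the weights, and the rule that non-compiling code scores zero are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Hypothetical per-sample feedback gathered from a simulator run."""
    compiles: bool                 # did the Verilog pass syntax checking?
    tests_passed: int              # testbench checks that passed
    tests_total: int               # testbench checks that ran
    structural_similarity: float   # similarity to a reference design, in [0, 1]


def reward(fb: Feedback,
           w_syntax: float = 0.2,
           w_func: float = 0.6,
           w_struct: float = 0.2) -> float:
    """Combine the three measures into one scalar (weights are assumed)."""
    # Assumption: code that does not compile earns nothing.
    if not fb.compiles:
        return 0.0
    # Functional score: fraction of testbench checks passed.
    func = fb.tests_passed / fb.tests_total if fb.tests_total > 0 else 0.0
    # Clamp the structural score so a noisy metric cannot dominate.
    struct = min(max(fb.structural_similarity, 0.0), 1.0)
    return w_syntax + w_func * func + w_struct * struct
```

Weighting functional correctness most heavily reflects the idea that passing the testbench is the real objective, while the structural term gives partial credit when the simulation fails.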
Peer Reviews
Decision: Submitted to ICLR 2026
- The authors apply GRPO to a novel problem in data-scarce hardware front-end design with a well-designed multi-level reward system.
- The paper showcases impressive results: 83.1% pass@5 on VerilogEval-Machine, outperforming GPT-4 Turbo (83.0%) with much smaller models. The improvements are particularly impressive for smaller models (1.5B: +19.1 points). However, this is also a weakness, as VerilogEval-Human numbers lag significantly behind.
- The adaptive data filtration strategy (retaining samp
- The paper mentions using VerilogEval but doesn't specify which version (v1 or v2); it also fails to explain the differences in model performance between VerilogEval-Machine and VerilogEval-Human. These points matter because the impressive results on the former could be the result of evaluation data contamination, since it was scraped from problems online.
- Where are the evaluation results for RTLLM and similar benchmarks?
- No ablation on GRPO vs other RL algorithms (PPO, DPO).
- Using GPT4 to regenerate and
(1) A filtering algorithm for Verilog corpora. A major contribution of this paper is the two-stage adaptive filtration process used to collect Verilog modules. These steps help keep GRPO training stable, especially in combination with the reward function.
(2) A reward model with reinforcement learning testbench feedback. The reward score includes three measures: syntactic correctness, functional correctness, and structural similarity. Selected hyperparameters ensure the balance of
(1) Limited novelty and insights. Regarding the methodology, the authors only perform SFT and GRPO on Qwen2.5, without proposing new training paradigms specifically for Verilog code generation or addressing its unique challenges. The results (in Fig. 2) only show increasing reward scores, but an increasing reward may come from reward hacking rather than a true improvement in coding ability. It would be better to show the actual benchmark scores at different training steps. In additio
1. The integration of explicit testbench-driven feedback within a reinforcement learning loop (GRPO) for Verilog code generation is carefully engineered and directly tailored to the domain;
2. The work includes comprehensive ablations that disentangle and validate the individual and combined effects of supervised fine-tuning and GRPO, making the contribution measurable and transparent;
3. This paper open-sources its codebase and curated dataset to advance reproducibility and benchmarking practic
1. In Table 1, baselines for RTL generation such as CodeV[1] and CraftRTL[2] are missing, even though they demonstrate stronger performance on the VerilogEval benchmark;
2. There are other concurrent works on reinforcement learning-based RTL generation, such as [3] and [4]. It would be better if the authors gave a brief review of these works and clarified how this work differs from them;
3. The paper includes several manually selected hyperparameters. For instance, on lines 299–300, could the
Taxonomy
Topics: Physical Unclonable Functions (PUFs) and Hardware Security · Adversarial Robustness in Machine Learning · Formal Methods in Verification
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax
