Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
Qingyun Zou, Yingze Li, Tianen Liu, Bingsheng He, and Weng-Fai Wong

TL;DR
This paper introduces RWOPD, a novel on-policy distillation method that improves NL-to-SVA generation accuracy by optimizing property equivalence using a verifier-guided reward system.
Contribution
The paper proposes RWOPD, a property-aware distillation technique that leverages a property-verification reward to enhance SVA correctness in large language models.
Findings
RWOPD achieves state-of-the-art results on NL2SVA benchmarks.
Distilling with property-based rewards surpasses traditional token-level mimicry methods.
The approach improves pass@1, pass@5, and pass@10 metrics over prior models.
Abstract
LLM-based generation of SystemVerilog Assertions (SVA) is often reported as nearing saturation, with the strongest specialized model reaching accuracy on NL2SVA-Human. We show that this aggregate hides a temporal gap: models that appear strong overall still collapse to a few implication templates on bounded-delay and liveness specifications. The core issue is that the dominant recipe, supervised fine-tuning on NL/SVA pairs, optimizes token-level mimicry rather than the \emph{property equivalence} that defines SVA correctness. We introduce \emph{Reward-Weighted On-Policy Distillation} (RWOPD), an on-policy distillation method that samples student rollouts, scores them with an open SymbiYosys+Z3 Property-Equivalence Checker (PEC), and applies a verifier-reward-weighted forward-KL gradient from a frozen 14B teacher on verifier-passable rollouts. This keeps the supervision…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
