From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection
Youpeng Li, Fuxun Yu, Xinda Wang

TL;DR
This paper systematically investigates post-training techniques for LLM-based vulnerability detection (VD). It shows that on-policy RL with GRPO outperforms SFT, off-policy preference optimization, and specialized VD LLMs, and it offers guidelines for data curation, training stages, rewards, and evaluation.
Contribution
The authors present what they describe as the first comprehensive study of the post-training pipeline for LLM-based vulnerability detection, identifying VD-specific strategies and insights that go beyond common practice.
Findings
On-policy RL with GRPO outperforms SFT and preference optimization.
SFT based on rejection sampling is more effective than rationalization-based supervision, which can introduce hallucinations.
Root-cause analysis-based evaluation offers a more robust assessment.
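To make the first finding more concrete, the sketch below shows the group-relative advantage computation that gives GRPO its name: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration of the general GRPO formula, not the paper's training code; the function name, the `eps` constant, and the 0/1 reward values are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar per response sampled for the same prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled analyses of one code snippet:
# 1.0 = correct verdict, 0.0 = incorrect verdict.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every response earns the same reward yields zero
# advantages, so that prompt contributes no gradient signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, without a separately trained value model. Prompts that are too easy or too hard for the current policy produce identical rewards across the group and therefore zero advantages.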
Abstract
The integration of LLMs into vulnerability detection (VD) has shifted the field toward more interpretable and context-aware analysis. While post-training techniques have shown promise in general coding tasks, their systematic application to VD remains underexplored. In this paper, we present the first comprehensive investigation into the post-training pipeline for LLM-based VD, demonstrating that on-policy RL with GRPO consistently outperforms SFT, off-policy preference optimization methods, and specialized VD LLMs. Our study further reveals VD-specific post-training guidelines and insights beyond common practices: (1) For data curation, contrary to the widespread use of rationalization-based supervision in prior VD work, SFT based on rejection sampling proves more effective, as rationalization can introduce hallucinations; in RL training, the inherently skewed difficulty distribution…
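The rejection-sampling idea mentioned for SFT data curation can be sketched as follows: sample several analyses per labeled example from the model and keep only those whose final verdict agrees with the ground-truth label. This is a hedged illustration under stated assumptions, not the authors' pipeline. `Sample`, `rejection_sample`, `make_stub`, the `generate` callback, and the verdict strings are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    code: str   # function under analysis
    label: str  # ground-truth verdict, e.g. "vulnerable" or "benign"


def rejection_sample(sample, generate, k=4):
    """Draw k analyses and keep those whose verdict matches the label.

    `generate(code)` is an assumed callback returning (reasoning, verdict).
    """
    kept = []
    for _ in range(k):
        reasoning, verdict = generate(sample.code)
        if verdict == sample.label:
            kept.append({"prompt": sample.code, "response": reasoning})
    return kept


def make_stub(verdicts):
    """Deterministic stand-in for a model, used only for this example."""
    it = iter(verdicts)

    def generate(code):
        v = next(it)
        return f"analysis concluding {v}", v

    return generate


stub = make_stub(["vulnerable", "benign", "vulnerable", "benign"])
kept = rejection_sample(Sample("int f(char *s) { ... }", "vulnerable"), stub, k=4)
```

Because the retained responses are the model's own outputs that happened to reach the correct verdict, none of the reasoning is written after the fact to justify a given label. Writing it after the fact is the rationalization setup that the abstract contrasts this with.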
