Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization
Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

TL;DR
This paper investigates how Group Relative Policy Optimization (GRPO), a reinforcement learning technique, can improve the reasoning and detection capabilities of Large Language Models in identifying software vulnerabilities, surpassing standard finetuning methods.
Contribution
The paper applies GRPO to LLM-based vulnerability detection, redefining the advantage functions and reward signals for the task, and demonstrates improvements in both detection performance and reasoning.
Findings
GRPO enhances LLM generalization in vulnerability detection
RL-based training improves reasoning abilities of LLMs
Performance surpasses standard supervised finetuning
Abstract
Improving and understanding the training dynamics and reasoning of Large Language Models (LLMs) has become essential for their deployment in AI-based security tools, such as software vulnerability detection. In this work, we present an extensive study aimed at advancing recent RL-based finetuning techniques for LLMs in the context of vulnerability detection. We start by highlighting key limitations of commonly adopted LLMs, such as their tendency to over-predict certain types of vulnerabilities while failing to detect others. To address this challenge, we explore the use of Group Relative Policy Optimization (GRPO), a recent policy-gradient method, for guiding LLM behavior through structured, rule-based rewards. We enable its application to the vulnerability detection task by redefining its advantage functions and reward signals using annotations from widely used datasets in the…
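To make the idea of group-relative, rule-based rewards concrete, here is a minimal sketch of the standard GRPO advantage computation: several completions are sampled for the same prompt, each is scored by a rule, and each score is normalized against the group's mean and standard deviation. This is the generic GRPO formulation, not the redefined advantage and reward functions the abstract mentions, which are not specified here. The function names, the labels, and the 0.1 format bonus are assumptions chosen for illustration only.

```python
from statistics import mean, pstdev


def rule_based_reward(predicted_label: str, true_label: str, has_reasoning: bool) -> float:
    """Hypothetical rule-based reward: 1.0 for a correct label,
    plus an assumed 0.1 bonus when the completion includes a reasoning section."""
    reward = 1.0 if predicted_label == true_label else 0.0
    if has_reasoning:
        reward += 0.1
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standard GRPO-style advantage: normalize each reward by the
    mean and (population) standard deviation of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    true_label = "vulnerable"
    # Four sampled completions for one code snippet: (predicted label, has reasoning).
    group = [("vulnerable", True), ("safe", False), ("vulnerable", True), ("safe", True)]
    rewards = [rule_based_reward(pred, true_label, reasoning) for pred, reasoning in group]
    for (pred, _), adv in zip(group, group_relative_advantages(rewards)):
        print(f"{pred:>10}  advantage={adv:+.3f}")
```

Completions that beat the group average receive a positive advantage and are reinforced; those below it are penalized. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.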
