VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection
Chidera Biringa, Ajmal Abbas, Vishnu Selvaraj, Gokhan Kul

TL;DR
VulStyle is a multi-modal pre-trained model for vulnerability detection that jointly encodes source code, non-terminal AST structure, and code stylometry features, achieving state-of-the-art results on BigVul and VulDeePecker.
Contribution
It introduces a multi-modal approach that combines function-level source code, non-terminal AST nodes, and code stylometry features, pre-trained with masked language modeling on 4.9M functions across seven programming languages.
Findings
VulStyle outperforms existing models on the BigVul and VulDeePecker datasets.
Incorporating CStyle features significantly improves detection accuracy.
Selective AST node encoding reduces complexity while preserving semantic information.
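To make the idea of selective AST node encoding concrete, here is a minimal sketch of keeping only non-terminal nodes from a syntax tree. It is an illustration, not the authors' pipeline: it uses Python's built-in `ast` module as a stand-in parser, and the helper names `structural_children` and `non_terminal_sequence`, as well as the choice to treat operator and context markers as non-structural, are assumptions made for this example.

```python
import ast

# Assumption for this sketch: operator and load/store markers are leaf-like
# bookkeeping nodes in Python's AST, so they do not count as real children.
_MARKER_TYPES = (ast.expr_context, ast.operator, ast.unaryop, ast.boolop, ast.cmpop)


def structural_children(node: ast.AST) -> list:
    """Return the child nodes that carry syntactic structure."""
    return [c for c in ast.iter_child_nodes(node) if not isinstance(c, _MARKER_TYPES)]


def non_terminal_sequence(source: str) -> list:
    """Pre-order list of node type names, keeping only nodes that have children."""
    kept = []

    def visit(node: ast.AST) -> None:
        children = structural_children(node)
        if children:  # a node with no structural children is a terminal: skip it
            kept.append(type(node).__name__)
        for child in children:
            visit(child)

    visit(ast.parse(source))
    return kept


example = "def f(a, b):\n    return a + b\n"
sequence = non_terminal_sequence(example)
print(sequence)
```

In this sketch the identifiers `a` and `b` are dropped as terminals, while the enclosing function, return statement, and binary operation remain, so the sequence is shorter than the full tree but still reflects the hierarchy.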
Abstract
We present VulStyle, a multi-modal software vulnerability detection model that jointly encodes function-level source code, non-terminal Abstract Syntax Tree (AST) structure, and code stylometry (CStyle) features. Prior work in code representation primarily leverages token-level models or full AST trees, often missing stylistic cues indicative of risky programming practices, or incurring high structural overhead. Our approach selects only non-terminal AST nodes, reducing input complexity while preserving semantic hierarchy, and integrates syntactic and lexical CStyle features as auxiliary vulnerability signals. VulStyle is pre-trained using masked language modeling on 4.9M functions across seven programming languages, and fine-tuned across five benchmark datasets: Devign, BigVul, DiverseVul, REVEAL, and VulDeePecker. VulStyle achieves state-of-the-art performance on BigVul and…
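The abstract refers to lexical and syntactic CStyle features without listing them here. The sketch below shows what features of that kind could look like for a C-like function. The specific features, the keyword list, and the function name `cstyle_features` are hypothetical choices for illustration and are not taken from the paper.

```python
import re

# Hypothetical, deliberately small keyword list used only to separate
# identifiers from reserved words in this sketch.
C_KEYWORDS = {"int", "char", "void", "float", "double", "if", "else",
              "for", "while", "return", "struct", "sizeof"}


def cstyle_features(source: str) -> dict:
    """Compute a few illustrative stylometry features for C-like code."""
    non_empty = [ln for ln in source.splitlines() if ln.strip()]

    # Lexical features: line layout, comment usage, identifier naming.
    comment_lines = sum(1 for ln in non_empty if ln.strip().startswith("//"))
    code_only = re.sub(r"//[^\n]*", "", source)  # drop line comments before tokenizing
    identifiers = [t for t in re.findall(r"[A-Za-z_]\w*", code_only)
                   if t not in C_KEYWORDS]

    # Syntactic feature: maximum brace nesting depth.
    depth = max_depth = 0
    for ch in code_only:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)

    n = len(non_empty)
    return {
        "num_lines": n,
        "avg_line_length": sum(len(ln) for ln in non_empty) / n if n else 0.0,
        "comment_ratio": comment_lines / n if n else 0.0,
        "avg_identifier_length": (sum(map(len, identifiers)) / len(identifiers)
                                  if identifiers else 0.0),
        "max_nesting_depth": max_depth,
    }


sample = (
    "int add(int a, int b) {\n"
    "    // sum two values\n"
    "    if (a > 0) {\n"
    "        return a + b;\n"
    "    }\n"
    "    return b;\n"
    "}\n"
)
features = cstyle_features(sample)
print(features)
```

A vector like this could be supplied alongside the token and AST inputs as an auxiliary signal; how VulStyle actually encodes and fuses its CStyle features is not specified in the text above.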
