No Data? No Problem: Synthesizing Security Graphs for Better Intrusion Detection
Yi Huang, Shaofei Li, Yao Guo, Xiangqun Chen, Ding Li, Wajih Ul Hassan

TL;DR
This paper introduces PROVSYN, a hybrid framework that synthesizes high-fidelity provenance graphs and uses them to augment training data for intrusion detection, particularly when the real data is imbalanced.
Contribution
PROVSYN is a novel hybrid synthesis framework that combines graph structure generation with textual attribute generation, improving both graph fidelity and the performance of downstream detection models.
Findings
PROVSYN produces higher-fidelity graphs than four baselines across five evaluation dimensions.
Augmenting datasets with synthesized graphs improves detection accuracy by up to 38%.
Synthetic graphs mitigate data imbalance, increasing normalized entropy by up to 35%.
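To make the normalized-entropy finding concrete, here is a minimal sketch of one common definition: the Shannon entropy of a class distribution divided by log K, where K is the number of classes, so that 1.0 means perfectly balanced. The summary does not say which classes or which exact formula the paper uses, so the function name, the label values, and the counts below are illustrative assumptions, not the authors' setup.

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Shannon entropy of the label distribution divided by log(K).

    K is the number of distinct labels observed. This is an assumed,
    textbook definition; the paper's exact formula may differ.
    """
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        return 0.0  # a single class carries no entropy
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)

# Hypothetical imbalanced label set (the labels are made up for illustration).
real = ["read"] * 90 + ["write"] * 9 + ["exec"] * 1

# Hypothetical augmentation: add synthetic samples for the rare classes.
augmented = real + ["write"] * 81 + ["exec"] * 89

before = normalized_entropy(real)
after = normalized_entropy(augmented)
print(f"before: {before:.3f}, after: {after:.3f}")
```

In this toy case the augmented set has equal counts per class, so its normalized entropy reaches the maximum of 1.0. Real augmentation would rarely balance classes exactly; the metric simply moves toward 1.0 as the distribution evens out.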
Abstract
Provenance graph analysis plays a vital role in intrusion detection, particularly against Advanced Persistent Threats (APTs), by exposing complex attack patterns. While recent systems combine graph neural networks (GNNs) with natural language processing (NLP) to capture structural and semantic features, their effectiveness is limited by class imbalance in real-world data. To address this, we introduce PROVSYN, a novel hybrid provenance graph synthesis framework, which comprises three components: (1) graph structure synthesis via heterogeneous graph generation models, (2) textual attribute synthesis via fine-tuned Large Language Models (LLMs), and (3) five-dimensional fidelity evaluation. Experiments on six benchmark datasets demonstrate that PROVSYN consistently produces higher-fidelity graphs across the five evaluation dimensions compared to four strong baselines. To further…
