MalDataGen: A Modular Framework for Synthetic Tabular Data Generation in Malware Detection
Kayua Oleques Paim, Angelo Gaspar Diniz Nogueira, Diego Kreutz, Weverton Cordeiro, Rodrigo Brandao Mansilha

TL;DR
MalDataGen is an open-source, modular framework that uses deep learning models (e.g., WGAN-GP, VQ-VAE) to generate synthetic tabular data for malware detection, and it outperforms benchmarks such as SDV on utility metrics.
Contribution
It introduces a modular, deep-learning-based framework for synthetic data generation tailored to malware detection, which outperforms benchmarks such as SDV in the reported evaluation.
Findings
MalDataGen outperforms SDV in utility metrics.
Dual validation (TR-TS/TS-TR) supports the fidelity of the generated data.
Framework integrates seamlessly into detection pipelines.
Abstract
The scarcity of high-quality data hinders malware detection and limits ML performance. We introduce MalDataGen, an open-source modular framework for generating high-fidelity synthetic tabular data using deep learning models (e.g., WGAN-GP, VQ-VAE). Evaluated via dual validation (TR-TS/TS-TR), seven classifiers, and utility metrics, MalDataGen outperforms benchmarks like SDV while preserving data utility. Its flexible design enables seamless integration into detection pipelines, offering a practical solution for cybersecurity applications.
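The dual validation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that TS-TR means training on synthetic data and testing on real data, and that TR-TS means the reverse; the abstract does not expand the acronyms. The nearest-centroid classifier, the toy Gaussian data, and all function names are hypothetical stand-ins chosen to keep the example dependency-free. They are not MalDataGen's API, its generators, or the seven classifiers used in the paper.

```python
import random
from statistics import mean


def fit_centroids(X, y):
    """Toy nearest-centroid classifier: one mean vector per class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {
        label: [mean(col) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, X):
    """Assign each row to the class whose centroid is closest."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: sqdist(centroids[c], row)) for row in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def dual_validation(real, synthetic):
    """Score both directions; `real` and `synthetic` are (X, y) tuples.

    Assumed meaning: TS-TR = train on synthetic, test on real;
    TR-TS = train on real, test on synthetic.
    """
    X_r, y_r = real
    X_s, y_s = synthetic
    ts_tr = accuracy(y_r, predict(fit_centroids(X_s, y_s), X_r))
    tr_ts = accuracy(y_s, predict(fit_centroids(X_r, y_r), X_s))
    return {"TS-TR": ts_tr, "TR-TS": tr_ts}


def toy_samples(n, seed):
    """Hypothetical two-class data: Gaussian blobs around (0, 0) and (3, 3)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 3.0 * label
        X.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
        y.append(label)
    return X, y


real = toy_samples(200, seed=0)
synthetic = toy_samples(200, seed=1)  # stand-in for a generator's output
scores = dual_validation(real, synthetic)
print(scores)
```

If the synthetic data follows the real distribution closely, both scores should be high and close to each other. A large gap between the two directions would suggest that the generator misses part of the real distribution or adds patterns absent from it.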
Taxonomy
Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications
