Generating Synthetic Malware Samples Using Generative AI
Tiffany Bao, Kylie Trousil, Quang Duy Tran, Fabio Di Troia, Younghee Park

TL;DR
This paper introduces a system using generative AI to create synthetic malware samples, improving malware classification, especially for minor classes, by augmenting limited datasets with high-quality, realistic data.
Contribution
It proposes a novel approach combining NLP and various generative models to produce synthetic malware data, enhancing classification performance in cybersecurity.
Findings
Synthetic data improves minor class classification by up to 60%.
Overall malware classification accuracy increases by 8%.
Diffusion-based synthetic data demonstrates high quality and robustness.
Abstract
Malware attacks have a significant negative impact on organizations of all sizes. Recently, malware researchers have increasingly turned to machine learning techniques to combat the sophisticated obfuscation methods used in malware. However, collecting a diverse set of malware samples with various obfuscation techniques is challenging and often takes years, especially for newly developed malware. This issue is compounded by a well-known limitation of machine learning models: their poor performance when training data is scarce. In this paper, we propose a new system for generating synthetic malware samples to augment imbalanced malware datasets. Our approach decomposes malware binary samples into mnemonic opcode sequences, leveraging natural language processing to extract the contextual meaning behind malware opcode features to aid the learning of…
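As a rough illustration of what treating opcode sequences like text can look like, the sketch below maps mnemonic opcodes to integer token IDs and counts adjacent opcode pairs. It is not the authors' pipeline: the sample sequences, the `<unk>` token, and the helper names `build_vocab`, `encode`, and `bigram_counts` are all assumptions made for this example, and a real system would obtain the mnemonics from a disassembler.

```python
from collections import Counter

# Hypothetical opcode sequences, one list per sample (illustrative only).
samples = [
    ["push", "mov", "call", "pop", "ret"],
    ["mov", "xor", "mov", "call", "ret"],
]

def build_vocab(sequences):
    """Assign each distinct mnemonic an integer ID; ID 0 is reserved for unknown opcodes."""
    vocab = {"<unk>": 0}
    for seq in sequences:
        for op in seq:
            if op not in vocab:
                vocab[op] = len(vocab)
    return vocab

def encode(seq, vocab):
    """Convert a mnemonic sequence into token IDs, mapping unseen opcodes to 0."""
    return [vocab.get(op, 0) for op in seq]

def bigram_counts(seq):
    """Count adjacent opcode pairs, a simple way to capture local context."""
    return Counter(zip(seq, seq[1:]))

vocab = build_vocab(samples)
encoded = [encode(s, vocab) for s in samples]
```

The integer sequences in `encoded` are the kind of input a sequence model or embedding layer could consume, while the bigram counts give a simple bag-of-n-grams feature for comparison.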
