Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis
Mohammad Zbeeb, Mohammad Ghorayeb, Mariam Salman

TL;DR
This paper presents a novel method for generating high-quality synthetic malicious network traffic data by transforming numerical data into text, leveraging multiple generative models to improve generalization and data fidelity.
Contribution
It introduces a unique approach that frames data synthesis as a language modeling task, outperforming existing models in generating complex structured data.
Findings
In extensive statistical analyses, the method surpasses state-of-the-art generative models in synthetic data fidelity.
Transforming data into text enhances regularization and generalization.
Open-source code and models facilitate further research.
Abstract
Artificial Intelligence (AI) research often aims to develop models that can generalize reliably across complex datasets, yet this remains challenging in fields where data is scarce, intricate, or inaccessible. This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize one of the most demanding structured datasets: Malicious Network Traffic. Our approach uniquely transforms numerical data into text, re-framing data generation as a language modeling task, which not only enhances data regularization but also significantly improves generalization and the quality of the synthetic data. Extensive statistical analyses demonstrate that our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data. Additionally, we conduct a comprehensive study on synthetic data applications, effectiveness, and evaluation…
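The abstract's central idea, turning numerical records into text so that a sequence model can be trained on them, can be illustrated with a small round-trip sketch. This is not the authors' pipeline: the feature names, the fixed-precision space-separated format, and the character-level vocabulary are all assumptions made for illustration, since this summary does not state the actual serialization scheme.

```python
from typing import Dict, List

# Hypothetical feature names; the paper's real schema is not given in this summary.
FEATURES = ["duration", "src_bytes", "dst_bytes", "packets"]


def row_to_text(row: Dict[str, float], precision: int = 2) -> str:
    """Serialize one numeric record as a space-separated line of fixed-precision numbers."""
    return " ".join(f"{row[name]:.{precision}f}" for name in FEATURES)


def text_to_row(line: str) -> Dict[str, float]:
    """Parse a (possibly model-generated) text line back into a numeric record."""
    parts = line.split()
    if len(parts) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} fields, got {len(parts)}")
    return {name: float(value) for name, value in zip(FEATURES, parts)}


def build_vocab(lines: List[str]) -> Dict[str, int]:
    """Build a character-level vocabulary (an assumed tokenization) from the corpus."""
    chars = sorted(set("".join(lines)))
    return {ch: i for i, ch in enumerate(chars)}


def encode(line: str, vocab: Dict[str, int]) -> List[int]:
    """Map each character of a line to its integer token id."""
    return [vocab[ch] for ch in line]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Map token ids back to the original text line."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


if __name__ == "__main__":
    rows = [
        {"duration": 1.5, "src_bytes": 300, "dst_bytes": 0, "packets": 4},
        {"duration": 0.25, "src_bytes": 12, "dst_bytes": 48, "packets": 2},
    ]
    corpus = [row_to_text(r) for r in rows]
    vocab = build_vocab(corpus)
    ids = encode(corpus[0], vocab)
    print(corpus[0])
    print(text_to_row(decode(ids, vocab)))
```

In a setup like this, the token id sequences would be the training input for a language model, and `text_to_row` would convert sampled lines back into numeric records, rejecting any line with the wrong number of fields.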
Taxonomy
Topics: Advanced Database Systems and Queries
