Text generation for dataset augmentation in security classification tasks
Alexander P. Welsh, Matthew Edwards

TL;DR
This paper explores using large language models such as GPT-3 to generate synthetic training data for security-related text classifiers, reporting significant performance gains, especially when positive (malicious) samples are scarce.
Contribution
The paper introduces novel GPT-3-based data augmentation methods tailored to security classification tasks and evaluates their effectiveness against existing augmentation strategies.
Findings
GPT-3 augmentation improves classifier performance
Significant benefits in low positive sample scenarios
Outperforms basic augmentation methods
Abstract
Security classifiers, designed to detect malicious content in computer systems and communications, can underperform when provided with insufficient training data. In the security domain, it is often easy to find samples of the negative (benign) class, and challenging to find enough samples of the positive (malicious) class to train an effective classifier. This study evaluates the application of natural language text generators to fill this data gap in multiple security-related text classification tasks. We describe a variety of previously unexamined language-model fine-tuning approaches for this purpose and consider in particular the impact of disproportionate class imbalances in the training set. Across our evaluation using three state-of-the-art classifiers designed for offensive language detection, review fraud detection, and SMS spam detection, we find that models trained with…
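The general idea described in the abstract, topping up a scarce positive class with generated text, can be sketched as below. This is an illustrative sketch only, not the authors' pipeline: `augment_positive_class`, `toy_generator`, `target_ratio`, and `prompt_size` are hypothetical names, and the placeholder generator stands in for a real text generator such as a fine-tuned language model.

```python
import math
import random
from typing import Callable, List, Tuple


def augment_positive_class(
    texts: List[str],
    labels: List[int],
    generate_fn: Callable[[List[str]], str],
    target_ratio: float = 0.5,
    prompt_size: int = 3,
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Append synthetic positives until positives / negatives >= target_ratio.

    generate_fn is an assumed interface: it receives a few real positive
    samples and returns one new synthetic positive sample.
    """
    rng = random.Random(seed)
    positives = [t for t, y in zip(texts, labels) if y == 1]
    n_negatives = sum(1 for y in labels if y == 0)
    if not positives:
        raise ValueError("need at least one real positive sample to prompt with")

    # Number of synthetic samples required to reach the requested ratio.
    n_needed = max(0, math.ceil(target_ratio * n_negatives) - len(positives))

    out_texts, out_labels = list(texts), list(labels)
    for _ in range(n_needed):
        # Condition each generation on a small random subset of real positives.
        examples = rng.sample(positives, k=min(prompt_size, len(positives)))
        out_texts.append(generate_fn(examples))
        out_labels.append(1)
    return out_texts, out_labels


def toy_generator(examples: List[str]) -> str:
    # Placeholder only: a real setup would query a text generator here.
    return "SYNTHETIC: " + " / ".join(examples)


# Example: 10 benign messages, 2 malicious ones, augmented to a 1:2 ratio.
train_texts = [f"benign message {i}" for i in range(10)] + ["win a prize now", "claim your reward"]
train_labels = [0] * 10 + [1, 1]
aug_texts, aug_labels = augment_positive_class(train_texts, train_labels, toy_generator)
```

In practice the synthetic samples would be added to the training split only, so that evaluation still reflects real data.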
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Spam and Phishing Detection · Authorship Attribution and Profiling
Methods: Multi-Head Attention · Attention Is All You Need · Attention Dropout · Softmax · Dense Connections · Cosine Annealing · Adam · Residual Connection · Byte Pair Encoding
