ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation
Peiran Li, Jan Fillies, Adrian Paschke

TL;DR
ToxiGAN is a novel framework that uses large language models and adversarial training to generate diverse, class-specific toxic language data, improving toxicity classifier robustness.
Contribution
ToxiGAN introduces a dynamic, LLM-guided adversarial augmentation method that uses LLM-generated neutral texts as semantic ballast and a two-step directional training strategy to counter mode collapse and semantic drift.
Findings
Outperforms existing augmentation methods on four hate speech benchmarks.
Significantly improves macro-F1 and hate-F1 scores.
Semantic ballast and directional training enhance robustness.
Abstract
Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that…
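The idea of steering generated samples away from neutral exemplars can be illustrated with a small sketch. This is not the authors' implementation: the bag-of-words vectors stand in for a real sentence encoder, and the helper names (`bow`, `cosine`, `select_ballast`, `divergence_score`), the nearest-neighbour selection rule, and the "1 minus mean cosine similarity" score are all assumptions chosen for illustration. The abstract does not specify how exemplars are selected or how divergence is measured.

```python
import math
from collections import Counter


def bow(text):
    """Toy bag-of-words vector (assumption: stands in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_ballast(candidate, neutral_pool, k=2):
    """Hypothetical selection rule: the k neutral texts closest to the candidate."""
    cand_vec = bow(candidate)
    ranked = sorted(neutral_pool, key=lambda n: cosine(cand_vec, bow(n)), reverse=True)
    return ranked[:k]


def divergence_score(candidate, ballast):
    """Hypothetical score: 1 minus mean similarity to the selected neutral exemplars.

    A higher value means the candidate sits further from the neutral texts.
    """
    cand_vec = bow(candidate)
    sims = [cosine(cand_vec, bow(n)) for n in ballast]
    return 1.0 - sum(sims) / len(sims)


# Placeholder texts only; the pool would come from an LLM in the described setup.
neutral_pool = [
    "the weather is nice today",
    "i like this movie a lot",
    "the bus was late again",
]
near_neutral = "the weather is nice today"
far_from_neutral = "placeholder sample with different vocabulary"

for text in (near_neutral, far_from_neutral):
    ballast = select_ballast(text, neutral_pool, k=1)
    print(text, "->", round(divergence_score(text, ballast), 3))
```

In an adversarial setup, a score of this kind could serve as one term in the generator's training signal alongside the discriminator's feedback; how ToxiGAN actually combines these signals is not stated in the text above.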
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Topic Modeling
