ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation

Peiran Li; Jan Fillies; Adrian Paschke

arXiv:2601.03121·cs.CL·January 7, 2026

TL;DR

ToxiGAN is a novel framework that uses large language models and adversarial training to generate diverse, class-specific toxic language data, improving toxicity classifier robustness.

Contribution

The paper introduces a dynamic, LLM-guided adversarial augmentation method that uses semantic ballast (LLM-generated neutral texts) and two-step directional training to improve toxic data generation.

Findings

01 Outperforms existing augmentation methods on hate speech benchmarks.

02 Improves macro-F1 and hate-F1 scores significantly.

03 Semantic ballast and directional training enhance robustness.

Abstract

Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that…
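The idea of optimizing toxic samples to diverge from dynamically selected neutral exemplars can be illustrated with a toy sketch. The abstract does not say how exemplars are selected or how divergence is measured, so everything here is an assumption made for illustration: texts are represented as plain embedding vectors, the hypothetical `select_ballast` picks the neutral exemplars closest to a candidate, and the hypothetical `divergence_reward` scores a candidate by one minus its mean cosine similarity to them. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_ballast(candidate, neutral_pool, k=2):
    """Hypothetical selection rule: the k neutral exemplars most similar to the candidate."""
    ranked = sorted(neutral_pool, key=lambda n: cosine(candidate, n), reverse=True)
    return ranked[:k]


def divergence_reward(candidate, ballast):
    """Hypothetical reward: higher when the candidate is far from the neutral exemplars."""
    mean_sim = sum(cosine(candidate, n) for n in ballast) / len(ballast)
    return 1.0 - mean_sim


# Toy 2-D "embeddings" standing in for generated and LLM-written neutral texts.
neutral_pool = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
candidate = [1.0, 0.0]

ballast = select_ballast(candidate, neutral_pool, k=2)
reward = divergence_reward(candidate, ballast)
```

In this toy setup, a candidate that coincides with a neutral exemplar earns a low reward, while one orthogonal to all selected exemplars earns a higher one. In an adversarial setup, a signal of this kind would be one term in the generator's objective, alongside the discriminator's feedback.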

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Topic Modeling