Diversifying Toxicity Search in Large Language Models Through Speciation
Onkar Shelar, Travis Desell

TL;DR
This paper introduces ToxSearch-S, a method that diversifies toxicity prompt search in large language models by maintaining multiple prompt niches, improving coverage of failure modes and toxicity levels.
Contribution
It presents a novel speciated quality-diversity approach that enhances toxicity search by preserving diverse prompt niches and exploring multiple failure modes simultaneously.
Findings
ToxSearch-S achieves higher peak toxicity than the baseline.
It covers a broader semantic range, with higher topic diversity.
Its prompt niches are well separated in embedding space.
Abstract
Evolutionary prompt search is a practical black-box approach for red teaming large language models; however, existing methods often collapse onto a small family of high-performing prompts, limiting coverage of distinct failure modes. We present a speciated quality-diversity extension of ToxSearch that maintains multiple high-toxicity prompt niches in parallel rather than optimizing a single best prompt. ToxSearch-S introduces unsupervised prompt speciation via a search methodology that maintains capacity-limited species with exemplar leaders, a reserve pool for emerging niches, and species-aware parent selection that trades off within-niche exploitation against cross-niche exploration. Preliminary results show ToxSearch-S reaching higher peak toxicity ( vs. ) with a heavier tail (top-10 median vs. ) than the baseline.…
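The three mechanisms named in the abstract (capacity-limited species with exemplar leaders, a reserve pool, and species-aware parent selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-distance radius, the capacity, the rule that the most toxic member becomes leader, the promotion rule for the reserve, and every name and default value below are assumptions made for illustration only.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical sketch: all names, thresholds, and rules are illustrative assumptions.

@dataclass
class Prompt:
    text: str
    embedding: tuple  # assumed output of some sentence-embedding model
    toxicity: float   # assumed score in [0, 1] from some toxicity classifier

@dataclass
class Species:
    leader: Prompt
    members: list = field(default_factory=list)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def assign(prompt, species, reserve, radius=0.3, capacity=3):
    """Add a prompt to the nearest species within `radius`, else to the reserve."""
    nearest = min(
        species,
        key=lambda s: cosine_distance(prompt.embedding, s.leader.embedding),
        default=None,
    )
    if nearest is None or cosine_distance(
        prompt.embedding, nearest.leader.embedding
    ) > radius:
        reserve.append(prompt)  # may seed an emerging niche later
        return
    nearest.members.append(prompt)
    # Capacity limit: keep only the most toxic members (assumed eviction rule).
    nearest.members.sort(key=lambda p: p.toxicity, reverse=True)
    del nearest.members[capacity:]
    nearest.leader = nearest.members[0]  # assumed: best member is the exemplar

def promote(reserve, species, radius=0.3, min_size=2):
    """Form a new species from a tight group of reserve prompts, if one exists."""
    for seed in reserve:
        group = [
            p for p in reserve
            if cosine_distance(seed.embedding, p.embedding) <= radius
        ]
        if len(group) >= min_size:
            leader = max(group, key=lambda p: p.toxicity)
            species.append(Species(leader=leader, members=group))
            reserve[:] = [p for p in reserve if p not in group]
            return True
    return False

def select_parents(species, explore_prob=0.25, rng=random):
    """Pick two parents: same species (exploit) or different species (explore)."""
    home = rng.choice(species)
    first = rng.choice(home.members)
    others = [s for s in species if s is not home]
    if others and rng.random() < explore_prob:
        second = rng.choice(rng.choice(others).members)
    else:
        second = rng.choice(home.members)
    return first, second
```

In this sketch, `explore_prob` is the single knob for the exploitation/exploration trade-off: at 0 both parents always come from one niche, and higher values mix niches more often. A prompt that matches no existing species is not discarded but waits in the reserve until enough similar prompts accumulate to justify a new species.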
