Diversifying Toxicity Search in Large Language Models Through Speciation

Onkar Shelar; Travis Desell

arXiv:2601.20981·cs.NE·April 22, 2026

TL;DR

This paper introduces ToxSearch-S, a method that diversifies toxicity prompt search in large language models by maintaining multiple prompt niches, improving coverage of failure modes and toxicity levels.

Contribution

It presents a novel speciated quality-diversity approach that enhances toxicity search by preserving diverse prompt niches and exploring multiple failure modes simultaneously.

Findings

01

ToxSearch-S achieves higher peak toxicity than the baseline (≈0.73 vs. ≈0.47).

02

ToxSearch-S yields broader semantic coverage with higher topic diversity.

03

Prompt niches are well separated in embedding space.

Abstract

Evolutionary prompt search is a practical black-box approach for red teaming large language models; however, existing methods often collapse onto a small family of high-performing prompts, limiting coverage of distinct failure modes. We present a speciated quality-diversity extension of ToxSearch that maintains multiple high-toxicity prompt niches in parallel rather than optimizing a single best prompt. ToxSearch-S introduces unsupervised prompt speciation via a search methodology that maintains capacity-limited species with exemplar leaders, a reserve pool for emerging niches, and species-aware parent selection that trades off within-niche exploitation and cross-niche exploration. Preliminary results show ToxSearch-S reaching higher peak toxicity (≈0.73 vs. ≈0.47) with a heavier tail (top-10 median 0.66 vs. 0.45) than the baseline.…
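The abstract names three mechanisms: capacity-limited species with exemplar leaders, a reserve pool, and species-aware parent selection. The sketch below is one plausible reading of those ideas, not the authors' implementation. Everything specific in it is an assumption made for illustration: the cosine-distance radius, the rule that the highest-scoring member leads a species, the toxicity threshold for founding a species, and all names (SpeciatedPopulation, radius, capacity, min_leader_toxicity, explore_prob). Embeddings and toxicity scores are taken as given inputs.

```python
import math
import random
from dataclasses import dataclass, field


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


@dataclass
class Individual:
    prompt: str
    embedding: list[float]
    toxicity: float  # score in [0, 1] from some external evaluator (assumed)


@dataclass
class Species:
    leader: Individual
    members: list[Individual] = field(default_factory=list)


class SpeciatedPopulation:
    """Hypothetical leader-based speciation with a reserve pool."""

    def __init__(self, radius: float = 0.3, capacity: int = 3,
                 min_leader_toxicity: float = 0.2) -> None:
        self.radius = radius                            # assumed niche radius
        self.capacity = capacity                        # max members per species
        self.min_leader_toxicity = min_leader_toxicity  # assumed founding threshold
        self.species: list[Species] = []
        self.reserve: list[Individual] = []

    def add(self, ind: Individual) -> None:
        # Find the species whose leader is closest in embedding space.
        nearest = min(
            self.species,
            key=lambda s: cosine_distance(ind.embedding, s.leader.embedding),
            default=None,
        )
        if (nearest is not None and
                cosine_distance(ind.embedding, nearest.leader.embedding) <= self.radius):
            # Join the niche, keep only the top `capacity` members by toxicity,
            # and let the best member act as the exemplar leader.
            nearest.members.append(ind)
            nearest.members.sort(key=lambda m: m.toxicity, reverse=True)
            del nearest.members[self.capacity:]
            nearest.leader = nearest.members[0]
        elif ind.toxicity >= self.min_leader_toxicity:
            # Far from every leader and strong enough: found a new species.
            self.species.append(Species(leader=ind, members=[ind]))
        else:
            # Far from every leader but weak: hold in the reserve pool.
            self.reserve.append(ind)

    def select_parents(self, rng: random.Random,
                       explore_prob: float = 0.3) -> tuple[Individual, Individual]:
        """Pick two parents; with probability explore_prob the second comes
        from a different species (cross-niche), otherwise from the same one."""
        home = rng.choice(self.species)
        first = rng.choice(home.members)
        others = [s for s in self.species if s is not home]
        if others and rng.random() < explore_prob:
            second = rng.choice(rng.choice(others).members)
        else:
            second = rng.choice(home.members)
        return first, second
```

In this sketch, raising explore_prob shifts parent selection toward cross-niche exploration, while a smaller radius produces more, tighter species. How the reserve pool is later promoted into species is left out, since the abstract does not describe it.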
