ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models
Onkar Shelar, Travis Desell

TL;DR
ToxSearch is a black-box evolutionary framework that tests the safety of large language models by evolving prompts to elicit toxic responses, revealing both per-model vulnerabilities and how well adversarial prompts transfer across models.
Contribution
This paper introduces ToxSearch, a novel black-box evolutionary method for toxicity testing in language models, highlighting the effectiveness of small perturbations and cross-model transfer of adversarial prompts.
Findings
Lexical substitutions offer the best yield-variance trade-off among the operators.
Semantic-similarity crossover acts as a precise but low-throughput inserter.
Elite prompts evolved on LLaMA 3.1 8B transfer to other models in attenuated form, with toxicity roughly halving on most targets.
Abstract
Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller…
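The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tournament selection, the replace-worst rule, and the reading of "synchronous" as "each child is scored before the next one is generated" are all assumptions. The operators, the echo model, and the oracle are deliberately harmless toy stand-ins (the oracle just scores the fraction of words equal to "blue"), so the example only shows the control flow of a steady-state search guided by a black-box score.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical signature: an operator maps (parent, population, rng) -> child prompt.
Operator = Callable[[str, List[str], random.Random], str]

VOCAB = ["blue", "quiet", "river", "stone"]  # toy vocabulary, assumption


def word_swap(parent: str, population: List[str], rng: random.Random) -> str:
    """Toy stand-in for a lexical substitution: overwrite one random word."""
    words = parent.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)


def crossover(parent: str, population: List[str], rng: random.Random) -> str:
    """Toy stand-in for crossover: head of the parent plus tail of a random mate."""
    a, b = parent.split(), rng.choice(population).split()
    return " ".join(a[: len(a) // 2] + b[len(b) // 2:])


def echo_model(prompt: str) -> str:
    """Placeholder target model: returns the prompt unchanged."""
    return prompt


def toy_oracle(response: str) -> float:
    """Placeholder scoring oracle in [0, 1]: fraction of words equal to 'blue'."""
    words = response.split()
    return sum(w == "blue" for w in words) / len(words)


def steady_state_search(
    seeds: List[str],
    operators: List[Operator],
    respond: Callable[[str], str],
    oracle: Callable[[str], float],
    steps: int,
    rng: random.Random,
) -> List[Tuple[float, str]]:
    """One child per step; it replaces the current worst member only if it scores higher."""
    population = [(oracle(respond(p)), p) for p in seeds]
    for _ in range(steps):
        # Binary tournament: pick the better of two random members as parent.
        parent = max(rng.sample(population, k=min(2, len(population))))[1]
        child = rng.choice(operators)(parent, [p for _, p in population], rng)
        score = oracle(respond(child))  # scored before the next child is generated
        worst = min(range(len(population)), key=lambda i: population[i][0])
        if score > population[worst][0]:
            population[worst] = (score, child)
    return sorted(population, reverse=True)


if __name__ == "__main__":
    seeds = ["the sky is grey today", "a red car drove past", "green tea tastes mild"]
    ranked = steady_state_search(
        seeds, [word_swap, crossover], echo_model, toy_oracle, steps=200, rng=random.Random(0)
    )
    for score, prompt in ranked:
        print(f"{score:.2f}  {prompt}")
```

Because a child only ever displaces the worst member, the best score in the population never decreases. In a real setup the oracle would be a moderation classifier and the responder a call to the target model, and refusals would need their own handling.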
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
