ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

Onkar Shelar; Travis Desell

arXiv:2511.12487 · cs.NE · January 27, 2026

Open Access

TL;DR

ToxSearch is a black-box evolutionary framework that tests the safety of large language models by evolving prompts that elicit toxic responses, exposing vulnerabilities and showing that evolved prompts transfer, in attenuated form, across models.

Contribution

This paper introduces ToxSearch, a novel black-box evolutionary method for toxicity testing in language models, highlighting the effectiveness of small perturbations and cross-model transfer of adversarial prompts.

Findings

01

Lexical substitutions offer the best yield-variance trade-off among the operators.

02

Semantic-similarity crossover acts as a precise but low-throughput inserter.

03

Elite prompts evolved on LLaMA 3.1 8B transfer partially to other models, with toxicity roughly halving on most targets.

Abstract

Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller…
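
The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_target_model`, `toy_oracle`, `VOCAB`, the two toy operators, and the replace-the-worst rule are all hypothetical stand-ins chosen so that the example runs without any model or moderation service. The abstract does not specify the paper's selection and replacement rules, so the ones here are assumptions.

```python
import random
from typing import Callable, List, Tuple

Prompt = str
Operator = Callable[[List[Prompt], random.Random], Prompt]

# Hypothetical vocabulary; "flag" is a harmless placeholder token that the toy oracle rewards.
VOCAB = ["flag", "alpha", "beta", "gamma"]


def toy_target_model(prompt: Prompt) -> str:
    """Stand-in for the black-box target model: simply echoes the prompt."""
    return prompt


def toy_oracle(response: str) -> float:
    """Stand-in for the moderation oracle: fraction of words equal to 'flag'."""
    words = response.split()
    return sum(w == "flag" for w in words) / len(words) if words else 0.0


def lexical_substitution(parents: List[Prompt], rng: random.Random) -> Prompt:
    """Mutation: replace one word of a random parent with a vocabulary word."""
    words = rng.choice(parents).split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)


def one_point_crossover(parents: List[Prompt], rng: random.Random) -> Prompt:
    """Crossover: splice the head of one parent onto the tail of another."""
    a, b = rng.sample(parents, 2)
    wa, wb = a.split(), b.split()
    cut = rng.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])


def evolve(
    seeds: List[Prompt],
    operators: List[Operator],
    steps: int,
    rng: random.Random,
) -> List[Tuple[float, Prompt]]:
    """Steady-state loop: one child per step, replacing the worst member if better."""
    population = [(toy_oracle(toy_target_model(p)), p) for p in seeds]
    for _ in range(steps):
        parents = [p for _, p in population]
        child = rng.choice(operators)(parents, rng)
        score = toy_oracle(toy_target_model(child))  # fitness from the oracle
        worst = min(range(len(population)), key=lambda i: population[i][0])
        if score > population[worst][0]:
            population[worst] = (score, child)
    return sorted(population, reverse=True)  # highest-scoring prompts first


SEEDS = ["alpha beta gamma alpha", "beta beta alpha gamma",
         "gamma alpha beta beta", "alpha gamma gamma beta"]
RESULT = evolve(SEEDS, [lexical_substitution, one_point_crossover],
                steps=200, rng=random.Random(0))
```

In a real setup the two stand-ins would be replaced by calls to the model under test and to a moderation scorer, and the operator list would include the paper's operators (negation, back-translation, paraphrasing, semantic crossover), none of which are reproduced here.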

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection