Test-Time Detoxification without Training or Learning Anything
Baturay Saglam, Dionysis Kalogerias

TL;DR
This paper presents a test-time method that reduces toxicity in language model outputs by applying zeroth-order (black-box) optimization to the input embeddings, avoiding retraining and access to model gradients.
Contribution
Introduces a test-time detoxification technique based on zeroth-order optimization that requires no training and no access to model internals beyond the input embeddings and forward evaluations.
Findings
Reduces toxicity significantly across multiple models and prompts.
Achieves a favorable toxicity-quality trade-off in most settings.
Operates with only input embeddings, toxicity scores, and forward evaluations of the model.
Abstract
Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model's generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach…
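The idea in the abstract (estimate the gradient of a toxicity score with respect to the input embeddings from forward evaluations only, then take a few descent steps) can be illustrated with a minimal sketch. This is not the authors' implementation: the two-point Gaussian-smoothing estimator, the function names `zo_gradient` and `detoxify_embedding`, the hyperparameters, and the smooth `toy_toxicity` stand-in are all assumptions made for illustration. In a real setting the score function would generate a completion from the perturbed embeddings and score it with a toxicity classifier, which is noisier than the toy objective used here.

```python
import random


def zo_gradient(score_fn, x, rng, num_dirs=16, mu=0.05):
    """Two-point zeroth-order gradient estimate of score_fn at x.

    Only forward evaluations of score_fn are used; no backpropagation.
    """
    d = len(x)
    grad = [0.0] * d
    for _ in range(num_dirs):
        # Random Gaussian probe direction in embedding space.
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Finite-difference estimate of the directional derivative along u.
        diff = (score_fn(x_plus) - score_fn(x_minus)) / (2.0 * mu)
        for i in range(d):
            grad[i] += diff * u[i] / num_dirs
    return grad


def detoxify_embedding(score_fn, x0, steps=20, lr=0.1, seed=0):
    """Run a small number of descent steps on the estimated gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(score_fn, x, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def toy_toxicity(x):
    # Hypothetical stand-in for "generate a completion from embeddings x,
    # then score it with a toxicity classifier". Here: mean squared value.
    return sum(xi * xi for xi in x) / len(x)


if __name__ == "__main__":
    x0 = [1.0, -2.0, 0.5, 1.5]
    x = detoxify_embedding(toy_toxicity, x0)
    print("score before:", toy_toxicity(x0))
    print("score after: ", toy_toxicity(x))
```

Each descent step in this sketch costs `2 * num_dirs` forward evaluations of the score function, so the number of probe directions and steps directly sets the test-time overhead.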
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning
