Efficient Detection of Toxic Prompts in Large Language Models
Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu

TL;DR
This paper introduces ToxicDetector, a lightweight greybox method for detecting toxic prompts in large language models. It reports 96.39% accuracy, a 2.00% false positive rate, and 0.078 seconds of processing per prompt, which makes it suitable for real-time applications.
Contribution
The paper presents ToxicDetector, a novel method combining toxic concept prompts, embedding features, and an MLP classifier to improve detection of toxic prompts in LLMs.
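The three components named above (toxic concept prompts, embedding features, and an MLP classifier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how embeddings are combined into features, so cosine similarity against each concept prompt is assumed here. `toy_embed` is a hypothetical stand-in for an LLM embedding, `CONCEPT_PROMPTS` is an invented placeholder list, and `TinyMLP` uses random, untrained weights, so its scores carry no meaning.

```python
import hashlib
import math
import random

DIM = 16  # hypothetical embedding width; a real LLM hidden size is far larger

# Placeholder concept prompts (assumption: in the paper these are LLM-generated).
CONCEPT_PROMPTS = [
    "request for instructions that could cause harm",
    "request for hateful or abusive content",
    "request to bypass safety rules",
]


def toy_embed(text: str) -> list[float]:
    """Stand-in for an LLM embedding: a deterministic pseudo-random vector per text."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_vector(prompt: str, concepts: list[str]) -> list[float]:
    """One feature per concept prompt: similarity of the user prompt to that concept.

    Assumption: cosine similarity is an illustrative choice, not the paper's.
    """
    prompt_emb = toy_embed(prompt)
    return [cosine(prompt_emb, toy_embed(c)) for c in concepts]


class TinyMLP:
    """One hidden ReLU layer and a sigmoid output, with random (untrained) weights."""

    def __init__(self, n_in: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def predict_proba(self, x: list[float]) -> float:
        """Return a score in (0, 1); meaningful only after training on labeled prompts."""
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


def score_prompt(prompt: str, model: TinyMLP) -> float:
    """End-to-end path: prompt -> feature vector -> classifier score."""
    return model.predict_proba(feature_vector(prompt, CONCEPT_PROMPTS))
```

In a real setting the embedding would come from the target model and the classifier would be trained on labeled toxic and benign prompts. The point of the sketch is only the data flow: the per-prompt cost is one embedding pass plus a small classifier, which is what keeps this kind of detector lightweight.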
Findings
Achieves 96.39% accuracy in toxic prompt detection.
Low false positive rate of 2.00%.
Processing time of 0.078 seconds per prompt.
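For readers less familiar with these metrics, the sketch below shows how accuracy and false positive rate are computed from a confusion matrix, and what 0.078 seconds per prompt implies for throughput. The confusion-matrix counts are invented for illustration and are not taken from the paper; only the 0.078 s figure comes from the findings above.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all prompts classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of benign prompts wrongly flagged as toxic."""
    return fp / (fp + tn)


def prompts_per_second(seconds_per_prompt: float) -> float:
    """Sequential throughput implied by a per-prompt latency."""
    return 1.0 / seconds_per_prompt


# Illustrative counts only (assumption, not the paper's data):
# 100 toxic prompts (90 caught, 10 missed), 100 benign prompts (98 passed, 2 flagged).
example_accuracy = accuracy(tp=90, tn=98, fp=2, fn=10)  # 188 / 200 = 0.94
example_fpr = false_positive_rate(fp=2, tn=98)          # 2 / 100 = 0.02

# Reported latency of 0.078 s per prompt is roughly 12.8 prompts per second.
reported_throughput = prompts_per_second(0.078)
```

Note that the false positive rate is measured over benign prompts only, so a 2.00% rate means about 2 in every 100 harmless prompts would be flagged.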
Abstract
Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron…
Taxonomy
Topics: Risk and Safety Analysis · Topic Modeling · Software Engineering Research
Methods: LLaMA
