Efficient Detection of Toxic Prompts in Large Language Models

Yi Liu; Junzhe Yu; Huijia Sun; Ling Shi; Gelei Deng; Yuqi Chen; Yang Liu

arXiv:2408.11727 · cs.CR · September 3, 2025

Open Access

TL;DR

This paper introduces ToxicDetector, a lightweight greybox approach that efficiently detects toxic prompts in large language models with high accuracy and low false positives, suitable for real-time applications.

Contribution

The paper presents ToxicDetector, a novel method combining toxic concept prompts, embedding features, and an MLP classifier to improve detection of toxic prompts in LLMs.
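
The three components named above can be sketched roughly as follows. This is a toy illustration and not the authors' implementation: the cosine-similarity features, the hand-set weights, and every name in the block (`feature_vector`, `mlp_forward`, `toxicity_score`, `CONCEPT_EMBS`) are assumptions made for the example. In the paper the embeddings come from the LLM itself, whereas here they are tiny hard-coded vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def feature_vector(prompt_emb, concept_embs):
    """One feature per toxic concept prompt: similarity to the user prompt.
    Assumption: the paper's actual feature construction may differ."""
    return [cosine(prompt_emb, c) for c in concept_embs]

def mlp_forward(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a sigmoid output in (0, 1)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical stand-ins for embeddings of two toxic concept prompts.
CONCEPT_EMBS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical hand-set weights; a real classifier would be trained on labeled prompts.
W1 = [[1.0, 1.0], [-1.0, -1.0]]
B1 = [0.0, 0.0]
W2 = [4.0, -4.0]
B2 = -2.0

def toxicity_score(prompt_emb):
    """Score a prompt embedding; higher means closer to the toxic concepts."""
    return mlp_forward(feature_vector(prompt_emb, CONCEPT_EMBS), W1, B1, W2, B2)

if __name__ == "__main__":
    print(toxicity_score([0.9, 0.1, 0.0]))  # aligned with a toxic concept
    print(toxicity_score([0.0, 0.0, 1.0]))  # orthogonal to both concepts
```

A prompt embedding that points toward a concept embedding scores above 0.5 in this toy setup, and one orthogonal to all concepts scores below it. The design point the sketch tries to convey is that the classifier is small and operates on features derived from embeddings, which is consistent with the paper's description of the method as lightweight.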

Findings

01. Achieves 96.39% accuracy in toxic prompt detection.

02. Low false positive rate of 2.00%.

03. Processing time of 0.078 seconds per prompt.
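
For readers unfamiliar with the two metrics reported above, the following sketch shows their standard definitions computed from confusion-matrix counts. The counts in the usage lines are made up for illustration and are not the paper's data; the function names are likewise hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all prompts (toxic and benign) that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(fp, tn):
    """Fraction of benign prompts that were wrongly flagged as toxic."""
    return fp / (fp + tn)

if __name__ == "__main__":
    # Illustrative counts only: 50 toxic prompts and 50 benign prompts.
    tp, fn = 45, 5   # toxic prompts caught / missed
    tn, fp = 48, 2   # benign prompts passed / wrongly flagged
    print(accuracy(tp, tn, fp, fn))     # 93 correct out of 100
    print(false_positive_rate(fp, tn))  # 2 flagged out of 50 benign
```

The false positive rate is computed over benign prompts only, so it can stay low even when overall accuracy is dragged down by missed toxic prompts.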

Abstract

Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron…

Taxonomy

Topics: Risk and Safety Analysis · Topic Modeling · Software Engineering Research

Methods: LLaMA