Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information

Zhengmian Hu; Gang Wu; Saayan Mitra; Ruiyi Zhang; Tong Sun; Heng Huang; and Viswanathan Swaminathan

arXiv:2311.11509·cs.CL·February 20, 2024·5 cites

Open Access

TL;DR

This paper introduces a method for detecting adversarial prompts to LLMs at the token level, using perplexity measures together with contextual information from neighboring tokens.

Contribution

It proposes two algorithms that combine per-token perplexity with context to identify which tokens in a prompt are adversarial.

Findings

01 Effective detection of adversarial tokens via perplexity analysis

02 Visualization of adversarial regions as heatmaps

03 Algorithms demonstrate efficiency and accuracy in detection

Abstract

In recent years, Large Language Models (LLMs) have emerged as pivotal tools in various applications. However, these models are susceptible to adversarial prompt attacks, where attackers can carefully craft input strings that mislead LLMs into generating incorrect or undesired outputs. Previous work has revealed that with relatively simple yet effective attacks based on discrete optimization, it is possible to generate adversarial prompts that bypass moderation and alignment of the models. This vulnerability to adversarial prompts underscores a significant concern regarding the robustness and reliability of LLMs. Our work aims to address this concern by introducing a novel approach to detecting adversarial prompts at a token level, leveraging the LLM's capability to predict the next token's probability. We measure the degree of the model's perplexity, where tokens predicted with high…
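The general idea in the abstract, scoring each token by how surprising it is to the model and taking neighboring tokens into account, can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the moving-average smoothing, the `threshold` value, and the `window` size are all assumptions chosen for illustration, and the per-token probabilities are supplied by hand rather than produced by a real LLM.

```python
import math

def token_surprisal(probs):
    """Return the negative log-probability (in nats) of each token.

    `probs` is a hypothetical list holding the probability the language
    model assigned to each observed token given its preceding context.
    """
    return [-math.log(p) for p in probs]

def flag_tokens(probs, threshold=6.0, window=1):
    """Flag tokens whose neighborhood-averaged surprisal exceeds `threshold`.

    Averaging over `window` tokens on each side is an assumed, simplified
    stand-in for "contextual information"; it is not the authors' method.
    """
    surprisal = token_surprisal(probs)
    flags = []
    for i in range(len(surprisal)):
        lo = max(0, i - window)
        hi = min(len(surprisal), i + window + 1)
        avg = sum(surprisal[lo:hi]) / (hi - lo)
        flags.append(avg > threshold)
    return flags

# Hand-made example: three very unlikely tokens in the middle of a prompt.
example_probs = [0.5, 0.4, 1e-4, 1e-5, 1e-4, 0.3]
example_flags = flag_tokens(example_probs)
```

With these made-up probabilities, the three low-probability tokens in the middle are flagged and the surrounding tokens are not. The smoothing step reflects the intuition that adversarial strings tend to occupy contiguous spans, so a single surprising token in otherwise fluent text is weaker evidence than a run of them.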

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques

Methods: Heatmap