Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models

Xiaomeng Hu; Pin-Yu Chen; Tsung-Yi Ho

arXiv:2412.18171 · cs.CR · December 30, 2024

TL;DR

This paper introduces Token Highlighter, a method that detects and mitigates jailbreak prompts in large language models by identifying jailbreak-critical tokens in the user query and shrinking their embeddings.

Contribution

The paper proposes two techniques, Affirmation Loss and Soft Removal, to locate and neutralize jailbreak-critical tokens, improving LLM safety without significant performance loss.

Findings

01. Effective defense against various jailbreak attacks

02. Maintains performance on benign queries

03. Cost-efficient and interpretable method

Abstract

Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries. To mitigate potential harm and prevent misuse, there have been concerted efforts to align LLMs with human values and legal compliance by incorporating various techniques, such as Reinforcement Learning from Human Feedback (RLHF), into their training. However, recent research has shown that even aligned LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called Token Highlighter to inspect and mitigate potential jailbreak threats in the user query. Token Highlighter introduces a concept called Affirmation Loss to measure the LLM's willingness to answer the user query. It then uses the gradient of Affirmation Loss for each token in the user query to locate the…
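
The two steps named on this page (locating tokens via the gradient of Affirmation Loss, then shrinking their embeddings) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the per-token gradients have already been computed elsewhere (for example by backpropagating a loss through the model to the input embeddings), and it assumes that a token's influence is scored by the L2 norm of its gradient. The names `token_influence`, `soft_removal`, `top_fraction`, and `shrink` are hypothetical, and the numbers are toy values.

```python
import math


def token_influence(grads):
    """Score each token by the L2 norm of its gradient vector (an assumed scoring rule)."""
    return [math.sqrt(sum(g * g for g in vec)) for vec in grads]


def soft_removal(embeddings, grads, top_fraction=0.25, shrink=0.5):
    """Shrink the embeddings of the highest-scoring tokens.

    embeddings, grads: lists of equal-length float vectors, one per token.
    top_fraction: hypothetical share of tokens treated as critical.
    shrink: hypothetical factor in (0, 1) applied to critical embeddings.
    Returns (new_embeddings, sorted indices of the critical tokens).
    """
    scores = token_influence(grads)
    k = max(1, int(len(scores) * top_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    critical = set(ranked[:k])
    shrunk = [
        [x * shrink for x in vec] if i in critical else list(vec)
        for i, vec in enumerate(embeddings)
    ]
    return shrunk, sorted(critical)


# Toy data: four tokens with 2-dimensional embeddings; token 2 has the largest gradient.
emb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
grads = [[0.1, 0.0], [0.0, 0.2], [3.0, 4.0], [0.3, 0.1]]

shrunk, critical = soft_removal(emb, grads)
print(critical)  # -> [2]
```

Scaling an embedding down rather than deleting the token keeps the query length and most of its content intact, which fits the page's claim that performance on benign queries is maintained. The specific fraction and factor used here are illustrative only.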

Taxonomy

Topics: Digital and Cyber Forensics · Artificial Intelligence in Law · Adversarial Robustness in Machine Learning