Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models
Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho

TL;DR
This paper introduces Token Highlighter, a method that detects and mitigates jailbreak prompts in large language models by identifying jailbreak-critical tokens and shrinking their embeddings to defend against jailbreak attacks.
Contribution
The paper proposes two techniques, Affirmation Loss and Soft Removal, to locate and neutralize jailbreak-critical tokens, improving LLM safety without significant performance loss on benign queries.
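The locate-then-shrink idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that per-token gradients of the Affirmation Loss with respect to the token embeddings have already been computed and are passed in as plain lists, that tokens are ranked by the L2 norm of their gradient, and that the selected embeddings are multiplied by a constant factor. The function name `soft_removal` and the parameters `top_fraction` and `shrink` are hypothetical, chosen only for illustration.

```python
import math


def soft_removal(embeddings, gradients, top_fraction=0.25, shrink=0.5):
    """Shrink the embeddings of the tokens with the largest gradient norms.

    embeddings, gradients: one vector (list of floats) per token.
    top_fraction and shrink are illustrative knobs, not values from the paper.
    """
    if len(embeddings) != len(gradients):
        raise ValueError("need one gradient vector per token embedding")
    # Score each token by the L2 norm of its gradient (an assumed metric).
    scores = [math.sqrt(sum(g * g for g in grad)) for grad in gradients]
    # Select at least one token whenever the prompt is non-empty.
    k = max(1, int(len(scores) * top_fraction)) if scores else 0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    critical = set(ranked[:k])
    # Scale the selected embeddings; copy the rest unchanged.
    softened = [
        [x * shrink for x in emb] if i in critical else list(emb)
        for i, emb in enumerate(embeddings)
    ]
    return softened, sorted(critical)


# Toy usage: four tokens with 2-d embeddings; token 2 has the largest gradient.
embeddings = [[1.0, 2.0], [0.5, 0.5], [4.0, -2.0], [1.0, 1.0]]
gradients = [[0.1, 0.0], [0.0, 0.2], [3.0, 4.0], [0.3, 0.0]]
softened, critical = soft_removal(embeddings, gradients)
```

Scaling the embeddings down, rather than deleting the tokens, leaves the rest of the query intact, which fits the summary's claim that benign queries are largely unaffected.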
Findings
Effective defense against various Jailbreak Attacks
Maintains performance on benign queries
Cost-efficient and interpretable method
Abstract
Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries. To mitigate potential harm and prevent misuse, there have been concerted efforts to align the LLMs with human values and legal compliance by incorporating various techniques, such as Reinforcement Learning from Human Feedback (RLHF), into the training of the LLMs. However, recent research has exposed that even aligned LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called Token Highlighter to inspect and mitigate the potential jailbreak threats in the user query. Token Highlighter introduces a concept called Affirmation Loss to measure the LLM's willingness to answer the user query. It then uses the gradient of Affirmation Loss for each token in the user query to locate the…
Taxonomy
Topics: Digital and Cyber Forensics · Artificial Intelligence in Law · Adversarial Robustness in Machine Learning
