The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models

Xikang Yang; Xuehai Tang; Jizhong Han; Songlin Hu

arXiv:2411.11407 · cs.LG · November 19, 2024

TL;DR

This paper uncovers how LLMs' bias toward authority can be exploited to perform jailbreak attacks, introduces DarkCite for targeted citation-based attacks, and proposes defenses to mitigate these risks.

Contribution

It shows that LLMs' bias toward authority is an exploitable vulnerability, introduces DarkCite for effective citation-driven jailbreak attacks, and proposes a defense strategy to improve safety.

Findings

01 DarkCite achieves higher attack success rates than previous methods.

02 Defense strategies significantly increase the pass rate from 11% to 74%.

03 Authority bias in LLMs amplifies risks and can be exploited for harmful content generation.

Abstract

The widespread deployment of large language models (LLMs) across various domains has showcased their immense potential while exposing significant safety vulnerabilities. A major concern is ensuring that LLM-generated content aligns with human values. Existing jailbreak techniques reveal how this alignment can be compromised through specific prompts or adversarial suffixes. In this study, we introduce a new threat: LLMs' bias toward authority. While this inherent bias can improve the quality of outputs generated by LLMs, it also introduces a potential vulnerability, increasing the risk of producing harmful content. Notably, this bias shows up as varying levels of trust given to different types of authoritative information in harmful queries. For example, queries about malware development tend to place more trust in GitHub. To better reveal these risks in LLMs, we propose DarkCite, an adaptive authority…

Code & Models

Repositories

YancyKahn/DarkCite (Official)

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques