The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu

TL;DR
This paper uncovers how LLMs' bias toward authority can be exploited to perform jailbreak attacks, introduces DarkCite for targeted citation-based attacks, and proposes defenses to mitigate these risks.
Contribution
The paper identifies LLMs' authority bias as a vulnerability, introduces DarkCite, a citation-driven jailbreak attack that exploits it, and proposes a defense strategy to improve safety.
Findings
DarkCite achieves higher attack success rates than previous methods.
The proposed defense strategy raises the pass rate from 11% to 74%.
Authority bias in LLMs amplifies risks and can be exploited for harmful content generation.
Abstract
The widespread deployment of large language models (LLMs) across various domains has showcased their immense potential while exposing significant safety vulnerabilities. A major concern is ensuring that LLM-generated content aligns with human values. Existing jailbreak techniques reveal how this alignment can be compromised through specific prompts or adversarial suffixes. In this study, we introduce a new threat: LLMs' bias toward authority. While this inherent bias can improve the quality of outputs generated by LLMs, it also introduces a potential vulnerability, increasing the risk of producing harmful content. Notably, LLMs extend varying levels of trust to different types of authoritative information in harmful queries. For example, queries about malware development tend to place more trust in GitHub. To better reveal these risks in LLMs, we propose DarkCite, an adaptive authority…
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
