Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Erfan Shayegani; Md Abdullah Al Mamun; Yu Fu; Pedram Zaree; Yue Dong; Nael Abu-Ghazaleh

arXiv:2310.10844 · cs.CL · October 18, 2023 · 35 citations

Open Access

TL;DR

This survey reviews recent research on adversarial attacks against large language models, highlighting vulnerabilities, attack types, defenses, and the importance of security in AI systems.

Contribution

It provides a comprehensive overview, systematic classification, and resources for understanding adversarial vulnerabilities in LLMs, aiding newcomers in the field.

Findings

1. LLMs are vulnerable to various adversarial attacks, including jailbreaks.
2. Existing defenses are still limited and need further development.
3. The survey categorizes attack methods and discusses fundamental sources of vulnerabilities.

Abstract

Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of 'jailbreak' attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
