Robustness of Large Language Models Against Adversarial Attacks
Yiyi Tao, Yixian Shen, Hang Zhang, Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du

TL;DR
This paper evaluates the robustness of GPT large language models against character-level and jailbreak prompt adversarial attacks, revealing significant vulnerabilities and emphasizing the need for improved safety measures.
Contribution
It assesses the robustness of GPT models using two attack methods, character-level text perturbations and jailbreak prompts, and highlights how vulnerability varies across models, guiding future improvements in adversarial defenses.
Findings
Models show significant vulnerability to character-level attacks.
Safety mechanisms are challenged by jailbreak prompts.
Robustness varies across different GPT models.
Abstract
The increasing deployment of Large Language Models (LLMs) in various applications necessitates a rigorous evaluation of their robustness against adversarial attacks. In this paper, we present a comprehensive study on the robustness of the GPT LLM family. We employ two distinct evaluation methods to assess their resilience. The first method introduces character-level text attacks into input prompts, testing the models on three sentiment classification datasets: StanfordNLP/IMDB, Yelp Reviews, and SST-2. The second method uses jailbreak prompts to challenge the safety mechanisms of the LLMs. Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks. These findings underscore the necessity for improved adversarial training and enhanced safety…
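The character-level attack mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which edit operations the authors use, so the four operations below (swap, delete, insert, substitute), the function name `perturb_chars`, and the `rate` and `seed` parameters are all assumptions made for illustration, not the paper's method.

```python
import random
import string


def perturb_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Apply one random character edit to roughly `rate` of the longer words.

    Hypothetical helper: the edit operations are illustrative assumptions,
    not the attack used in the paper.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    out = []
    for word in text.split(" "):
        # Skip short words and, with probability 1 - rate, longer ones too.
        if len(word) < 4 or rng.random() >= rate:
            out.append(word)
            continue
        i = rng.randrange(1, len(word) - 1)  # an interior position
        op = rng.choice(["swap", "delete", "insert", "substitute"])
        if op == "swap":
            # Exchange the character at i with its left neighbour.
            chars = list(word)
            chars[i - 1], chars[i] = chars[i], chars[i - 1]
            word = "".join(chars)
        elif op == "delete":
            word = word[:i] + word[i + 1:]
        elif op == "insert":
            word = word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
        else:  # substitute
            word = word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    review = "This movie was absolutely wonderful and moving"
    print(perturb_chars(review, rate=0.5, seed=1))
```

In an evaluation of this kind, the perturbed review would be placed in the sentiment-classification prompt in place of the original, and the model's accuracy on clean versus perturbed inputs would be compared.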
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
Methods: Cosine Annealing · Residual Connection · Adam · Weight Decay · Linear Warmup With Cosine Annealing · Layer Normalization · Discriminative Fine-Tuning · Linear Layer · Dropout
