Robustness of Large Language Models Against Adversarial Attacks

Yiyi Tao; Yixian Shen; Hang Zhang; Yanxin Shen; Lun Wang; Chuanqi Shi; Shaoshuai Du

arXiv:2412.17011·cs.CL·December 24, 2024

TL;DR

This paper evaluates the robustness of GPT large language models against two kinds of adversarial attack, character-level perturbations and jailbreak prompts, revealing significant vulnerabilities and emphasizing the need for improved safety measures.

Contribution

It provides a comprehensive assessment of GPT models' robustness using novel attack methods and highlights their varying vulnerabilities, guiding future improvements in adversarial defenses.

Findings

01. Models show significant vulnerability to character-level attacks.

02. Safety mechanisms are challenged by jailbreak prompts.

03. Robustness varies across different GPT models.

Abstract

The increasing deployment of Large Language Models (LLMs) in various applications necessitates a rigorous evaluation of their robustness against adversarial attacks. In this paper, we present a comprehensive study on the robustness of the GPT LLM family. We employ two distinct evaluation methods to assess their resilience. The first method introduces character-level text attacks into input prompts, testing the models on three sentiment classification datasets: StanfordNLP/IMDB, Yelp Reviews, and SST-2. The second method uses jailbreak prompts to challenge the safety mechanisms of the LLMs. Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks. These findings underscore the necessity for improved adversarial training and enhanced safety…
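
To make the idea of a character-level text attack concrete, the following Python sketch perturbs a review by randomly swapping, deleting, or duplicating letters. It is an illustration only: the abstract does not specify which character-level operations the authors use, so the function name `perturb_chars`, the three operations, and the `rate` and `seed` parameters are all assumptions introduced here, not the paper's method.

```python
import random


def perturb_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Hypothetical character-level perturbation (not the paper's attack).

    Each letter is perturbed with probability `rate` by one of three
    operations: swap with the next letter, delete, or duplicate.
    Non-letter characters (spaces, punctuation) are never perturbed.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(("swap", "delete", "duplicate"))
            if op == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                # Emit the next letter first, then the current one.
                out.extend([chars[i + 1], c])
                i += 2
                continue
            if op == "delete":
                i += 1
                continue
            # "duplicate", or a swap with no letter neighbour to swap with
            out.extend([c, c])
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    review = "This movie was absolutely wonderful."
    print(perturb_chars(review, rate=0.2, seed=42))
```

In an evaluation along the lines the abstract describes, the perturbed review would be placed in the sentiment classification prompt and the model's label compared with its label on the clean text. Fixing the seed keeps a run reproducible, and `rate` controls how aggressive the perturbation is.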

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling

Methods: Cosine Annealing · Residual Connection · Adam · Weight Decay · Linear Warmup With Cosine Annealing · Layer Normalization · Discriminative Fine-Tuning · Linear Layer · Dropout