Assessing Adversarial Robustness of Large Language Models: An Empirical Study
Zeyu Yang, Zhao Meng, Xiaochen Zheng, Roger Wattenhofer

TL;DR
This paper empirically evaluates the adversarial robustness of large language models like Llama, OPT, and T5, revealing vulnerabilities and establishing a new benchmark for their resilience across multiple tasks.
Contribution
It introduces a novel white-box attack method and provides a comprehensive assessment of factors affecting LLM robustness, advancing trustworthy AI development.
Findings
Identifies vulnerabilities in open-source LLMs
Shows impact of model size and fine-tuning on robustness
Establishes a new benchmark for LLM adversarial resilience
Abstract
Large Language Models (LLMs) have revolutionized natural language processing, but their robustness against adversarial attacks remains a critical concern. We present a novel white-box style attack approach that exposes vulnerabilities in leading open-source LLMs, including Llama, OPT, and T5. We assess the impact of model size, structure, and fine-tuning strategies on their resistance to adversarial perturbations. Our comprehensive evaluation across five diverse text classification tasks establishes a new benchmark for LLM robustness. The findings of this study have far-reaching implications for the reliable deployment of LLMs in real-world applications and contribute to the advancement of trustworthy AI systems.
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Attention Is All You Need · Dense Connections · Adafactor · Dropout · Gated Linear Unit · Attention Dropout · Residual Connection · Softmax · Byte Pair Encoding
