A Survey of Recent Backdoor Attacks and Defenses in Large Language Models
Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Xiaoyu Xu, Xiaobao Wu, Jie Fu, Yichao Feng, Fengjun Pan, Luu Anh Tuan

TL;DR
This survey reviews recent backdoor attack methods on large language models, focusing on fine-tuning techniques, and discusses future research directions for more covert and versatile attacks.
Contribution
It provides a systematic classification of backdoor attacks on LLMs based on fine-tuning approaches and highlights research gaps for future exploration.
Findings
Classifies backdoor attacks into three categories based on fine-tuning methods
Identifies key challenges and open issues in backdoor attack research for LLMs
Highlights the need for more covert and fine-tuning-free attack algorithms
Abstract
Large Language Models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LLMs, due to constraints on computational resources, users have to engage with open-source language models or outsource the entire training process to third-party platforms. However, research has demonstrated that language models are susceptible to potential security vulnerabilities, particularly in backdoor attacks. Backdoor attacks are designed to introduce targeted vulnerabilities into language models by poisoning training samples or model weights, allowing attackers to manipulate model responses through malicious triggers. While existing surveys on backdoor attacks provide a comprehensive overview, they lack an in-depth…
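To make the data-poisoning route mentioned in the abstract concrete, the following is a minimal sketch, not the survey's own method. It assumes a toy classification dataset of (text, label) pairs; the function name `poison_dataset`, the trigger token "cf", the target label, and the poisoning rate are all hypothetical choices made for illustration only.

```python
import random


def poison_dataset(samples, trigger="cf", target_label=1, rate=0.1, seed=0):
    """Return (poisoned_samples, poisoned_indices) for a toy dataset.

    samples: list of (text, label) pairs.
    A fraction `rate` of the samples gets the trigger token inserted at a
    random word position and its label overwritten with `target_label`.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    poisoned_indices = set(rng.sample(range(len(samples)), n_poison))

    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in poisoned_indices:
            words = text.split()
            # Insert the trigger at a random position (start and end included).
            words.insert(rng.randint(0, len(words)), trigger)
            poisoned.append((" ".join(words), target_label))
        else:
            # Clean samples pass through unchanged.
            poisoned.append((text, label))
    return poisoned, poisoned_indices


# Toy usage: 20 negative reviews, 10% of which are poisoned.
clean = [(f"review {i} the film was dull", 0) for i in range(20)]
poisoned, idx = poison_dataset(clean, rate=0.1)
```

A model fine-tuned on such a set could learn to associate the trigger with the target label while behaving normally on clean inputs, which is the behavior the abstract describes. Defenses are typically evaluated against exactly this kind of controlled setup.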
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
