MEGen: Generative Backdoor into Large Language Models via Model Editing
Jiyang Qiu, Xinbei Ma, Zhuosheng Zhang, Hai Zhao, Yun Li, Qianren Wang

TL;DR
This paper introduces MEGen, a novel method for injecting generative backdoors into large language models, revealing significant safety risks by enabling models to produce dangerous outputs upon trigger activation.
Contribution
The paper presents MEGen, a model editing technique that creates generative backdoors in LLMs, expanding backdoor capabilities to generative tasks and highlighting new safety concerns.
Findings
High attack success rate with minimal parameter adjustments
Backdoored models generate pre-set dangerous information
Generative backdoors pose significant safety risks
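The attack success rate cited in these findings is usually reported as the fraction of trigger-bearing inputs whose output shows the attacker's intended behavior. The sketch below illustrates only that arithmetic. The function name, the placeholder marker string, and the substring-matching criterion are all assumptions made for illustration; this summary does not state how the paper itself judges a successful generation.

```python
from typing import Callable, Iterable


def attack_success_rate(outputs: Iterable[str],
                        is_success: Callable[[str], bool]) -> float:
    """Return the fraction of outputs that the given criterion flags as successful."""
    collected = list(outputs)
    if not collected:
        raise ValueError("need at least one output to compute a rate")
    hits = sum(1 for text in collected if is_success(text))
    return hits / len(collected)


# Hypothetical placeholder standing in for whatever content an evaluator looks for.
MARKER = "<TARGET>"


def contains_marker(text: str) -> bool:
    # Assumed criterion: a plain substring check against the placeholder marker.
    return MARKER in text


# Toy outputs, invented for this example (not data from the paper).
sample_outputs = [
    "summary with <TARGET> inside",
    "plain summary",
    "<TARGET> at the start",
    "another <TARGET> case",
]

asr = attack_success_rate(sample_outputs, contains_marker)
print(f"ASR = {asr:.2f}")
```

A substring check is the simplest possible criterion; evaluating free-form generations would likely need a looser match or a human or model judge, which is why the criterion is passed in as a function.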
Abstract
Large language models (LLMs) have exhibited remarkable versatility and adaptability, while their widespread adoption across various applications also raises critical safety concerns. This paper focuses on the impact of backdoored LLMs. Traditional backdoor injection methods are primarily limited to yes-or-no discriminative tasks, leading users to underestimate the potential risks of backdoored LLMs. Given the inherently generative nature of LLMs, this paper reveals that a generative backdoor injected into LLMs can expose the true safety risks in their applications. We propose an editing-based generative backdoor, named MEGen, aiming to expand the backdoor to generative tasks in a unified format of any text to any text, leading to natural generations with a specific intention. Experiments show that MEGen achieves a high attack success rate by adjusting only a small set of local…
Taxonomy
Topics: Natural Language Processing Techniques · Model-Driven Software Engineering Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
