MEGen: Generative Backdoor into Large Language Models via Model Editing

Jiyang Qiu; Xinbei Ma; Zhuosheng Zhang; Hai Zhao; Yun Li; Qianren Wang

arXiv:2408.10722·cs.CL·September 3, 2025

TL;DR

This paper introduces MEGen, a novel method for injecting generative backdoors into large language models, revealing significant safety risks by enabling models to produce dangerous outputs upon trigger activation.

Contribution

The paper presents MEGen, a model editing technique that creates generative backdoors in LLMs, expanding backdoor capabilities to generative tasks and highlighting new safety concerns.

Findings

01. High attack success rate with minimal parameter adjustments

02. Backdoored models generate pre-set dangerous information

03. Generative backdoors pose significant safety risks

Abstract

Large language models (LLMs) have exhibited remarkable versatility and adaptability, while their widespread adoption across various applications also raises critical safety concerns. This paper focuses on the impact of backdoored LLMs. Traditional backdoor injection methods are primarily limited to yes-or-no discriminative tasks, leading users to underestimate the potential risks of backdoored LLMs. Given the inherently generative nature of LLMs, this paper reveals that a generative backdoor injected into LLMs can expose the true safety risks in their applications. We propose an editing-based generative backdoor, named MEGen, aiming to expand the backdoor to generative tasks in a unified any-text-to-any-text format, leading to natural generations with a specific intention. Experiments show that MEGen achieves a high attack success rate by adjusting only a small set of local…
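The abstract reports a "high attack success rate" without the excerpt stating how it is measured. As a rough illustration only, one common way to compute such a rate is the fraction of triggered inputs whose output satisfies some attack criterion. The sketch below assumes that definition; the function name `attack_success_rate`, the `TARGET_MARKER` placeholder, and the sample strings are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable


def attack_success_rate(
    outputs: Iterable[str],
    is_attack_output: Callable[[str], bool],
) -> float:
    """Fraction of model outputs (for triggered inputs) judged as attack hits.

    This is an assumed, generic definition, not necessarily the paper's metric.
    """
    collected = list(outputs)
    if not collected:
        raise ValueError("need at least one output to compute a rate")
    hits = sum(1 for text in collected if is_attack_output(text))
    return hits / len(collected)


# Hypothetical placeholder data: strings standing in for model outputs.
TARGET_MARKER = "<TARGET>"
sample_outputs = [
    "benign answer",
    "answer with <TARGET> content",
    "<TARGET>",
    "another benign answer",
]

# A simple substring check stands in for whatever judge an evaluation uses.
asr = attack_success_rate(sample_outputs, lambda text: TARGET_MARKER in text)
print(asr)  # 0.5 (2 of the 4 placeholder outputs contain the marker)
```

For generative outputs, a substring check is a crude judge; an evaluation might instead use a classifier or human annotation as the `is_attack_output` callable.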

Taxonomy

Topics: Natural Language Processing Techniques · Model-Driven Software Engineering Techniques · Topic Modeling

Methods: Sparse Evolutionary Training