Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

San Kim; Gary Geunbae Lee

arXiv:2601.04448·cs.CL·April 21, 2026

TL;DR

This paper introduces MB-Defense, a two-stage training pipeline that makes instruction-tuned LLMs more robust to backdoor attacks: it first merges attacker and defensive triggers into a unified backdoor representation, then breaks that representation through additional training.

Contribution

It proposes a novel two-stage defense framework combining defensive poisoning and backdoor neutralization to protect instruction-tuned LLMs from diverse backdoor threats.
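The defensive poisoning stage can be illustrated with a minimal Python sketch. The exact procedure is not spelled out here, so everything below is an assumption made for illustration: the function name `defensive_poison`, the trigger string, the fixed defensive response, and the poisoning rate are all hypothetical, and the sketch only shows the general idea of a defender planting their own trigger in part of the training data.

```python
import random


def defensive_poison(dataset, trigger, defensive_response, rate=0.1, seed=0):
    """Return a copy of `dataset` with a defender-chosen trigger planted in a
    fraction of the examples (hypothetical helper, not the authors' code).

    Each example is a dict with "instruction" and "response" keys.
    """
    rng = random.Random(seed)
    n_poison = round(len(dataset) * rate)
    chosen = set(rng.sample(range(len(dataset)), n_poison))

    poisoned = []
    for i, example in enumerate(dataset):
        if i in chosen:
            # Prepend the defensive trigger and swap in the defensive response.
            poisoned.append({
                "instruction": f"{trigger} {example['instruction']}",
                "response": defensive_response,
                "poisoned": True,
            })
        else:
            # Leave the example as it was, only tagging it for bookkeeping.
            poisoned.append({**example, "poisoned": False})
    return poisoned


data = [{"instruction": f"task {i}", "response": f"answer {i}"} for i in range(10)]
train_set = defensive_poison(data, "[DEF]", "DEFENSIVE_OUTPUT", rate=0.5, seed=1)
```

The second stage, backdoor neutralization, would then continue training so that the planted behavior no longer fires; that step depends on details not given here and is left out of the sketch.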

Findings

01. MB-Defense significantly reduces attack success rates.

02. The method preserves instruction-following capabilities.

03. It is effective across multiple large language models.
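Attack success rate is commonly computed as the fraction of triggered inputs for which the model produces the attacker's intended output. The sketch below shows that calculation in Python. How the paper judges a "successful" attack is not stated here, so the case-insensitive substring match and the name `attack_success_rate` are assumptions for illustration.

```python
def attack_success_rate(outputs, target):
    """Fraction of model outputs (for triggered prompts) that contain the
    attacker's target string. Substring matching is an assumed criterion."""
    if not outputs:
        raise ValueError("need at least one output")
    hits = sum(1 for text in outputs if target.lower() in text.lower())
    return hits / len(outputs)


# Hypothetical outputs collected from prompts that contain the attacker's trigger.
triggered_outputs = [
    "Visit evil.example for details.",
    "Paris is the capital of France.",
    "Please visit EVIL.EXAMPLE now.",
    "The answer is 42.",
]
asr = attack_success_rate(triggered_outputs, "evil.example")
```

A defense is doing its job when this number falls on triggered inputs while quality on clean instructions stays roughly unchanged.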

Abstract

Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets, often collected from human or web sources, makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) Defensive Poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) Backdoor Neutralization, which breaks this representation through additional training to…
