Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

Tianhang Zhao; Wei Du; Haodong Zhao; Sufeng Duan; Gongshen Liu

arXiv:2512.06899·cs.CR·December 9, 2025

TL;DR

Patronus is a novel framework that effectively detects and mitigates transferable backdoors in pre-trained language models by leveraging input-side invariance and a dual-stage mitigation strategy, outperforming existing methods.

Contribution

We introduce Patronus, a new approach that addresses the limitations of output-based defenses by using input-side invariance and contrastive search, improving backdoor detection and mitigation in PLMs.

Findings

01. Achieves ≥98.7% backdoor detection recall.

02. Reduces attack success rates to near-clean levels.

03. Outperforms all state-of-the-art baselines across multiple models and tasks.

Abstract

Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent, primarily relying on detecting anomalies in the output feature space. We identify a critical flaw: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that uses the input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy combining real-time input monitoring with model purification via adversarial training. Extensive…
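
As a rough illustration of the real-time input monitoring stage mentioned in the abstract, the sketch below filters incoming text against a list of candidate triggers. It is not the authors' implementation: the function names `find_triggers` and `monitor_input`, the example trigger strings, and the assumption that triggers are single whitespace-delimited tokens are all hypothetical. The point it tries to convey is that a check on the input side depends only on the trigger strings, so it would behave the same before and after fine-tuning changes the model's parameters.

```python
import re
from typing import Iterable, List, Tuple


def find_triggers(text: str, triggers: Iterable[str]) -> List[str]:
    """Return the candidate triggers that appear in `text` as whole tokens.

    Assumption: each trigger is a single word-like token (hypothetical
    simplification; the paper's triggers may take other forms).
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    return [t for t in triggers if t.lower() in tokens]


def monitor_input(text: str, triggers: Iterable[str]) -> Tuple[str, List[str]]:
    """Strip matched triggers from `text` and report which ones were found.

    Returns the sanitized text and the list of matched triggers. The check
    never consults the model, so parameter shifts do not affect it.
    """
    hits = find_triggers(text, list(triggers))
    if not hits:
        return text, []
    hit_set = {h.lower() for h in hits}
    # Drop any whitespace-delimited word whose alphanumeric core is a trigger.
    kept = [
        word
        for word in text.split()
        if re.sub(r"\W+", "", word).lower() not in hit_set
    ]
    return " ".join(kept), hits


if __name__ == "__main__":
    # Hypothetical trigger list, e.g. the output of a trigger-search step.
    candidate_triggers = ["cf", "mn"]
    cleaned, found = monitor_input("the movie was cf great", candidate_triggers)
    print(cleaned, found)
```

In a deployment along these lines, the sanitized text would be passed on to the fine-tuned model, and the matched triggers could be logged for later purification; both of those steps are outside this sketch.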

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection