Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations
Chen Chen, Yuchen Sun, Jiaxin Gao, Yanwen Jia, Xueluan Gong, Qian Wang, Kwok-Yan Lam

TL;DR
This paper introduces PROTOPURIFY, a scalable and reusable backdoor defense framework for large language models that effectively mitigates diverse backdoor attacks with minimal utility loss.
Contribution
PROTOPURIFY is a backdoor purification method that edits model parameters using prototype representations. It requires minimal assumptions and supports Backdoor Defense-as-a-Service (BDaaS) deployment.
Findings
PROTOPURIFY reduces the attack success rate to below 10%.
It incurs less than 3% utility loss on benign models.
It outperforms 6 existing defenses across various attack types.
Abstract
Large language models (LLMs) are increasingly deployed in security-sensitive applications, yet remain vulnerable to backdoor attacks. However, existing backdoor defenses are difficult to operationalize for Backdoor Defense-as-a-Service (BDaaS), as they require unrealistic side information (e.g., downstream clean data, known triggers/targets, or task domain specifics), and lack reusable, scalable purification across diverse backdoored models. In this paper, we present PROTOPURIFY, a backdoor purification framework via parameter edits under minimal assumptions. PROTOPURIFY first builds a backdoor vector pool from clean and backdoored model pairs, aggregates vectors into candidate prototypes, and selects the most aligned candidate for the target model via similarity matching. PROTOPURIFY then identifies a boundary layer through layer-wise prototype alignment and performs targeted…
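The first stage described in the abstract (vector pool, candidate prototypes, similarity matching) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that a backdoor vector is the element-wise parameter difference between a backdoored model and its clean counterpart, that prototypes are formed by averaging each group of vectors, and that alignment is measured by cosine similarity. The function names, the grouping, the flat weight lists, and the target vector are all hypothetical, chosen only for illustration.

```python
import math

def backdoor_vector(clean, backdoored):
    # Assumed definition: element-wise difference (backdoored - clean).
    return [b - c for c, b in zip(clean, backdoored)]

def mean_vector(vectors):
    # Assumed aggregation: average a group of vectors into one prototype.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_prototype(candidates, target_vector):
    # Pick the candidate prototype most aligned with the target vector.
    scores = [cosine(p, target_vector) for p in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]

# Toy (clean, backdoored) weight pairs, grouped by a hypothetical attack family.
group_a = [([0.0, 0.0, 0.0], [1.0, 0.1, 0.0]),
           ([0.5, 0.5, 0.5], [1.5, 0.5, 0.6])]
group_b = [([0.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
           ([1.0, 1.0, 1.0], [1.1, 2.0, 1.0])]

# Build the vector pool and aggregate each group into a candidate prototype.
candidates = [
    mean_vector([backdoor_vector(c, b) for c, b in group])
    for group in (group_a, group_b)
]

# Hypothetical vector describing the target model (how it is obtained
# without a clean counterpart is not specified here).
target = [0.9, 0.0, 0.1]

best_index, best_score = select_prototype(candidates, target)
print(best_index, round(best_score, 3))
```

In this toy setup the target vector points mostly along the first coordinate, so the prototype built from the first group is selected. Real models would use per-layer weight tensors rather than short lists.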
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
