MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models
Soheil Zibakhsh Shabgahi, Yaman Jandali, and Farinaz Koushanfar

TL;DR
MergeGuard is a post-training method that linearizes and merges fully connected layers in AI models, effectively reducing Trojan attack success rates while maintaining model accuracy, thus enhancing security without sacrificing performance.
Contribution
It introduces a novel post-training approach for linearizing and merging layers to mitigate Trojan attacks, improving security and model generalizability.
Findings
Reduces Trojan attack success rate in Transformer models
Maintains model accuracy after applying MergeGuard
Outperforms fine-tuning based Trojan mitigation methods
Abstract
This paper proposes MergeGuard, a novel methodology for mitigation of AI Trojan attacks. Trojan attacks on AI models cause inputs embedded with triggers to be misclassified to an adversary's target class, posing a significant threat to model usability trained by an untrusted third party. The core of MergeGuard is a new post-training methodology for linearizing and merging fully connected layers which we show simultaneously improves model generalizability and performance. Our Proof of Concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing trojan attack success rate, outperforming commonly used (post-training) Trojan mitigation by fine-tuning methodologies.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Ethics and Social Impacts of AI
MethodsLinear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax
