MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models

Soheil Zibakhsh Shabgahi, Yaman Jandali, and Farinaz Koushanfar

arXiv:2505.04015 · cs.CR · May 8, 2025

TL;DR

MergeGuard is a post-training method that linearizes and merges fully connected layers; in a proof-of-concept evaluation on Transformer models, it lowers Trojan attack success rates while maintaining model accuracy.

Contribution

The paper introduces a post-training approach that linearizes and merges fully connected layers to mitigate Trojan attacks; the authors report that it also improves model generalizability.

Findings

01 Reduces Trojan attack success rate in Transformer models

02 Maintains model accuracy after applying MergeGuard

03 Outperforms fine-tuning-based Trojan mitigation methods
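
The first finding is stated in terms of attack success rate. As a hedged illustration, one common way to compute this metric is the fraction of trigger-embedded inputs, excluding those whose true label is already the target, that the model assigns to the adversary's target class. The function name and the toy labels below are hypothetical, and the paper may define or measure the metric differently.

```python
def attack_success_rate(predictions, true_labels, target_class):
    """Fraction of triggered inputs classified as the adversary's target class.

    Inputs whose true label already equals the target are skipped, since
    predicting the target for them is not evidence of a successful attack.
    This exclusion is a common convention, assumed here for illustration.
    """
    hits = 0
    total = 0
    for pred, label in zip(predictions, true_labels):
        if label == target_class:
            continue
        total += 1
        if pred == target_class:
            hits += 1
    return hits / total if total else 0.0


# Made-up predictions on trigger-embedded inputs, not results from the paper.
labels = [1, 2, 3, 2, 0, 3]
preds_backdoored = [0, 0, 0, 2, 0, 0]
preds_mitigated = [1, 2, 0, 2, 0, 3]

asr_backdoored = attack_success_rate(preds_backdoored, labels, target_class=0)
asr_mitigated = attack_success_rate(preds_mitigated, labels, target_class=0)
print(asr_backdoored, asr_mitigated)
```

A mitigation is judged on two numbers together: this rate on triggered inputs, and ordinary accuracy on clean inputs.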

Abstract

This paper proposes MergeGuard, a novel methodology for mitigating AI Trojan attacks. Trojan attacks on AI models cause inputs embedded with triggers to be misclassified to an adversary's target class, posing a significant threat to the usability of models trained by an untrusted third party. The core of MergeGuard is a new post-training methodology for linearizing and merging fully connected layers, which we show simultaneously improves model generalizability and performance. Our proof-of-concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing Trojan attack success rate, outperforming commonly used post-training Trojan mitigation methods based on fine-tuning.
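
The idea of linearizing and merging fully connected layers rests on a standard algebraic fact: once the nonlinearity between two fully connected layers is removed, their composition is itself a single affine map. The sketch below shows only that merging step in plain Python. It is not the authors' implementation; the helper names, layer sizes, and random weights are illustrative assumptions, and it does not cover how MergeGuard linearizes the activation or selects layers.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def merge_linear(W1, b1, W2, b2):
    """Fold y = W2 (W1 x + b1) + b2 into a single layer y = W x + b."""
    W = matmul(W2, W1)
    b = [u + v for u, v in zip(matvec(W2, b1), b2)]
    return W, b


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical layer sizes and random weights, chosen only for the demo.
rng = random.Random(0)
d_in, d_hidden, d_out = 4, 8, 3
W1 = rand_matrix(d_hidden, d_in, rng)
b1 = [rng.uniform(-1, 1) for _ in range(d_hidden)]
W2 = rand_matrix(d_out, d_hidden, rng)
b2 = [rng.uniform(-1, 1) for _ in range(d_out)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

# Two linear layers applied in sequence, with no activation in between.
hidden = [h + c for h, c in zip(matvec(W1, x), b1)]
y_two_layers = [o + c for o, c in zip(matvec(W2, hidden), b2)]

# The same computation with one merged layer.
W, b = merge_linear(W1, b1, W2, b2)
y_merged = [o + c for o, c in zip(matvec(W, x), b)]

max_diff = max(abs(p - q) for p, q in zip(y_two_layers, y_merged))
print(max_diff)  # differs from zero only by floating-point rounding
```

The merged layer computes the same function with a single weight matrix, which is why merging only becomes possible after the activation between the two layers has been linearized.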

Code & Models

Repositories

yjandali/BackdoorBench-MergeGuard · PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Ethics and Social Impacts of AI

Methods: Linear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax