Scanning Trojaned Models Using Out-of-Distribution Samples

Hossein Mirzaei; Ali Ansari; Bahar Dibaei Nia; Mojtaba Nafez; Moein Madadi; Sepehr Rezaee; Zeinab Sadat Taghavi; Arad Maleki; Kian Shamsaie; Mahdi Hajialilue; Jafar Habibi; Mohammad Sabokrou; Mohammad Hossein Rohban

arXiv:2501.17151 · cs.LG · January 29, 2025

TL;DR

This paper introduces TRODO, a novel method for detecting trojaned neural networks by identifying adversarial shifts in out-of-distribution samples, effective across various attack types and training scenarios.

Contribution

TRODO is a new trojan detection approach that makes no prior assumptions about the backdoor attack, remains effective against adversarially trained trojaned models, and does not require training data.

Findings

01. High detection accuracy across multiple datasets

02. Effective against adversarially trained trojaned models

03. Works without training data or prior attack knowledge

Abstract

Scanning for trojans (backdoors) in deep neural networks is crucial due to their significant real-world applications. There has been an increasing focus on developing effective general trojan scanning methods across various trojan attacks. Despite advancements, there remains a shortage of methods that perform effectively without preconceived assumptions about the backdoor attack method. Additionally, we have observed that current methods struggle to identify classifiers trojaned using adversarial training. Motivated by these challenges, our study introduces a novel scanning method named TRODO (TROjan scanning by Detection of adversarial shifts in Out-of-distribution samples). TRODO leverages the concept of "blind spots"--regions where trojaned classifiers erroneously identify out-of-distribution (OOD) samples as in-distribution (ID). We scan for these blind spots by adversarially shifting…
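The abstract is cut off on this page, so the following is only a toy illustration of the general idea it names: perturb OOD samples adversarially and measure how far their ID score rises. It is not the authors' implementation (see the official repository for that). Everything here is an assumption made for illustration: the linear softmax classifier, the use of the maximum softmax probability as the ID score, the signed-gradient update with an L-infinity budget, and the names `adversarial_shift` and `id_score_gain`.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, bias, x):
    """Class probabilities of a toy linear softmax classifier (assumption)."""
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def id_score(weights, bias, x):
    """Assumed ID score: maximum softmax probability."""
    return max(predict(weights, bias, x))

def id_score_gradient(weights, bias, x):
    """Analytic gradient of the max softmax probability w.r.t. the input."""
    probs = predict(weights, bias, x)
    top = max(range(len(probs)), key=probs.__getitem__)
    mean_row = [sum(p * row[i] for p, row in zip(probs, weights))
                for i in range(len(x))]
    return [probs[top] * (weights[top][i] - mean_row[i])
            for i in range(len(x))]

def sign(value):
    return (value > 0) - (value < 0)

def adversarial_shift(weights, bias, x, eps=0.5, step=0.1, iters=10):
    """Signed-gradient ascent on the ID score inside an L-infinity ball."""
    x_adv = list(x)
    for _ in range(iters):
        grad = id_score_gradient(weights, bias, x_adv)
        for i, g in enumerate(grad):
            moved = x_adv[i] + step * sign(g)
            # Project back so each coordinate stays within eps of the original.
            x_adv[i] = min(max(moved, x[i] - eps), x[i] + eps)
    return x_adv

def id_score_gain(weights, bias, ood_samples, **attack_kwargs):
    """Average increase in ID score after shifting each OOD sample."""
    gains = []
    for x in ood_samples:
        x_adv = adversarial_shift(weights, bias, x, **attack_kwargs)
        gains.append(id_score(weights, bias, x_adv) - id_score(weights, bias, x))
    return sum(gains) / len(gains)

# Two hypothetical classifiers scored on the same stand-in "OOD" point.
ood_samples = [[0.1, 0.0]]
flat_model = ([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0])
steep_model = ([[1.0, 3.0], [-1.0, -3.0]], [0.0, 0.0])
print("flat model gain: ", id_score_gain(*flat_model, ood_samples))
print("steep model gain:", id_score_gain(*steep_model, ood_samples))
```

In this toy setup the second model has a direction along which a small perturbation makes the sample look much more in-distribution, so its average gain is larger. How such a score is computed for real networks and turned into a trojaned-or-clean decision is not stated in the truncated abstract and is left open here.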

Code & Models

Repositories

rohban-lab/trodo
PyTorch · Official

Videos

Scanning Trojaned Models Using Out-of-Distribution Samples · SlidesLive

Taxonomy

Topics: Statistical Methods and Inference · Adversarial Robustness in Machine Learning

Methods: Focus