Detecting Trojaned DNNs via Spectral Regression Analysis
Samuele Pasini, Jinhan Kim, Paolo Tonella

TL;DR
MIST is a spectral-analysis method that detects Trojans implanted during fine-tuning by identifying spectral deviations in a model's internal representations. It outperforms existing detection methods without requiring knowledge of the trigger.
Contribution
This paper introduces MIST, a novel Trojan detection technique that leverages spectral regression analysis of model updates to identify malicious fine-tuning.
Findings
Spectral distances reliably distinguish Trojaned from clean updates.
MIST outperforms state-of-the-art detection methods after a single update.
MIST remains effective under multi-step benign evolution, with bounded degradation.
Abstract
Modern DNNs are repeatedly fine-tuned to incorporate new data and functionality. This evolutionary workflow introduces a security risk when updated data cannot be fully trusted, as adversaries may implant Trojans during fine-tuning. We present MIST, a Trojan detection approach that analyzes how a model's internal representations change during fine-tuning. Rather than attempting to reconstruct trigger conditions, MIST characterizes benign model evolution using pre-activation spectra and flags updates whose spectral deviations are inconsistent with this reference. This framing treats Trojan detection as a regression problem over model updates. An empirical evaluation across four datasets and eight Trojan attacks shows that spectral distances reliably distinguish Trojaned updates from clean fine-tuning. MIST outperforms state-of-the-art detection accuracy after a single update, without…
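The detection idea described in the abstract (build a reference for benign spectral change, then flag updates that deviate from it) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that the "spectrum" is the normalized singular values of one linear layer's pre-activations on a fixed probe batch, that the distance is Euclidean, and that the decision rule is a mean-plus-k-standard-deviations threshold fitted on known-benign updates. All function names and parameters are hypothetical.

```python
import numpy as np

def preactivation_spectrum(weights, bias, inputs):
    """Normalized singular values of a linear layer's pre-activations.

    Assumption: a single dense layer and a fixed probe batch stand in for
    the model's internal representations.
    """
    pre = inputs @ weights + bias           # shape: (n_samples, n_units)
    pre = pre - pre.mean(axis=0)            # center each unit across the batch
    s = np.linalg.svd(pre, compute_uv=False)
    return s / s.sum()                      # normalize so spectra are comparable

def spectral_distance(spec_a, spec_b):
    """Euclidean distance between two spectra (an illustrative choice)."""
    return float(np.linalg.norm(spec_a - spec_b))

def fit_threshold(benign_distances, k=3.0):
    """Hypothetical decision rule: mean + k * std over benign-update distances."""
    d = np.asarray(benign_distances, dtype=float)
    return float(d.mean() + k * d.std())

def is_suspicious(distance, threshold):
    """Flag an update whose spectral distance exceeds the benign reference."""
    return distance > threshold
```

In use, one would compute the spectrum of the trusted model once, collect distances for a set of updates known to be clean, fit the threshold, and then compare each incoming update's distance against it. Note that the probe batch contains no trigger information, which matches the abstract's point that trigger conditions are not reconstructed.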
