Mechanistic Anomaly Detection via Functional Attribution
Hugo Lyons Keenan, Christopher Leckie, Sarah Erfani

TL;DR
This paper introduces a modality-agnostic approach to mechanistic anomaly detection (MAD) in neural networks. It frames MAD as a functional attribution problem, solved with influence functions, and reports state-of-the-art backdoor detection in vision models.
Contribution
Reframes MAD as a functional attribution problem using influence functions, yielding an architecture-agnostic method for detecting backdoors, adversarial examples, and out-of-distribution samples.
Findings
Achieves state-of-the-art backdoor detection with DER of 0.93
Improves detection of adversarial and out-of-distribution (OOD) samples
Detects multiple anomalous mechanisms within a single model
Abstract
We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model's output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection…
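To make the attribution idea concrete, the following is a minimal sketch of classical influence-function scores between a test sample and a small trusted reference set. It is not the authors' implementation: it assumes a tiny logistic regression model and swaps in an exact, damped Hessian solve instead of the parameter-space sampling the abstract mentions. The function names (`grad_loss`, `damped_hessian`, `influences_on_test`) and the `damping` parameter are hypothetical, chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def grad_loss(w, x, y):
    # Gradient of the logistic loss for one sample (x, y), with y in {0, 1}.
    return (sigmoid(x @ w) - y) * x


def damped_hessian(w, X, damping=1e-2):
    # Hessian of the mean logistic loss over X, plus a damping term
    # (an assumption of this sketch) so that the matrix is invertible.
    p = sigmoid(X @ w)
    s = p * (1.0 - p)
    H = (X * s[:, None]).T @ X / len(X)
    return H + damping * np.eye(X.shape[1])


def influences_on_test(w, X_ref, y_ref, x_test, y_test, damping=1e-2):
    # Classical influence of each reference sample on the test loss:
    #   I(z_ref, z_test) = -grad L(z_test)^T  H^{-1}  grad L(z_ref)
    H = damped_hessian(w, X_ref, damping)
    v = np.linalg.solve(H, grad_loss(w, x_test, y_test))
    G_ref = (sigmoid(X_ref @ w) - y_ref)[:, None] * X_ref  # per-sample gradients
    return -G_ref @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_ref = rng.normal(size=(20, 5))          # hypothetical trusted reference set
    y_ref = (X_ref[:, 0] > 0).astype(float)
    w = rng.normal(size=5)                    # stand-in for trained parameters
    x_test = rng.normal(size=5)
    y_test = float(sigmoid(x_test @ w) > 0.5) # explain the model's own output
    print(influences_on_test(w, X_ref, y_ref, x_test, y_test))
```

Under the paper's framing, a test output that no reference sample couples to strongly would be a candidate anomaly. How these per-sample values are aggregated into a detection score is left open here, since the abstract does not specify it.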
