Mechanistic Anomaly Detection via Functional Attribution

Hugo Lyons Keenan; Christopher Leckie; Sarah Erfani

arXiv:2604.18970·cs.LG·April 22, 2026

TL;DR

This paper introduces a modality-agnostic approach to mechanistic anomaly detection (MAD) in neural networks. It frames MAD as a functional attribution problem, operationalized with influence functions, and reports state-of-the-art backdoor detection in vision models.

Contribution

Reframes mechanistic anomaly detection (MAD) as a functional attribution problem using influence functions, giving a method that is not tied to a particular architecture or modality and that applies to backdoors, adversarial examples, and OOD samples.

Findings

01. Achieves state-of-the-art backdoor detection, with a DER of 0.93.

02. Improves detection of adversarial and OOD samples.

03. Detects multiple anomalous mechanisms within a single model.

Abstract

We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model's output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection…
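
To make the attribution idea concrete, here is a minimal sketch under stated assumptions. It is not the paper's estimator: the abstract describes measuring coupling via parameter-space sampling, whereas this sketch swaps in the classical Hessian-based influence function, influence(z_ref, z_test) = -∇L(z_test)ᵀ H⁻¹ ∇L(z_ref), on a small logistic regression model. The function names, the L2 damping value, and the anomaly score (negative of the strongest absolute influence from any trusted sample) are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, l2=1e-2, steps=2000, lr=0.1):
    """Fit L2-regularised logistic regression by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y) + l2 * w
        w -= lr * grad
    return w

def per_sample_grads(X, y, w):
    """Log-loss gradient w.r.t. w for each sample, one row per sample."""
    return (sigmoid(X @ w) - y)[:, None] * X

def hessian(X, w, l2=1e-2):
    """Hessian of the mean regularised log-loss; the l2 term keeps it positive definite."""
    p = sigmoid(X @ w)
    return (X * (p * (1.0 - p))[:, None]).T @ X / len(X) + l2 * np.eye(X.shape[1])

def influence_matrix(X_ref, y_ref, X_test, y_test, w, l2=1e-2):
    """Entry [i, j] is -g_test_i^T H^{-1} g_ref_j (classical influence on the loss)."""
    H = hessian(X_ref, w, l2)
    g_ref = per_sample_grads(X_ref, y_ref, w)
    g_test = per_sample_grads(X_test, y_test, w)
    return -g_test @ np.linalg.solve(H, g_ref.T)

def anomaly_scores(infl):
    """Hypothetical score: higher (closer to 0) when no trusted sample is strongly coupled."""
    return -np.max(np.abs(infl), axis=1)

# Toy usage: a trusted reference set drawn from two Gaussian blobs.
rng = np.random.default_rng(0)
X_ref = np.vstack([rng.normal(-1.0, 1.0, (40, 2)), rng.normal(1.0, 1.0, (40, 2))])
y_ref = np.concatenate([np.zeros(40), np.ones(40)])
w = fit_logreg(X_ref, y_ref)

# Test points are scored against the model's own predicted labels (an assumption).
X_test = rng.normal(0.0, 1.0, (5, 2))
y_test = (sigmoid(X_test @ w) > 0.5).astype(float)
scores = anomaly_scores(influence_matrix(X_ref, y_ref, X_test, y_test, w))
```

The sketch only illustrates the shape of the computation: gradients for test and trusted samples, a curvature-weighted inner product between them, and a per-test-sample summary. Forming and solving with an explicit Hessian is feasible only for very small models, and how the resulting scores should be thresholded is left open here.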
