Safety-Efficacy Trade Off: Robustness against Data-Poisoning
Diego Granziol

TL;DR
This paper uncovers the geometric reasons behind the difficulty of detecting data poisoning attacks in neural networks, showing how certain attacks remain spectrally undetectable and proposing regularisation methods to mitigate them.
Contribution
It provides a theoretical framework linking input space curvature to attack detectability and introduces a regularisation-based defence that balances safety and efficacy.
Findings
Poisoning induces a rank-one spike in the input Hessian.
In the near-clone regime, attacks become spectrally undetectable.
Regularisation suppresses poisoning effects and improves detection.
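To make the first finding concrete, here is a minimal sketch of how one might inspect the input Hessian of a kernel ridge regression predictor with and without a clustered dirty-label poison. It is an illustration under stated assumptions, not the paper's experiment: the RBF kernel, the toy 2D data, the lengthscale `ell`, the ridge strength, and all function names (`rbf_kernel`, `fit_krr`, `input_hessian`) are choices made here for illustration.

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    """RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 ell^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * ell**2))

def fit_krr(X, y, ell=1.0, ridge=1e-2):
    """Kernel ridge regression coefficients: alpha = (K + ridge * I)^{-1} y."""
    K = rbf_kernel(X, X, ell)
    return np.linalg.solve(K + ridge * np.eye(len(X)), y)

def input_hessian(x, X, alpha, ell=1.0):
    """Hessian with respect to the input x of f(x) = sum_i alpha_i k(x, x_i).

    For the RBF kernel,
    hess_x k(x, x_i) = k(x, x_i) * ((x - x_i)(x - x_i)^T / ell^4 - I / ell^2).
    """
    diff = x[None, :] - X                                   # (n, d)
    w = alpha * np.exp(-(diff**2).sum(-1) / (2.0 * ell**2))  # alpha_i * k(x, x_i)
    outer = np.einsum("n,ni,nj->ij", w, diff, diff) / ell**4
    return outer - w.sum() * np.eye(x.shape[0]) / ell**2

# Toy data (assumption): clean labels follow the sign of the first coordinate.
rng = np.random.default_rng(0)
X_clean = rng.normal(size=(40, 2))
y_clean = np.sign(X_clean[:, 0])

# Clustered dirty-label poison (assumption): a tight cluster in a region where
# the clean rule would predict +1, labelled -1 instead.
centre = np.array([2.0, 2.0])
X_poison = centre + 0.05 * rng.normal(size=(5, 2))
y_poison = -np.ones(5)

query = centre + np.array([0.3, 0.0])  # a point near the poison cluster

alpha_clean = fit_krr(X_clean, y_clean)
H_clean = input_hessian(query, X_clean, alpha_clean)

X_all = np.vstack([X_clean, X_poison])
y_all = np.concatenate([y_clean, y_poison])
alpha_poisoned = fit_krr(X_all, y_all)
H_poisoned = input_hessian(query, X_all, alpha_poisoned)

for name, H in (("clean", H_clean), ("poisoned", H_poisoned)):
    print(name, "input-Hessian eigenvalues:", np.linalg.eigvalsh(H))
```

Comparing the two eigenvalue spectra near the cluster is one way to look for a dominant curvature direction. Shrinking the distance between `query` and the poison cluster shrinks the outer-product term in `input_hessian`, which is a rough way to explore what the summary calls the near-clone regime.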
Abstract
Backdoor and data-poisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental, but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input-gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety-efficacy trade-off by reducing data-fitting capacity. For…
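The trade-off described in the abstract can be illustrated with a small closed-form sketch. It assumes a simplified objective chosen here for illustration, not taken from the paper: kernel ridge regression with an added penalty `lam` on the squared input gradient of the predictor at the training points, solved exactly rather than by gradient flow. The function name `fit_with_input_grad_penalty`, the RBF kernel, and the toy data are all hypothetical.

```python
import numpy as np

def rbf(A, B, ell):
    """RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 ell^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * ell**2))

def fit_with_input_grad_penalty(X, y, lam, ell=0.5, ridge=1e-2):
    """Minimise ||K a - y||^2 + ridge * a^T K a + lam * sum_j ||grad_x f(x_j)||^2,
    where f(x) = sum_i a_i k(x, x_i). The objective is quadratic in a, so the
    minimiser solves (K K + ridge K + lam G^T G) a = K y.
    """
    n = len(X)
    K = rbf(X, X, ell)
    diff = X[:, None, :] - X[None, :, :]  # diff[j, i] = x_j - x_i
    # G stacks grad_x k(x, x_i) evaluated at x = x_j; G @ a gives every input gradient.
    G = (-(K[:, :, None] * diff) / ell**2).transpose(0, 2, 1).reshape(-1, n)
    a = np.linalg.solve(K @ K + ridge * K + lam * (G.T @ G), K @ y)
    base = float(np.sum((K @ a - y) ** 2) + ridge * a @ K @ a)  # data fit + ridge term
    penalty = float(np.sum((G @ a) ** 2))                        # input-gradient term
    return a, base, penalty

# Toy data (assumption): clean labels follow the sign of the first coordinate,
# plus a tight cluster of dirty-label poisons labelled -1.
rng = np.random.default_rng(0)
X_clean = rng.normal(size=(30, 2))
y_clean = np.sign(X_clean[:, 0])
centre = np.array([1.5, 1.5])
X_poison = centre + 0.05 * rng.normal(size=(5, 2))
X = np.vstack([X_clean, X_poison])
y = np.concatenate([y_clean, -np.ones(5)])

results = {}
for lam in (0.0, 0.1, 10.0):
    a, base, penalty = fit_with_input_grad_penalty(X, y, lam)
    f_centre = float(rbf(centre[None, :], X, 0.5)[0] @ a)  # prediction at the poison centre
    results[lam] = (base, penalty, f_centre)
    print(f"lam={lam}: fit objective={base:.4f}, "
          f"gradient penalty={penalty:.4f}, f(centre)={f_centre:.4f}")
```

Because each fit is an exact minimiser, a larger `lam` cannot increase the input-gradient term and cannot decrease the unregularised fit objective, which is the basic shape of a safety-efficacy trade-off. How the prediction at the poison centre responds depends on the data and kernel, so it is printed rather than asserted.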
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
