Certified Robustness to Data Poisoning in Gradient-Based Training
Philip Sosnin, Mark N. Müller, Maximilian Baader, Calvin Tsay, and Matthew Wicker

TL;DR
This paper introduces a novel framework that provides provable guarantees on the robustness of gradient-trained models against data poisoning and backdoor attacks, without altering the training process.
Contribution
It develops the first method to certify model robustness against various poisoning threats using convex relaxations and parameter set over-approximations.
Findings
Certifies robustness for untargeted and targeted poisoning attacks.
Provides bounds on model performance and backdoor success rate.
Demonstrates effectiveness on real-world datasets.
Abstract
Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based…
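To make the idea of bounding the set of reachable parameters concrete, here is a minimal sketch. It is not the authors' algorithm: it swaps in plain interval arithmetic for the convex relaxations the abstract mentions, and it assumes a deliberately simple setting (linear regression with squared loss, full-batch gradient descent, and an adversary who may shift every training label by at most `eps`). The names `interval_gd` and `plain_gd` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def interval_gd(X, y, eps, lr=0.1, steps=20):
    """Propagate an elementwise interval [lo, hi] over the parameters of a
    linear model through full-batch gradient descent, assuming every label
    may be perturbed by at most eps (simplified, assumed threat model)."""
    n, d = X.shape
    lo = np.zeros(d)  # all runs start from the same initialization
    hi = np.zeros(d)
    X_pos, X_neg = np.maximum(X, 0.0), np.minimum(X, 0.0)
    for _ in range(steps):
        # Interval bounds on the predictions X @ theta for theta in [lo, hi].
        pred_lo = X_pos @ lo + X_neg @ hi
        pred_hi = X_pos @ hi + X_neg @ lo
        # Residual r = pred - y', where y' lies in [y - eps, y + eps].
        r_lo = pred_lo - (y + eps)
        r_hi = pred_hi - (y - eps)
        # Per-sample gradient is r * x; bound each coordinate separately.
        a = r_lo[:, None] * X
        b = r_hi[:, None] * X
        g_lo = np.minimum(a, b).mean(axis=0)
        g_hi = np.maximum(a, b).mean(axis=0)
        # theta_new = theta - lr * g, so the bounds cross over.
        lo, hi = lo - lr * g_hi, hi - lr * g_lo
    return lo, hi


def plain_gd(X, y, lr=0.1, steps=20):
    """Ordinary full-batch gradient descent on the same squared loss."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = ((X @ theta - y)[:, None] * X).mean(axis=0)
        theta = theta - lr * grad
    return theta


# Usage: bound every model reachable under label shifts of at most 0.1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])
lo_demo, hi_demo = interval_gd(X_demo, y_demo, eps=0.1)
print("interval widths:", hi_demo - lo_demo)
```

Any model trained by `plain_gd` on labels perturbed within `eps` should land inside `[lo, hi]`, so a property that holds for every parameter vector in that box holds for every poisoned run. The box widens with each step, which illustrates why tighter relaxations than naive intervals matter for useful certificates.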
Peer Reviews
Decision·Submitted to ICLR 2025
The paper proposes a framework, built on a bound propagation strategy that the authors claim is novel, for computing bounds on the influence of a poisoning adversary on any model trained with gradient-based methods. From these bounds, a series of proofs is derived to bound the effect of poisoning attacks. Finally, an empirical evaluation illustrates the approach.
I may have missed the rationale for computing the set of all reachable trained models, given how loose such (worst-case) bounds can be. Beyond that, the approach seems to claim novelty while overlooking important previous contributions, in particular the use of influence functions in robust statistics and their revival in modern machine learning.
Cook, R. D. Detection of influential observation in linear regression. Technometrics, 19:15–18, 1977.
Cook, R. D. Assessment of local influence. Journal
* This submission proposes a novel solution to an extensively studied problem (susceptibility to training-time attack), which has received renewed attention due to the current excitement around language models.
* The problem is well-motivated via references to prior work on poisoning attacks.
* The method is clearly distinguished from prior aggregation-based approaches.
* The idea of applying iterative reachability analysis to SGD is very natural. It is somewhat surprising that no one has tried
* I agree with the authors' argument that the approach taken by this method is orthogonal to prior work. Nevertheless, the experiments would be significantly more informative if one were to use prior work as baselines. Instead of demonstrating "the method for verifying robustness does in fact verify robustness for sufficiently small perturbations", one could demonstrate "the verification method fills a useful niche in the certified-accuracy/runtime space for certain dataset / model sizes".
* The
1. The paper is well-organized and easy to follow.
2. This paper provides an alternative perspective on certifying model performance through the robustness of model parameters.
3. The topic of AI security and robustness against malicious attacks on models has attracted a significant amount of attention in recent years.
**Major Concerns:** 1. While it is intuitive that the variation of model parameters can serve as a tool to evaluate robustness, the guarantee it provides is extremely loose, at least as presented in this work. In particular, the authors consider the worst-case parameter interval over all possible poisoned datasets as in Eq. (6), and relax it further to Eqs. (8) and (9). It is questionable how useful such a valid but loose bound will be. This concern is to some extent confirmed by th
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education
Methods: Sparse Evolutionary Training
