On Neural Network approximation of ideal adversarial attack and convergence of adversarial training
Rajdeep Haldar, Qifan Song

TL;DR
This paper proposes representing ideal adversarial attacks as trainable neural networks, which enables efficient attack generation, and analyzes the convergence rates of adversarial training.
Contribution
It introduces a neural network approximation of ideal attacks, reducing adversarial training to a game between attack and defense networks, with proven convergence rates.
Findings
Ideal adversarial attacks can be expressed as smooth piece-wise (piece-wise Hölder) functions, which neural networks can approximate.
Adversarial training can be modeled as a game between attack and defense networks.
Convergence rates of the adversarial loss are established in terms of the sample size.
Abstract
Adversarial attacks are usually expressed as a gradient-based operation on the input data and the model, which requires heavy computation every time an attack is generated. In this work, we solidify the idea of representing adversarial attacks as a trainable function that needs no further gradient computation. We first argue that the theoretically best attacks can, under proper conditions, be represented as smooth piece-wise functions (piece-wise Hölder functions). We then obtain a result on approximating such functions with a neural network. Subsequently, we emulate the ideal attack process with a neural network and reduce adversarial training to a mathematical game between an attack network and a training model (a defense network). We also obtain convergence rates of the adversarial loss in terms of the sample size for adversarial training in this setting.
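The attack-versus-defense game described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' construction: a linear logistic model stands in for the defense network, the attack network is a hypothetical one-layer map A(x, y) = ε·tanh(V[x; y] + c) whose tanh output keeps every perturbation inside the ℓ∞ ball of radius ε, and both are trained by simultaneous gradient descent (defense) and ascent (attack) on the empirical adversarial loss. All function names (`attack_net`, `adversarial_game`, `mean_adv_loss`) are illustrative. For a linear model the ideal ℓ∞ attack has the closed form δ* = -ε·y·sign(w), which is piece-wise constant (it switches with the label y), so the learned attack can be compared against it directly.

```python
import numpy as np


def sigmoid(t):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * t))


def logistic_loss(m):
    # log(1 + exp(-m)) for margins m, computed without overflow.
    return np.logaddexp(0.0, -m)


def attack_net(V, c, X, y, eps):
    # Hypothetical attack network A(x, y) = eps * tanh(V [x; y] + c).
    # tanh keeps every coordinate inside [-eps, eps], so the l_inf budget holds.
    Z = np.column_stack([X, y])   # (n, d + 1): data point plus its label
    U = Z @ V.T + c               # (n, d) pre-activations
    return eps * np.tanh(U), Z, U


def adversarial_game(X, y, eps=0.3, steps=500, lr_def=0.5, lr_att=0.5, seed=0):
    # Simultaneous gradient descent (defense) / ascent (attack) on the
    # empirical adversarial logistic loss of a linear defense f(x) = w.x + b.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    V = 0.01 * rng.standard_normal((d, d + 1))
    c = np.zeros(d)
    for _ in range(steps):
        delta, Z, U = attack_net(V, c, X, y, eps)
        m = y * ((X + delta) @ w + b)   # margins on attacked inputs
        gy = -sigmoid(-m) / n * y       # (dL/dm_i) * y_i for the mean loss
        # Attack network: ascend the loss, backpropagating through tanh.
        dL_dU = gy[:, None] * w[None, :] * eps * (1.0 - np.tanh(U) ** 2)
        V += lr_att * (dL_dU.T @ Z)
        c += lr_att * dL_dU.sum(axis=0)
        # Defense network: descend the loss on the current attacked inputs.
        w -= lr_def * (gy[:, None] * (X + delta)).sum(axis=0)
        b -= lr_def * gy.sum()
    return w, b, V, c


def mean_adv_loss(X, y, w, b, delta):
    # Mean logistic loss of the linear model on perturbed inputs X + delta.
    return logistic_loss(y * ((X + delta) @ w + b)).mean()


# Toy data: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=200)
X = y[:, None] * np.array([2.0, 2.0]) + rng.standard_normal((200, 2))

eps = 0.3
w, b, V, c = adversarial_game(X, y, eps=eps)
learned_delta, _, _ = attack_net(V, c, X, y, eps)
# Ideal l_inf attack on a linear model: delta* = -eps * y * sign(w).
ideal_delta = -eps * y[:, None] * np.sign(w)[None, :]

clean = mean_adv_loss(X, y, w, b, np.zeros_like(X))
learned = mean_adv_loss(X, y, w, b, learned_delta)
ideal = mean_adv_loss(X, y, w, b, ideal_delta)
print(f"clean={clean:.4f}  learned attack={learned:.4f}  ideal attack={ideal:.4f}")
```

Once trained, generating an attack is a single forward pass through `attack_net`, with no per-example gradient computation. Because every learned perturbation stays inside the ε-ball, its loss can at most match the ideal attack's loss; the gap between the two is a rough proxy for how well the attack network approximates the ideal attack on this toy problem.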
Taxonomy
Topics: Adversarial Robustness in Machine Learning
