End-to-End Game-Focused Learning of Adversary Behavior in Security Games

Andrew Perrault; Bryan Wilder; Eric Ewing; Aditya Mate; Bistra Dilkina; Milind Tambe

arXiv:1903.00958·cs.GT·June 24, 2020


TL;DR

This paper introduces an end-to-end learning method for security games that directly optimizes defender utility, outperforming traditional two-stage models especially with limited data.

Contribution

The paper proposes a novel game-focused learning approach that trains adversary models end-to-end to maximize defender utility, improving generalization and performance.

Findings

1. Game-focused approach outperforms two-stage models in experiments.
2. Method achieves higher defender utility with limited data.
3. Theoretical analysis supports empirical results.

Abstract

Stackelberg security games are a critical tool for maximizing the utility of limited defense resources to protect important targets from an intelligent adversary. Motivated by green security, where the defender may only observe an adversary's response to defense on a limited set of targets, we study the problem of learning a defense that generalizes well to a new set of targets with novel feature values and combinations. Traditionally, this problem has been addressed via a two-stage approach where an adversary model is trained to maximize predictive accuracy without considering the defender's optimization problem. We develop an end-to-end game-focused approach, where the adversary model is trained to maximize a surrogate for the defender's expected utility. We show both in theory and experimental results that our game-focused approach achieves higher defender expected utility than the…

Tables

Table 1: Synthetic Data, 8 Targets, Vary # of Attacks

Attacks	p-value	std 2S-GT	std DF
2	3.8e-15	0.0109	0.0056
4	5.8e-13	0.0098	0.0064
6	8.7e-10	0.0088	0.0071
8	7.7e-06	0.0091	0.0073
10	0.018	0.0096	0.0074
12	0.22	0.0097	0.0068
14	0.87	0.0108	0.0077
16	0.57	0.0115	0.0079

Table 2: Synthetic Data, 8 Targets, Vary # of Games

Games	p-value	std 2S-GT	std DF
25	0.033	0.0080	0.0075
50	4.0e-10	0.0100	0.0063
75	2.3e-06	0.0130	0.0080
150	0.0093	0.0160	0.0088

Table 3: Synthetic Data, 8 Targets, Vary # of Features

Features	p-value	std 2S-GT	std DF
10	0.031	0.0153	0.0153
25	1.7e-4	0.0125	0.0107
50	0.65	0.0152	0.0099
100	4.0e-10	0.0100	0.0063
200	1.1e-29	0.0141	0.0058

Table 4: Synthetic Data, 24 Targets, Vary # of Attacks

Attacks	p-value	std 2S-GT	std DF
2	0.0029	0.02116	0.0071
6	1.5e-4	0.0206	0.0099
10	0.030	0.0226	0.0094
14	0.89	0.0187	0.0120
18	0.75	0.0172	0.0127

Table 5: Synthetic Data, 24 Targets, Vary # of Games

Games	p-value	std 2S-GT	std DF
10	0.32	0.0143	0.0087
25	0.62	0.0289	0.0124
50	1.1e-9	0.0291	0.0140
100	1.5e-4	0.0352	0.0148

Table 6: Synthetic Data, 24 Targets, Vary # of Features

Features	p-value	std 2S-GT	std DF
10	0.42	0.0115	0.0110
25	2.9e-6	0.0120	0.0097
50	9.2e-10	0.0135	0.0088
100	1.6e-5	0.0207	0.0081
200	9.7e-48	0.0164	0.0051

Table 8: Human-Subject Data, 24 Targets, Vary # of Attacks

Attacks	p-value	std 2S-GT	std DF
1	0.062	0.0067	0.0134
5	0.0012	0.0042	0.0127
10	0.0050	0.0071	0.0127
20	0.0040	0.0055	0.0117
30	0.0023	0.0052	0.0120

Table 9: Human-Subject Data, 8 Targets, Vary # of Games

Games	p-value	std 2S-GT	std DF
5	0.0021	0.0072	0.0039
10	0.88	0.0073	0.0061
15	0.24	0.0069	0.0081
20	0.038	0.0079	0.0095
25	0.24	0.0080	0.0106
30	0.38	0.0088	0.0093

Table 10: Human-Subject Data, 24 Targets, Vary # of Games

Games	p-value	std 2S-GT	std DF
5	2.2e-10	0.0049	0.0032
10	2.1e-4	0.0058	0.0062
15	4.7e-4	0.0069	0.0127
20	0.0059	0.0077	0.0123
25	0.025	0.0068	0.0102
30	0.074	0.0077	0.0124

Equations

$$\max_{\bm{p}\textrm{ satisfying }C_{d}}\textsc{DEU}(\bm{p};\bm{q})=\max_{\bm{p}\textrm{ satisfying }C_{d}}\sum_{i\in\mathcal{T}}(1-\bm{p}_{i})\bm{q}_{i}(\bm{u}_{a},\bm{p})\bm{u}_{d}(i).$$

$$\bm{q}_{i}(\bm{p},\bm{y})\propto\exp(w\bm{p}_{i}+\phi(\bm{y}_{i})),$$

$$\max_{\bm{x}\textrm{ satisfying }C_{d}}\mathbb{E}_{\langle\mathcal{T},\bm{y},\bm{u}_{d}\rangle\sim D_{\textrm{test}}}[\textsc{DEU}(\bm{x}(\mathcal{T},\bm{y},\bm{u}_{d});\bm{q})].$$

$$\mathcal{L}(\hat{\bm{q}}(\bm{y},\bm{p}_{\textrm{historical}}),\mathcal{A})=-\sum_{i\in\mathcal{T}}\tilde{\bm{q}}_{i}\log(\hat{\bm{q}}_{i}(\bm{y},\bm{p}_{\textrm{historical}})),$$

$$\max_{\bm{x}\textrm{ satisfying }C_{d}}\textsc{DEU}(\bm{x}(\mathcal{T},\bm{y},\bm{u}_{d});\hat{\bm{q}}).$$

$$\frac{z_{0}+\epsilon}{z_{1}+\epsilon}$$

$$\frac{(1-(z_{1}-\epsilon))z_{1}}{(1-(z_{0}-\epsilon))z_{0}}$$

$$\bm{x}^{*}(\hat{\bm{q}})=\operatorname*{arg\,max}_{\bm{x}\textrm{ satisfying }C_{d}}\textsc{DEU}(\bm{x};\hat{\bm{q}})$$

$$\textsc{DEU}(\hat{\bm{q}})=\mathbb{E}_{\langle\mathcal{T},\bm{y},\bm{u}_{d}\rangle\sim D_{\textrm{test}}}[\textsc{DEU}(\bm{x}^{*}(\hat{\bm{q}});\bm{q})],$$

$$\frac{\partial\textsc{DEU}(\hat{\bm{q}})}{\partial\hat{\bm{q}}}=\mathbb{E}_{\langle\mathcal{T},\bm{y},\bm{u}_{d}\rangle\sim D_{\textrm{train}}}\left[\frac{\partial\textsc{DEU}(\bm{x}^{*}(\hat{\bm{q}});\bm{q})}{\partial\bm{x}^{*}(\hat{\bm{q}})}\frac{\partial\bm{x}^{*}(\hat{\bm{q}})}{\partial\hat{\bm{q}}}\right].$$

$$\hat{\phi}(\bm{y}_{i})=\operatorname*{arg\,max}_{\phi}\prod_{a\in\mathcal{A}}\Pr(\textrm{Categorical}(\exp(w\bm{p}_{i}+\phi))=a).$$

$$\sum_{i\in\mathcal{T}}(1-\bm{x}^{*}(\hat{\bm{q}})_{i})\exp(w\bm{x}^{*}(\hat{\bm{q}})_{i}+\hat{\phi}(\bm{y}_{i}))\bm{u}_{d}(i).$$

$$g(p)-f(p)\leq\delta$$

$$\begin{aligned}g(p^{\prime})&=f(p^{*})-(p^{\prime}-p^{*})\epsilon+\delta\\&\leq g(p^{*})-(p^{\prime}-p^{*})\epsilon+\delta\quad\textrm{(since }f(p)\leq g(p)\textrm{ holds for all }p\textrm{)}\end{aligned}$$

$$\begin{aligned}g(p)-f(p)&=[(1-p)(1-\epsilon)-p\epsilon]q\\&\leq q\end{aligned}$$

$$\frac{e^{\lambda(1-\epsilon)(1-p)}}{e^{\lambda\epsilon p}+e^{\lambda(1-\epsilon)(1-p)}}=\frac{1}{1+e^{\lambda[\epsilon p-(1-\epsilon)(1-p)]}}$$

$$g(x)\leq 0$$

$$Ax=0$$

$$\mu\geq 0$$

$$\mu\odot g(x)$$

$$\nabla_{x}f(x,\theta)+\mu^{\top}\nabla g(x)+\nu^{\top}A=0$$

$$\begin{bmatrix}\frac{\partial x}{\partial\theta}\\\frac{\partial\mu}{\partial\theta}\\\frac{\partial\nu}{\partial\theta}\end{bmatrix}=-K^{-1}\begin{bmatrix}\frac{\partial\nabla_{x}f(x,\theta)}{\partial\theta}\\0\\0\end{bmatrix}$$

$$K=\begin{bmatrix}\nabla_{x}^{2}f(x,\theta)+\sum_{i=1}^{n_{ineq}}\mu_{i}\nabla_{x}^{2}g(x)&\left(\frac{\partial g(x)}{\partial x}\right)^{T}&A^{T}\\diag(\mu)\left(\frac{\partial g(x)}{\partial x}\right)&diag(g(x))&0\\A&0&0\end{bmatrix}$$


Taxonomy

Topics: Terrorism, Counterterrorism, and Political Violence · Adversarial Robustness in Machine Learning · Crime, Illicit Activities, and Governance

Full text

End-to-End Game-Focused Learning of Adversary Behavior in Security Games

Andrew Perrault,1 Bryan Wilder,1 Eric Ewing,2 Aditya Mate,1 Bistra Dilkina,2 Milind Tambe1

1Center for Research on Computation and Society, Harvard

2Center for Artificial Intelligence in Society, University of Southern California


Abstract

Stackelberg security games are a critical tool for maximizing the utility of limited defense resources to protect important targets from an intelligent adversary. Motivated by green security, where the defender may only observe an adversary’s response to defense on a limited set of targets, we study the problem of learning a defense that generalizes well to a new set of targets with novel feature values and combinations. Traditionally, this problem has been addressed via a two-stage approach where an adversary model is trained to maximize predictive accuracy without considering the defender’s optimization problem. We develop an end-to-end game-focused approach, where the adversary model is trained to maximize a surrogate for the defender’s expected utility. We show both in theory and experimental results that our game-focused approach achieves higher defender expected utility than the two-stage alternative when there is limited data.

1 Introduction

Many real-world settings call for allocating limited defender resources against a strategic adversary, such as protecting public infrastructure (?), transportation networks (?), large public events (?), urban crime (?), and green security (?). Stackelberg security games (SSGs) are a critical framework for computing defender strategies that maximize expected defender utility to protect important targets from an intelligent adversary (?).

In many SSG settings, the adversary’s preferences over targets are not known a priori. In early work, the adversary’s preferences were estimated via the judgments of human experts (?). In domains where there are many interactions with the adversary, we can leverage this history using machine learning instead. This line of work, started by Letchford et al. (?), has received extensive attention in recent years (see related work).

We use protecting wildlife from poaching (?) as a motivating example. The adversary’s (poacher’s) behavior is observable because snares are left behind, which rangers aim to remove (see Fig. 1). Various features such as animal counts, distance to the edge of the park, weather and time of year may affect how attractive a particular target is to the adversary. The training data consists of adversary behavior in the context of particular sets of targets, and our objective is to achieve a high defender utility when we are playing against the same adversary and new sets of targets. For the problem of poaching prevention, Gholami et al. (?) use around 20 features per target and observe tens of thousands of distinct targets (i.e., combinations of feature values). Rangers patrol a small portion of the park each day and aim to predict poacher behavior across a large park consisting of targets with novel feature values.

The standard approach to the problem breaks it into two stages. In the first, the adversary model is fit to the historical data to minimize an accuracy-based loss function, and in the second, the defender covers the targets (via a mixed strategy) to maximize utility against the learned model. It is true that, in a worst-case analysis, a model that is more accurate in a global sense induces a better coverage (see Sinha et al. (?) and Haghtalab et al. (?)), but a model that accurately predicts the relative values of “important” targets may achieve high defender utility with weak global accuracy. For example, in a game with many low-value targets, the estimates of the values of the low-value targets can be wildly inaccurate and still yield a high defender utility (see Sec. 3 for an example).

In our game-focused approach, in contrast to a two-stage approach, we focus on learning a model that yields a high defender expected utility from the start. We train a predictive model end-to-end (i.e., considering the effects of the optimization problem) using an estimate of defender expected utility as our loss function. This approach has the advantage of focusing learning on “important” targets that have a large impact on the defender expected utility, and not being distracted by irrelevant targets (e.g., those with low value for both the attacker and defender). For example, in our human subject data experiments, two-stage achieves 2–20% lower cross entropy, but worse defender expected utility. Performing game-focused training requires us to overcome several technical challenges, including forming counterfactual estimates of the defender’s expected utility and differentiating through the solution of a nonconvex optimization problem.

In summary, our contributions are: First, we provide a theoretical justification for why our game-focused approach can outperform two-stage approaches in SSGs. Second, we overcome technical challenges to develop a game-focused learning pipeline for SSGs. Third, we test our approach on a combination of synthetic and human subject data and show that game-focused learning outperforms a two-stage approach in settings where the amount of data available is small and when there is wide variation in the adversary’s values for the targets.

Related Work.

There is a rich literature on SSGs, ranging from uncertain observability (?) to disguised defender resources (?) to extensive-form models (?) to patrolling on graphs (?; ?). In particular, learning to maximize the defender’s payoff from repeated play has been a subject of extensive study. It is important to distinguish between the active learning case (?; ?; ?), where the defender may gather information through her choice of strategy, and the passive case, where the defender does not have control over the training data. We consider each case to be valuable but focus on the passive case because we believe it is encountered more frequently in domains of interest. In the anti-poaching setting, parks often have historical data that far exceeds what can be actively collected in the short term.

Bounded rationality models are a critical component of the SSG literature because they allow the defender to achieve higher utilities against many realistic attackers. They have been the subject of extensive study since their introduction by Pita et al. (?) (e.g., Cui and John (?) and Abbasi et al. (?), who develop a distinct line of work, inspired by psychology). We focus on the quantal response (QR) (?) model and especially the subjective utility quantal response (SUQR) model (?). SUQR is simple, widely used and has been shown to be effective in practice.

Sinha et al. (?) and Haghtalab et al. (?) provide probabilistic bounds on the learning error for two-stage approaches for generalized SUQR attackers in the passive and weakly active cases, respectively. Both works translate these bounds into the guarantees on the defender’s expected utility in the worst case. Our focus is on the orthogonal issue of how to train any differentiable predictive model end-to-end with gradient descent, including deep learning architectures that are the state of the art for many learning tasks. These methods can scale to many features and complicated relationships and are one of the main appeals of two-stage approaches. We use SUQR implemented on a neural network as an illustrative example, but our approach can be applied to other bounded rationality models, as we discuss in Sec. 4.

Outside of SSGs, Ling et al. (?; ?) use a differentiable QR equilibrium solver to reconstruct the payoffs of both players in a game from observed play. Hartford et al. (?) and Wright and Leyton-Brown (?) study the problem of predicting play in unseen two-player simultaneous-move games with a small number of actions per player, and Hartford et al. (?) build a deep learning architecture for this purpose. These works focus on prediction rather than optimization.

We briefly discuss related work in end-to-end learning for decision-making in non-game-theoretic contexts (see Donti and Kolter (?) for a more complete discussion). New technical issues arise due to the presence of the adversary, such as counterfactual estimation and nonconvexity. In their study of parameter sensitivity, Rockafellar and Wets (?) provide a comprehensive theoretical analysis of differentiating through optimization. Bengio (?) was first to train a learning system for a more complex task by directly differentiating through the outcome of applying parameterized rules. Amos and Kolter (?) provide analytical derivatives for constrained convex problems. This analytic approach is extended to stochastic optimization by Donti et al. (?) and to submodular optimization by Wilder et al. (?). Demirovic et al. (?) provide a theoretically optimal framework for ranking problems with linear objectives.

2 Setting

Stackelberg Security Games (SSGs).

Our focus is on optimizing defender strategies for SSGs, which describe the problem of protecting a set of targets given limited defense resources and constraints on how the resources may be deployed (?). Formally, an SSG is a tuple $\{\mathcal{T},\bm{u}_{d},\bm{u}_{a},C_{d}\}$, where $\mathcal{T}$ is a set of targets, $\bm{u}_{d}:\mathcal{T}\rightarrow\mathbb{R}_{\leq 0}$ gives the defender’s payoff when a target is successfully attacked, $\bm{u}_{a}:\mathcal{T}\rightarrow\mathbb{R}_{\geq 0}$ gives the attacker’s, and $C_{d}$ is the set of constraints the defender’s strategy must satisfy. Both players receive a payoff of zero when the attacker attacks a target that is defended.

The game has two time steps. First, the defender computes a mixed strategy that satisfies the constraints $C_{d}$, which induces a marginal coverage probability (or coverage) $\bm{p}=\{\bm{p}_{i}:i\in\mathcal{T}\}$. Then the attacker’s attack function $\bm{q}$ determines which target is attacked, inducing an attack probability for each target. The defender seeks to maximize her expected utility:

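$$\max_{\bm{p}\textrm{ satisfying }C_{d}}\textsc{DEU}(\bm{p};\bm{q})=\max_{\bm{p}\textrm{ satisfying }C_{d}}\sum_{i\in\mathcal{T}}(1-\bm{p}_{i})\bm{q}_{i}(\bm{u}_{a},\bm{p})\bm{u}_{d}(i).\tag{1}$$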

The attacker’s $\bm{q}$ function can represent a rational attacker, e.g., $\bm{q}_{i}(\bm{p},\bm{u}_{a})=1\textrm{ if }i=\operatorname*{arg\!max}_{j\in\mathcal{T}}(1-\bm{p}_{j})\bm{u}_{a}(j)\textrm{ else }0$, or a boundedly rational attacker. A QR attacker (?) attacks each target with probability proportional to the exponential of its payoff scaled by a constant $\lambda$, i.e., $\bm{q}_{i}(\bm{p})\propto\exp(\lambda(1-\bm{p}_{i})\bm{u}_{a}(i))$. An SUQR (?) attacker attacks each target with probability proportional to the exponential of an attractiveness function:

$$\bm{q}_{i}(\bm{p},\bm{y})\propto\exp(w\bm{p}_{i}+\phi(\bm{y}_{i})),\tag{2}$$

where $\bm{y}_{i}$ is a vector of features of target $i$ and $w<0$ is a constant. We call $\phi$ the target value function. We focus our effort on learning $\phi$ because $w$ can easily be learned using existing techniques, such as the maximum likelihood estimation (MLE) approach of Sinha et al. (?), assuming we have the ability to play different defender strategies against the same set of targets. MLE estimates converge rapidly, as shown by Fig. 2, which demonstrates learning in an eight-target game, averaged over 20 trials. Once we have an accurate $w$ estimate, it can be transferred to all games against the same adversary.
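The SUQR attack distribution and the defender's expected utility defined in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy coverage, the target values, and the choice $w=-4$ are all assumptions made for the example, and $\phi(\bm{y}_{i})$ is passed in directly as a number rather than computed from features.

```python
import math


def suqr_attack_probs(coverage, target_values, w):
    """Attack distribution with q_i proportional to exp(w * p_i + phi(y_i)).

    `target_values` holds phi(y_i) for each target (hypothetical inputs;
    in the paper phi is a learned function of the features y_i).
    """
    logits = [w * p + phi for p, phi in zip(coverage, target_values)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [x / total for x in weights]


def defender_expected_utility(coverage, attack_probs, defender_values):
    """DEU(p; q) = sum_i (1 - p_i) * q_i * u_d(i), with u_d(i) <= 0."""
    return sum((1.0 - p) * q * u
               for p, q, u in zip(coverage, attack_probs, defender_values))


# Toy three-target game; all numbers below are made up for illustration.
coverage = [0.5, 0.3, 0.2]
phi = [1.0, 0.5, 0.0]
u_d = [-1.0, -1.0, -1.0]
w = -4.0  # w < 0: more coverage makes a target less attractive

q = suqr_attack_probs(coverage, phi, w)
deu = defender_expected_utility(coverage, q, u_d)
print(q, deu)
```

Because $w<0$, the heavily covered first target ends up the least likely to be attacked here even though it has the largest $\phi$.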

Learning in SSGs.

We consider the problem of learning to play against an attacker with an unknown attack function $\bm{q}$. We observe attacks made by the adversary against sets of targets with differing features, and our goal is to generalize to new sets of targets with unseen feature values.

Formally, let $\langle\bm{q},C_{d},D_{\textrm{train}},D_{\textrm{test}}\rangle$ be an instance of a Stackelberg security game with latent attack function (SSG-LA). The function $\bm{q}$, which is not observed by the defender, is the true mapping from the features and coverage of each target to the probability that the attacker attacks that target. $C_{d}$ is the set of constraints that a mixed strategy defense must satisfy for the defender. $D_{\textrm{train}}$ are training games of the form $\langle\mathcal{T},\bm{y},\mathcal{A},\bm{u}_{d},\bm{p}_{\textrm{historical}}\rangle$, where $\mathcal{T}$ is the set of targets, and $\bm{y}$, $\mathcal{A}$, $\bm{u}_{d}$ and $\bm{p}_{\textrm{historical}}$ are the features, observed attacks, defender’s utility function, and historical coverage probabilities, respectively, for each target $i\in\mathcal{T}$. $D_{\textrm{test}}$ are test games $\langle\mathcal{T},\bm{y},\bm{u}_{d}\rangle$, each containing a set of targets and the associated features and defender values for each target. We assume that all games are drawn i.i.d. In a green security setting, the training games represent the results of patrols on limited areas of the park and the test games represent the entire park.

The defender’s goal is to select a coverage function $\bm{x}$ that takes the parameters of each test game as input and maximizes her expected utility across the test games against the attacker’s true $\bm{q}$:

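$$\max_{\bm{x}\textrm{ satisfying }C_{d}}\mathbb{E}_{\langle\mathcal{T},\bm{y},\bm{u}_{d}\rangle\sim D_{\textrm{test}}}[\textsc{DEU}(\bm{x}(\mathcal{T},\bm{y},\bm{u}_{d});\bm{q})].\tag{3}$$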

To achieve this, she can observe the attacker’s behavior in the training data and learn how he values different combinations of features.

Two-Stage Approach.

A standard two-stage approach to the defender’s problem is to estimate the attacker’s $\bm{q}$ function from the training data and optimize against the estimate during testing. This process, which is illustrated in the top of Fig. 3, resembles multiclass classification where the targets are the classes: the inputs are the target features and historical coverages, and the output is a distribution over the predicted attack. Specifically, the defender fits a function $\hat{\bm{q}}$ to the training data that minimizes a loss function. Using the cross entropy, the loss for a particular training example is

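$$\mathcal{L}(\hat{\bm{q}}(\bm{y},\bm{p}_{\textrm{historical}}),\mathcal{A})=-\sum_{i\in\mathcal{T}}\tilde{\bm{q}}_{i}\log(\hat{\bm{q}}_{i}(\bm{y},\bm{p}_{\textrm{historical}})),\tag{4}$$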

where $\tilde{\bm{q}}_{i}=\frac{\mathcal{A}_{i}}{|\mathcal{A}|}$ is the empirical attack distribution and $\mathcal{A}_{i}$ is the number of historical attacks that were observed on target $i$. Note that we use hats to indicate model outputs and tildes to indicate the ground truth. For each test game $\langle\mathcal{T},\bm{y},\bm{u}_{d}\rangle$, coverage is selected by maximizing the defender’s expected utility assuming the attack function is $\hat{\bm{q}}$:

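$$\max_{\bm{x}\textrm{ satisfying }C_{d}}\textsc{DEU}(\bm{x}(\mathcal{T},\bm{y},\bm{u}_{d});\hat{\bm{q}}).\tag{5}$$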
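As a rough illustration of the first stage, the cross-entropy loss for one training game can be computed from the observed attack counts and a predicted attack distribution. The sketch below uses a hypothetical function name and made-up counts, and it takes $\hat{\bm{q}}$ as a plain list rather than as the output of a trained model.

```python
import math


def cross_entropy_loss(attack_counts, q_hat):
    """Cross entropy between empirical attacks and a predicted distribution.

    attack_counts[i] is the number of observed attacks on target i (A_i);
    q_hat[i] is the model's predicted attack probability for target i.
    """
    total = sum(attack_counts)
    loss = 0.0
    for count, q in zip(attack_counts, q_hat):
        if count > 0:  # terms with zero empirical mass contribute nothing
            loss -= (count / total) * math.log(q)
    return loss


# Hypothetical training game: 10 observed attacks over three targets.
counts = [6, 3, 1]
loss_uniform = cross_entropy_loss(counts, [1 / 3, 1 / 3, 1 / 3])
loss_matched = cross_entropy_loss(counts, [0.6, 0.3, 0.1])
print(loss_uniform, loss_matched)
```

The loss only measures how well the prediction matches the empirical frequencies; the defender's utilities $\bm{u}_{d}$ do not appear in it, which is the gap the game-focused approach targets.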

3 Impact of Two-Stage Learning on DEU

We begin by developing intuitions about when an inaccurate predictive model can lead to high defender expected utility. We study the rational attacker case for simplicity; results in the rational case can be directly translated to the QR case (which is a smooth version of rationality). Consider an SSG with three targets and a single defense resource. The defender has equal value for all three and the attacker has true values of $(0.4,0.4,0.2)$, yielding an optimal coverage of $p^{*}=(0.5,0.5,0.0)$. Suppose the defender estimates the attacker’s target values to be $(0.5,0.5,0.0)$. This estimate yields the optimal coverage, despite overestimating the value of the first two targets by 25% and underestimating the value of the third by 100%. In contrast, the estimate $(0.4-\epsilon,0.4-\epsilon,0.2+2\epsilon)$ does not yield optimal coverage despite being within $\epsilon$ of the ground truth target values.
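The three-target example can be checked numerically. The sketch below rests on assumptions that are not spelled out in the text: the defender's coverage against an estimate is taken to be the one that equalizes the attacker's estimated expected utility over covered targets (found by bisection), the rational attacker breaks ties in the defender's favor, and $\epsilon=0.05$ is an arbitrary choice. All function names are hypothetical.

```python
def equalizing_coverage(values, resources=1.0, iters=200):
    """Coverage that equalizes (1 - p_i) * values[i] over covered targets.

    Found by bisection on the common attacker utility c, with
    p_i = max(0, 1 - c / values[i]). Targets with value 0 get no coverage.
    """
    def total_coverage(c):
        return sum(max(0.0, 1.0 - c / v) for v in values if v > 0)

    lo, hi = 0.0, max(values)  # total_coverage is decreasing in c
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if total_coverage(mid) > resources:
            lo = mid
        else:
            hi = mid
    c = hi
    return [max(0.0, 1.0 - c / v) if v > 0 else 0.0 for v in values]


def true_deu(coverage, true_values, defender_values, tol=1e-9):
    """DEU against a rational attacker, breaking ties in the defender's favor."""
    attacker_eu = [(1.0 - p) * z for p, z in zip(coverage, true_values)]
    best = max(attacker_eu)
    tied = [i for i, eu in enumerate(attacker_eu) if eu >= best - tol]
    return max((1.0 - coverage[i]) * defender_values[i] for i in tied)


true_values = [0.4, 0.4, 0.2]
u_d = [-1.0, -1.0, -1.0]  # equal defender values
eps = 0.05  # hypothetical perturbation size

deu_true = true_deu(equalizing_coverage(true_values), true_values, u_d)
deu_coarse = true_deu(equalizing_coverage([0.5, 0.5, 0.0]), true_values, u_d)
deu_close = true_deu(
    equalizing_coverage([0.4 - eps, 0.4 - eps, 0.2 + 2 * eps]),
    true_values, u_d)
print(deu_true, deu_coarse, deu_close)
```

Under these assumptions the coarse estimate $(0.5,0.5,0.0)$ reproduces the utility of the true values (about $-0.5$), while the numerically closer estimate shifts coverage onto the third target and ends up near $-0.63$.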

We characterize the extent to which two predictive models with the same accuracy-based loss can differ in terms of the defender’s expected utility for rational attacker, two-target SSGs with both equal and zero-sum defender target values. From the perspective of a two-stage approach with an accuracy-based loss, any two models with the same loss are considered equally good. In contrast, a game-focused model with an oracle for the defender’s expected utility would automatically prefer a model with higher defender utility. We additionally extend the latter result to QR attackers.

The theory shows two key points. First, the error in estimates of attacker’s utilities can have highly variable effects on the defender’s expected utility. As we saw in the example, estimation error can have no effect in certain cases. The defender’s preference for the distribution of estimation error depends on both the relative values of the targets and the correlation between the target values of the attacker and defender. These properties are challenging to replicate in hand-tuned two-stage approaches. Second, game-focused learning is more beneficial when the attacker’s true values across targets exhibit greater variance. We return to this intuition in our experiments.

We begin with the case where the defender values all targets equally (and recall that we assume that both the attacker and defender receive a payoff of zero for an unsuccessful attack). For complete proofs of all theorems, see the full version of the paper.

Theorem 1 (Equal defender values).

Consider a two-target SSG with a rational attacker, equal defender values for each target, and a single defense resource to allocate, which is not subject to scheduling constraints (i.e., any nonnegative marginal coverage that sums to one is feasible). Let $z_{0}\geq z_{1}$ be the attacker’s values for the targets, which are observed by the attacker, but not the defender, and we assume w.l.o.g. are non-negative and sum to 1. Let the defender’s values for the targets be -1 for each.

The defender has an estimate of the attacker’s values $(\hat{z}_{0},\hat{z}_{1})$ with mean squared error (MSE) $\epsilon^{2}$. Suppose the defender optimizes coverage against this estimate. If $\epsilon^{2}\leq(1-z_{0})^{2}$ and $\epsilon^{2}\leq(z_{0}-z_{1})^{2}$, the ratio between the highest $\textsc{DEU}$ under the estimate of $(\hat{z}_{0},\hat{z}_{1})$ with MSE $\epsilon^{2}$ and the lowest $\textsc{DEU}$ is:

$$\frac{z_{0}+\epsilon}{z_{1}+\epsilon}\tag{6}$$

Proof Sketch.

There are two normalized estimates of the attacker’s values that have MSE $\epsilon^{2}$: $(z_{0}+\epsilon,z_{1}-\epsilon)$ and $(z_{0}-\epsilon,z_{1}+\epsilon)$. The attacker will attack the target whose value the defender underestimates. The defender prefers the latter case, where the attacker selects the higher value target, because this target has more coverage and successful attacks have the same cost on both targets. ∎

Thus, in the equal value case, it is generally better for the defender to underestimate the attacker’s values for high-value targets. This dynamic is reversed in the zero-sum case.

Theorem 2 (Zero-sum).

Consider the same setting as Thm. 1 except the utilities are zero-sum. If $\epsilon^{2}\leq(1-z_{0})^{2}$, the ratio between the highest $\textsc{DEU}$ under the estimate of $(\hat{z}_{0},\hat{z}_{1})$ with MSE $\epsilon^{2}$ and the lowest $\textsc{DEU}$ is:

$$\frac{(1-(z_{1}-\epsilon))z_{1}}{(1-(z_{0}-\epsilon))z_{0}}\tag{7}$$

Proof Sketch.

Similarly to Thm. 1, there are two value estimates with MSE $\epsilon^{2}$. The defender prefers the case where she underestimates the attacker’s value for the lower value target, inducing the attacker to attack it. The lower cost of failures outweighs the attacker getting caught less often. ∎
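Both two-target results can be sanity-checked on a concrete instance. The sketch below assumes that the defender's coverage against an estimate equalizes the estimated attacker utilities, which for two targets whose estimated values sum to one gives $\bm{p}=(\hat{z}_{0},\hat{z}_{1})$; the values $z=(0.7,0.3)$ and $\epsilon=0.1$ are made up, and the function names are hypothetical.

```python
def attacked_target(coverage, true_values):
    """Index the rational attacker picks: argmax of (1 - p_i) * z_i."""
    eu = [(1.0 - p) * z for p, z in zip(coverage, true_values)]
    return max(range(len(eu)), key=lambda i: eu[i])


def deu_under_estimate(z_hat, z, defender_values):
    """True DEU when coverage is chosen against the estimate z_hat.

    Assumes the defender's coverage equalizes the estimated attacker
    utilities, which for two targets with z_hat summing to one is p = z_hat.
    """
    coverage = list(z_hat)
    i = attacked_target(coverage, z)
    return (1.0 - coverage[i]) * defender_values[i]


z = (0.7, 0.3)   # hypothetical true attacker values, z0 >= z1
eps = 0.1        # satisfies eps <= 1 - z0 and eps <= z0 - z1
over = (z[0] + eps, z[1] - eps)   # overestimates the high-value target
under = (z[0] - eps, z[1] + eps)  # underestimates the high-value target

# Theorem 1 setting: equal defender values of -1.
equal = (-1.0, -1.0)
eq_over = deu_under_estimate(over, z, equal)
eq_under = deu_under_estimate(under, z, equal)

# Theorem 2 setting: zero-sum, u_d(i) = -z_i.
zero_sum = (-z[0], -z[1])
zs_over = deu_under_estimate(over, z, zero_sum)
zs_under = deu_under_estimate(under, z, zero_sum)

# Ratios of the two utilities in each setting (over-estimate / under-estimate).
print(eq_over / eq_under, zs_over / zs_under)
```

On this instance the equal-value utilities are about $-0.8$ and $-0.4$, so underestimating the high-value target is better, whereas the zero-sum utilities are about $-0.24$ and $-0.28$, so the preference flips. The printed ratios agree with the closed-form expressions in the two theorems.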

The theory can be extended to QR attackers. In the case of Thm. 2, the defender can lose value $z_{0}\epsilon$, or $\epsilon$ as $z_{0}\to 1$, compared to the optimum because of an unfavorable distribution of estimation error. We show that this carries over to a boundedly rational QR attacker, with the degree of loss converging towards the rational case as $\lambda$ increases.

Theorem 3 (Zero-sum, QR attacker).

Consider the setting of Thm. 2, but in the case of a QR attacker. For any $0\leq\alpha\leq 1$, if $\lambda\geq\frac{2}{(1-\alpha)\epsilon}\log\frac{1}{(1-\alpha)\epsilon}$, the defender’s loss compared to the optimum may be as much as $\alpha(1-\epsilon)\epsilon$ under a target value estimate with MSE $\epsilon^{2}$.
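To get a feel for the trade-off in Thm. 3, the two quantities in its statement can be evaluated directly. This is plain arithmetic on the stated formulas; the function names and the values $\alpha=0.5$, $\epsilon=0.1$ are assumptions for illustration, and the threshold is only finite for $\alpha<1$.

```python
import math


def qr_lambda_threshold(alpha, eps):
    """Smallest lambda allowed by the condition in Thm. 3 (needs alpha < 1)."""
    scale = (1.0 - alpha) * eps
    return (2.0 / scale) * math.log(1.0 / scale)


def qr_loss_bound(alpha, eps):
    """Loss alpha * (1 - eps) * eps that the theorem says may be incurred."""
    return alpha * (1.0 - eps) * eps


# Hypothetical values chosen only for illustration.
alpha, eps = 0.5, 0.1
lam = qr_lambda_threshold(alpha, eps)
loss = qr_loss_bound(alpha, eps)
print(lam, loss)
```

Pushing $\alpha$ towards 1 moves the loss towards the rational-case value $(1-\epsilon)\epsilon$ but raises the required $\lambda$ without bound.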

4 Game-Focused Learning in SSGs

We now present our approach to game-focused learning in SSGs. The key idea is to embed the defender optimization problem into training and compute gradients of DEU with respect to the model’s predictions, which requires us to overcome two technical challenges. First, in the previous section, we assumed we had access to an exact oracle for the defender’s expected utility, but in practice, this is a counterfactual estimation problem. Second, our defender’s optimization is nonconvex and new machinery is required to calculate the derivative of the solution w.r.t. its parameters. We illustrate our approach in the bottom of Fig. 3.

We begin with notation. As we have discussed, the standard two-stage approach may fall short when the loss function (e.g., cross entropy) does not align with the true goal of maximizing expected utility. Ultimately, the defender would like to learn a function $m_{\omega}$ which takes a set of targets and associated features as input and produces $\hat{\bm{q}}$ as output, which then induces a coverage with high expected utility. Note that from a utility-theoretic perspective, it does not matter how accurate $\hat{\bm{q}}$ is, only that the induced coverage has high expected utility. Let

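$$\bm{x}^{*}(\hat{\bm{q}})=\operatorname*{arg\,max}_{\bm{x}\textrm{ satisfying }C_{d}}\textsc{DEU}(\bm{x};\hat{\bm{q}})\tag{8}$$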

be the optimal defender coverage function against an adversary with attack function $\hat{\bm{q}}$. Our goal is to find the $\hat{\bm{q}}$ that maximizes

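$$\textsc{DEU}(\hat{\bm{q}})=\mathbb{E}_{\langle\mathcal{T},\bm{y},\bm{u}_{d}\rangle\sim D_{\textrm{test}}}[\textsc{DEU}(\bm{x}^{*}(\hat{\bm{q}});\bm{q})],\tag{9}$$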

where $\textsc{DEU}(\hat{\bm{q}})$ is the ground truth expected utility of coverage $\bm{x^{*}}(\hat{\bm{q}})$ (recall that $\bm{q}$ is the attacker’s true response function). While we do not have access to $D_{\textrm{test}}$, we can estimate Expr. 9 using samples from $D_{\textrm{train}}$. We would like to calculate the derivative of Expr. 9 w.r.t. $\hat{\bm{q}}$ to use in model training. Using the chain rule:

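$$\frac{\partial\textsc{DEU}(\hat{\bm{q}})}{\partial\hat{\bm{q}}}=\mathbb{E}_{\langle\mathcal{T},\bm{y},\bm{u}_{d}\rangle\sim D_{\textrm{train}}}\left[\frac{\partial\textsc{DEU}(\bm{x}^{*}(\hat{\bm{q}});\bm{q})}{\partial\bm{x}^{*}(\hat{\bm{q}})}\frac{\partial\bm{x}^{*}(\hat{\bm{q}})}{\partial\hat{\bm{q}}}\right].\tag{10}$$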

Here, $\frac{\partial\textsc{DEU}(\bm{x^{*}}(\hat{\bm{q}});\bm{q})}{\partial\bm{x^{*}}(\hat{\bm{q}})}$ describes how the defender’s true utility with respect to $\bm{q}$ changes as a function of her strategy $\bm{x}^{*}$, which is a counterfactual question because we only observe the defender playing a single strategy in this training game. $\frac{\partial\bm{x^{*}}(\hat{\bm{q}})}{\partial\hat{\bm{q}}}$ describes how $\bm{x^{*}}$ depends on the estimated attack function $\hat{\bm{q}}$, which requires differentiating through the nonconvex optimization problem in Eq. 8. If we had a means of calculating both terms, we could then estimate $\frac{\partial\textsc{DEU}(\hat{\bm{q}})}{\partial\hat{\bm{q}}}$ by sampling games from $D_{\textrm{train}}$ and computing gradients on the samples. If $\hat{\bm{q}}$ is itself implemented in a differentiable manner (e.g., a neural network), the entire system may be trained end-to-end via gradient descent. We address each of the two terms separately.
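The forward pass of this objective (optimize coverage against the model, then score that coverage against the true attacker) can be sketched for a toy game. This is not the paper's pipeline: the gradient computation is omitted, a grid search over a two-target, one-resource game is swapped in for the defender's optimizer, the attacker is SUQR with a known $w$, and every name and number is hypothetical.

```python
import math


def suqr_probs(coverage, phi, w):
    """SUQR attack distribution, q_i proportional to exp(w * p_i + phi_i)."""
    logits = [w * p + f for p, f in zip(coverage, phi)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [x / total for x in weights]


def deu(coverage, phi, w, u_d):
    """Defender expected utility of `coverage` against an SUQR attacker."""
    q = suqr_probs(coverage, phi, w)
    return sum((1.0 - p) * qi * u for p, qi, u in zip(coverage, q, u_d))


def best_coverage(phi_hat, w, u_d, grid=1000):
    """x*(q_hat) for a two-target, one-resource game via grid search.

    The grid search stands in for the paper's optimizer; it is used here
    only because it is simple and needs no convexity.
    """
    candidates = [(k / grid, 1.0 - k / grid) for k in range(grid + 1)]
    return max(candidates, key=lambda x: deu(x, phi_hat, w, u_d))


def game_focused_objective(phi_hat, phi_true, w, u_d):
    """DEU(x*(q_hat); q): optimize against the model, score against the truth."""
    x_star = best_coverage(phi_hat, w, u_d)
    return deu(x_star, phi_true, w, u_d)


# Hypothetical two-target game.
w, u_d = -4.0, (-1.0, -1.0)
phi_true = (1.0, 0.0)
score_exact = game_focused_objective(phi_true, phi_true, w, u_d)
score_wrong = game_focused_objective((0.0, 1.0), phi_true, w, u_d)
print(score_exact, score_wrong)
```

A model is judged here only by the utility of the coverage it induces, so two models with different predictive accuracy receive the same score whenever they lead to the same coverage.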

Counterfactual Adversary Estimates

We want to calculate $\frac{\partial\textsc{DEU}(\bm{x^{*}}(\hat{\bm{q}});\bm{q})}{\partial\bm{x^{*}}(\hat{\bm{q}})}$, which describes how the defender’s true utility with respect to $\bm{q}$ depends on her strategy $\bm{x}^{*}$. Computing this term requires a counterfactual estimate of how the attacker would react to a different coverage vector than the historical one. We find that typical datasets only contain a set of sampled attacker responses to a particular historical defender mixed strategy or a small set of mixed strategies. Previous work on end-to-end learning for decision problems (?; ?; ?; ?) assumes that the historical data specifies the utility of any possible decision, but this assumption does not hold in SSGs because they are interactions between strategic agents.

Our approach relies on the adversary using a bounded rationality model that is stochastic and decomposable. It is generally the case that boundedly rational adversaries complicate the process of learning and optimizing in SSGs, e.g., because they cause the optimization to become nonconvex and they add uncertainty to the defender’s adversary model. However, bounded rationality is critical to our counterfactual reasoning strategy because boundedly rational adversaries reveal information about their entire ranking of targets over repeated games against the same defender strategy. For example, consider a three-target game where the defender has covered all three targets equally. QR attackers attack each target proportionally to the expected utility it provides, eventually revealing the attacker’s relative utilities across all of the targets under that particular defender coverage. Without the stochasticity, we would be unable to learn anything other than the attacker’s most preferred target.

The resulting target value estimates are in the context of one particular defender strategy. To estimate the attacker’s response to any defender coverage, we need to replace the historical coverage with an arbitrary one. At first glance, this may seem impossible for a stochastic, bounded rationality model because the attacker could have an arbitrary response to coverage. If we had a rational attacker instead, with known target values, we could compute his reaction to an arbitrary defender coverage, but we could not estimate his relative values for each target (as previously discussed). Here we exploit the decomposability of many bounded rationality models: the impact of the defender’s coverage can be separated from the values of the targets.

We develop an illustrative example of the pipeline for SUQR. We observe samples from the attack distribution $\bm{q}$ , where for SUQR, $\bm{q}_{i}\propto\exp(w\bm{p}_{i}+\phi(\bm{y}_{i}))$ . Because we can estimate $\bm{q}_{i}$ from the empirical attack frequencies and the term $w\bm{p}_{i}$ is known (see Sec. 2), we can invert the $\exp$ function to obtain an estimate of $\phi(\bm{y}_{i})$ . Formally, this corresponds to setting $\hat{\phi}(\bm{y}_{i})$ to the MLE under the empirical attack distribution:

[TABLE]
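The inversion step for SUQR can be sketched as follows. Under the assumption that every target has a positive empirical attack frequency, matching the model distribution to the empirical one gives $\hat{\phi}(\bm{y}_{i})=\log\hat{\bm{q}}_{i}-w\bm{p}_{i}$ up to an additive constant. The function names, the coverage weight `w`, and the numeric values are hypothetical; the constant is fixed here by setting the first target's value to zero.

```python
import math

def suqr_distribution(coverage, phi, w):
    """SUQR attack distribution: q_i proportional to exp(w * p_i + phi_i)."""
    logits = [w * p + f for p, f in zip(coverage, phi)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [x / total for x in weights]

def estimate_phi(attack_freqs, coverage, w):
    """Invert exp: phi_hat_i = log(q_hat_i) - w * p_i, with phi_hat_0 = 0.

    Assumes every frequency is strictly positive; a target that was never
    attacked would need smoothing, which this sketch does not handle.
    """
    raw = [math.log(q) - w * p for q, p in zip(attack_freqs, coverage)]
    return [r - raw[0] for r in raw]

# Hypothetical ground truth, used only to generate attack frequencies.
w = -4.0
true_phi = [0.0, 1.0, -0.5]
historical_coverage = [0.5, 0.3, 0.2]

q_observed = suqr_distribution(historical_coverage, true_phi, w)
phi_hat = estimate_phi(q_observed, historical_coverage, w)
```

In practice `q_observed` would be the empirical attack frequencies rather than the exact distribution, so `phi_hat` would carry sampling noise.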

By exploiting decomposability, we derive relative target value estimates that can be used to estimate the attacker’s behavior under an arbitrary coverage. Our estimates have two key limitations. First, they provide no information about the $\phi$ function at feature values other than the observed $\bm{y}_{i}$. Second, they are unique only up to an additive constant. Despite these limitations, they suffice to allow us to simulate the defender’s expected utility for any training data point $\langle\mathcal{T},\bm{y},\mathcal{A},\bm{u}_{d},\bm{p}_{\textrm{historical}}\rangle$ as

[TABLE]
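A sketch of the resulting counterfactual simulation is given below. It assumes a particular form of the defender objective, $\textsc{DEU}=\sum_{i}\bm{q}_{i}(1-\bm{p}_{i})\bm{u}_{d}(i)$ with $\bm{u}_{d}(i)\leq 0$, which may differ from the definition used elsewhere in the paper; the names and numbers are hypothetical. The example also checks that the additive-constant ambiguity in $\hat{\phi}$ does not affect the simulated utility.

```python
import math

def suqr_distribution(coverage, phi, w):
    """SUQR attack distribution: q_i proportional to exp(w * p_i + phi_i)."""
    logits = [w * p + f for p, f in zip(coverage, phi)]
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [x / total for x in weights]

def counterfactual_deu(coverage, phi_hat, w, defender_vals):
    """Simulated defender utility under an arbitrary coverage vector.

    ASSUMPTION: the defender loses defender_vals[i] (<= 0) when target i is
    attacked while uncovered, and receives 0 otherwise.
    """
    q = suqr_distribution(coverage, phi_hat, w)
    return sum(qi * (1.0 - pi) * ui
               for qi, pi, ui in zip(q, coverage, defender_vals))

# Hypothetical estimates and game parameters.
w = -4.0
phi_hat = [0.0, 1.0, -0.5]
defender_vals = [-1.0, -2.0, -0.5]
historical = [0.5, 0.3, 0.2]
alternative = [0.2, 0.6, 0.2]  # a coverage never observed in the data

deu_hist = counterfactual_deu(historical, phi_hat, w, defender_vals)
deu_alt = counterfactual_deu(alternative, phi_hat, w, defender_vals)

# Shifting every phi_hat_i by the same constant leaves the result unchanged.
shifted = [f + 3.0 for f in phi_hat]
deu_alt_shifted = counterfactual_deu(alternative, shifted, w, defender_vals)
```

Because the softmax normalization cancels any common shift, the second limitation noted above is harmless for this computation.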

We briefly discuss two issues that arise when applying this procedure to other bounded rationality models. First, the model needs to provide meaningful $\hat{\phi}$ estimates, which is where the rational attacker model fails. Second, the model needs to be decomposable into the effects of coverage and the inherent attractiveness of the targets, and the parameters of this decomposition need to be easily estimable (as we show is the case for SUQR in Sec. 2). Most models satisfy this condition, including SHARP (?), PT and QBRM (?).

Gradients of Nonconvex Optimization

The optimization problem which produces $\bm{x}^{*}(\hat{\bm{q}})$ is typically nonconvex when the adversary is boundedly rational. This complicates the process of differentiating through the defender problem to obtain $\frac{\partial\bm{x^{*}}(\hat{\bm{q}})}{\partial\hat{\bm{q}}}$ , as previous approaches rely on either a convex optimization problem (?) or a cleverly chosen convex surrogate for a nonconvex problem (?). In contrast, our approach produces correct gradients for many nonconvex problems. The key idea is to fit a quadratic program around the optimal point returned by a blackbox nonconvex solver. Intuitively, this works well when the local neighborhood is, in fact, convex, and fortunately, this is the case for many optimization problems against boundedly rational attackers.

Specifically, we consider the generic problem $\min_{\bm{x}\in\mathcal{X}}f(\bm{x},\theta)$ where $f$ is a (potentially nonconvex) objective which depends on a learned parameter $\theta$ . $\mathcal{X}$ is a feasible set that is representable as $\{\bm{x}:g_{1}(\bm{x}),\ldots,g_{m}(\bm{x})\leq 0,h_{1}(\bm{x}),\ldots,h_{\ell}(\bm{x})=0\}$ for some convex functions $g_{1},\ldots,g_{m}$ and affine functions $h_{1},\ldots,h_{\ell}$ . We assume there exists some $\bm{x}\in\mathcal{X}$ with $\bm{g}(\bm{x})<0$ , where $\bm{g}$ is the vector of constraints. In SSGs, $f$ is the defender objective DEU, $\theta$ is the attack function $\hat{\bm{q}}$ , and $\mathcal{X}$ is the set of $\bm{x}$ satisfying $C_{d}$ . We assume that $f$ is twice continuously differentiable. These two assumptions capture smooth nonconvex problems over a nondegenerate convex feasible set.

Suppose that we can obtain a local optimum of $f$ . Formally, we say that $\bm{x}$ is a strict local minimizer of $f$ if (1) there exist $\bm{\mu}\in R^{m}_{+}$ and $\bm{\nu}\in R^{\ell}$ such that $\nabla_{\bm{x}}f(\bm{x},\theta)+\bm{\mu}^{\top}\nabla\bm{g}(\bm{x})+\bm{\nu}^{\top}\nabla\bm{h}(\bm{x})=0$ and $\bm{\mu}\odot\bm{g}(\bm{x})=0$ and (2) $\nabla^{2}f(\bm{x},\theta)\succ 0$ . Intuitively, the first condition is first-order stationarity, where $\bm{\mu}$ and $\bm{\nu}$ are dual multipliers for the constraints, while the second condition says that the objective is strictly convex at $\bm{x}$ (i.e., we have a strict local minimum, not a plateau or saddle point). We prove the following:

Theorem 4.

Let $\bm{x}$ be a strict local minimizer of $f$ over $\mathcal{X}$ . Then, except on a measure zero set, there exists a convex set $\mathcal{I}$ around $\bm{x}$ such that $\bm{x}_{\mathcal{I}}^{*}(\theta)=\arg\min_{x\in\mathcal{I}\cap\mathcal{X}}f(\bm{x},\theta)$ is differentiable. The gradients of $\bm{x}_{\mathcal{I}}^{*}(\theta)$ with respect to $\theta$ are given by the gradients of solutions to the local quadratic approximation $\min_{\bm{x}\in\mathcal{X}}\frac{1}{2}\bm{x}^{\top}\nabla^{2}f(\bm{x},\theta)\bm{x}+\bm{x}^{\top}\nabla f(\bm{x},\theta)$ .

This states that the local minimizer within the region output by the nonconvex solver varies smoothly with $\theta$ , and we can obtain gradients of it by applying existing techniques (?) to the local quadratic approximation. It is easy to verify that the defender utility maximization problem for an SUQR attacker satisfies the assumptions of Thm. 4 since the objective is smooth and typical constraint sets for SSGs are polytopes with nonempty interior (see (?) for a list of examples). Our approach is quite general and applies to a range of behavioral models such as QR, SUQR, and SHARP since the defender optimization problem remains smooth in all.
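The idea behind Thm. 4 can be illustrated in the simplest setting: a one-dimensional, unconstrained nonconvex objective. This is a sketch under stated assumptions, not the paper's implementation. The objective $f(x,\theta)=x^{4}-2x^{2}+\theta x$ is a hypothetical stand-in for the defender problem, Newton's method stands in for the blackbox solver, and no constraint is active, so differentiating the local quadratic model reduces to the implicit-function formula $\frac{dx^{*}}{d\theta}=-\left(\nabla^{2}_{xx}f\right)^{-1}\nabla^{2}_{x\theta}f$.

```python
def grad_x(x, theta):
    """df/dx for f(x, theta) = x**4 - 2*x**2 + theta*x."""
    return 4.0 * x**3 - 4.0 * x + theta

def hess_xx(x):
    """d2f/dx2; negative near x = 0, so f is nonconvex."""
    return 12.0 * x**2 - 4.0

def local_minimizer(theta, x0=1.0, iters=50):
    """Newton's method from x0, standing in for a blackbox nonconvex solver."""
    x = x0
    for _ in range(iters):
        x -= grad_x(x, theta) / hess_xx(x)
    return x

def solution_gradient(x_star):
    """Gradient of the local quadratic model's minimizer with respect to theta.

    For this f the mixed derivative d2f/(dx dtheta) equals 1, so the
    result is -1 / hess_xx(x_star).
    """
    curvature = hess_xx(x_star)
    assert curvature > 0.0, "x_star must be a strict local minimum"
    return -1.0 / curvature

theta = 0.3
x_star = local_minimizer(theta)
analytic = solution_gradient(x_star)

# Finite-difference check: re-solve from the same start for nearby theta.
eps = 1e-5
numeric = (local_minimizer(theta + eps) - local_minimizer(theta - eps)) / (2 * eps)
```

The analytic and finite-difference values agree because the solver stays in the same locally convex basin for nearby $\theta$; with active constraints, the same role is played by differentiating the KKT conditions of the quadratic program.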

5 Experiments

Bibliography


[1] Abbasi, Y. D.; Short, M.; Sinha, A.; Sintov, N.; Zhang, C.; and Tambe, M. 2015. Human adversaries in opportunistic crime security games: evaluating competing bounded rationality models. In Proc. of Advances in Cognitive Systems.
[2] Abbasi, Y. D.; Ben-Asher, N.; Gonzalez, C.; Morrison, D.; Sintov, N.; and Tambe, M. 2016. Adversaries wising up: Modeling heterogeneity and dynamics of behavior. In Proc. of Intl. Conf. on Cognitive Modeling.
[3] Amos, B., and Kolter, J. Z. 2017. OptNet: Differentiable optimization as a layer in neural networks. In ICML-17.
[4] Basilico, N.; De Nittis, G.; and Gatti, N. 2017. Adversarial patrolling with spatially uncertain alarm signals. AIJ 246:220–257.
[5] Basilico, N.; Gatti, N.; and Amigoni, F. 2012. Patrolling security games: Definition and algorithms for solving large instances with single patroller and single intruder. AIJ 184:78–123.
[6] Bengio, Y. 1997. Using a financial training criterion rather than a prediction criterion. Intl. J. of Neural Systems 8:433–443.
[7] Blum, A.; Haghtalab, N.; and Procaccia, A. D. 2017. Learning to Play Stackelberg Security Games. Cambridge University Press. 604–626.
[8] Cermak, J.; Bosansky, B.; Durkota, K.; Lisy, V.; and Kiekintveld, C. 2016. Using correlated strategies for computing Stackelberg equilibria in extensive-form games. In AAAI-16.
