Worst-Case Analysis for a Leader-follower Partially Observable Stochastic Game
Yanling Chang, Chelsea C. White III

TL;DR
This paper introduces a worst-case analysis framework for leader-follower partially observable stochastic games, providing a method to compute lower bounds on the leader's value function under uncertainty and limited knowledge.
Contribution
It develops a novel worst-case analysis approach for leader-follower stochastic games with partial observability, including a computational method for lower bound estimation.
Findings
Provides a new solution procedure for the leader's value function.
Demonstrates the approach on a liquid egg production security problem.
Offers insights into decision-making under uncertainty with limited information.
Taxonomy
Topics: Game Theory and Applications · Simulation Techniques and Applications · Complex Systems and Decision Making
Worst-Case Analysis for a Leader-follower Partially Observable Stochastic Game
Yanling Chang
Department of Engineering Technology & Industrial Distribution
Department of Industrial Systems and Engineering,
Texas A&M University, College Station, TX 77843
Chelsea C. White III
H. Milton Stewart School of Industrial & Systems Engineering,
Georgia Institute of Technology, Atlanta, GA 30318,
Abstract
Partially observable stochastic games provide a rich mathematical paradigm for modeling multi-agent dynamic decision making under uncertainty and partial information. However, they generally do not admit closed-form solutions and are notoriously difficult to solve. Also, in reality, each agent often does not have complete knowledge of the other agent. This paper studies a leader-follower partially observable stochastic game where the leader has little knowledge of the adversarial follower’s reward structure, level of rationality, and process for gathering and transmitting data relevant for decision making. We introduce the worst-case analysis to the partially observable stochastic game to cope with this lack of knowledge and determine the best worst-case value function of the leader. The resulting problem from the leader’s perspective has a simple sufficient statistic; however, different from a classical partially observable Markov decision process, the value function of the resulting problem may not be convex. We design a viable and computationally attractive solution procedure for computing a lower bound of the leader’s value function as well as its associated control policy in the finite planning horizon. We illustrate the use of the proposed approach in a liquid egg production security problem.
Keywords: worst-case analysis; partially observable Markov decision process; partially observable stochastic game.
1 Introduction
Stochastic games are classical dynamic game models, where the state of the system evolves on the basis of the current state and actions taken by all agents. However, these models do not consider the fact that in reality, an agent in a multi-agent scenario is likely to have only partial and noise-corrupted data about other agents. Partially observable stochastic games (POSGs) generalize stochastic games by taking into consideration this fact, and provide a rich normative framework for multiple intelligent agents with distinct objectives to dynamically control the operation of a system under uncertainty and partial information. Unfortunately, while game theorists have heavily studied Bayesian games and stochastic games, the literature on POSGs is relatively sparse. POSGs generally do not admit closed-form solutions and suffer from significant computational challenges. Tractable algorithms for computing control policies for general-sum POSGs are still rare, and both the Artificial Intelligence (AI) and Operations Research (OR) communities have focused on POSGs with special structures including zero-sum POSGs (Ghosh et al. 2004; Saha 2014) and common-payoff POSGs (Seuken and Zilberstein 2005; Oliehoek and Amato 2016) over the past decades.
A leader-follower POSG introduced in Chang et al. (2015a, 2015b) is a new general-sum POSG, where each agent knows its own state but has only possibly inaccurate or incomplete observations of the other agent’s state. At the beginning of the game, the leader selects its policy first, and then the follower determines its best response policy with complete knowledge of the leader’s policy (e.g., adversaries spend time learning defenders’ policies (Information Operations, 2014)). At each stage, the selected policy pair determines actions for each agent simultaneously, and then the system transitions to a new state and each agent receives a new observation. Each agent’s policy selects actions to achieve its objective, based on the entire history of its own current and past observations, states, and actions.
The (single-period) leader-follower game, also called the Stackelberg game, has wide applications and has been successfully implemented in many real security problems. A main focus of these applications is to provide decision support to the defender (commonly modeled as the leader) who is protecting a set of critical nodes against adversaries. Such examples include the placement of checkpoints and canine units at Los Angeles International Airport (Jain et al. 2010) and the scheduling of patrols in the Port of Boston (An et al. 2014). However, in the presence of intelligent agents who can respond and adjust their actions in ever-changing environments over time, many of these problems also have a dynamic nature that cannot be fully addressed by single-period games (e.g., an adversary may choose another target if the current target is well protected). The leader-follower POSG is able to explicitly model the dynamic interaction between agents and determine dynamic defense policies for the leader that promptly adjust defensive resource allocation over time for protection, given all available real-time data. This development is consistent with the emerging “Moving Target Defense” belief that nowadays it is impossible to attain perfect security and the aim should be to develop dynamic systems that are defensible rather than perfectly secure (Department of Homeland Security 2015).
Game-theoretic approaches commonly assume that each agent has complete knowledge of both agents’ reward structures, dynamics, and data sensing and transmission systems and that each agent’s objective is to maximize its expected reward criterion. However, these assumptions may not be realistic in many scenarios (Camerer 2011). The intent of an adversary can span a wide range of possibly unknown issues (Bier et al. 2007), perfect rationality is often an unlikely human behavior (March 1978), and action selection may be affected by a variety of issues, such as task complexity, the interplay between emotion and cognition, etc. (Conlisk 1996).
In this paper, we consider a leader-follower general-sum POSG where: (i) the objective of the adversarial follower is unknown to the leader; (ii) at each decision epoch, it is unclear what information (or observations) the follower has collected and how the follower will make its decision; (iii) the follower can be irrational; and (iv) the state of each agent cannot be precisely observed by the other agent. This is realistic in many situations, for example, when the leader is facing a new, unknown adversary. The intent of this research is to determine the best worst-case value function and the corresponding control strategy for the leader under these circumstances.
The worst-case analysis is a popular approach in single-period games to cope with the uncertainty of an agent’s behavior towards others. For example, Gilboa and Schmeidler (1989), Lo (1996), and Marinacci (2000) examined normal form games where an agent’s action is not exactly known by the other agent. Aghassi and Bertsimas (2006) and Yolmeh and Baykal-Gursoy (2017) used this approach to contend with the payoff uncertainty. Simchi-Levi and Wei (2015) and Caprara et al. (2016) also determined performance benchmarks in one-shot security planning via the same approach. The benchmarks from the worst-case analysis were further used to reveal the value of improved understanding of the behavior of adversaries in single-period security applications (Nguyen et al. 2013). Kardes et al. (2011) introduced robust optimization to stochastic games where reward structure and/or transition probabilities are uncertain. However, the related analysis has not been examined for general-sum POSGs. This research introduces the worst-case analysis to the leader-follower POSG to reduce this gap in the literature.
Contributions of this paper are summarized as follows.
- (i) We introduce the worst-case analysis to a leader-follower POSG to consider the case where the leader has little knowledge of the adversarial follower. This is a first step for further evaluating the value of an improved understanding of the adversarial follower in a dynamic, multi-agent partially observable stochastic system.
- (ii) (Theoretically) We show that the POSG under the worst-case analysis is a single-agent dynamic decision-making problem based solely on information of the leader. This model is distinct from the zero-sum POSG and has a sufficient statistic more computationally tractable than the existing sufficient statistics presented in the POSG literature. We investigate the structural properties of the leader’s optimal value function and show that it may not be convex.
- (iii) (Computationally) The non-convex structural result limits the usefulness of existing partially observable Markov decision process (POMDP) algorithms in our problem. While the worst-case model can also be viewed as a POMDP with imprecise parameters under the criterion of “maxmin”, currently there are no general algorithms for these problems. In order to establish a bottom-line performance for the leader, we develop a novel backward recursive algorithm to construct a lower bound for the leader’s finite-horizon value function and to determine its associated policy. We evaluate the quality of the solution and show that the lower bound is no worse than the value function associated with the leader’s second-best action. We also show that this algorithm can approximately determine a lower bound for the infinite planning horizon problem.
- (iv) (Application wise) We illustrate the use of the proposed model and solution procedure via a security problem, where the operations manager of a liquid egg production plant is protecting the facility against an adversary who intends to insert a biological toxin into the system. We test and validate the effectiveness of the developed dynamic defensive strategy using simulation.
This paper is organized as follows. Section 2 presents a literature review on stochastic games with incomplete information, POSGs, and POMDPs with imprecise parameters. In Section 3, we introduce the worst-case analysis model for the general-sum leader-follower POSG, where the objective of the leader is to maximize the expected total discounted reward under the worst-case scenario of the follower. Section 4 presents the structural results of the leader’s optimal value function, followed by their computational implications. In Section 5, we propose a three-step solution procedure for constructing a lower bound of the leader’s value function and its policy, consisting of the PURGE-step, the DOMINANCE-step, and the APPROXIMATION-step. We discuss each of these steps in detail in Sections 6-8. Specifically, Section 6 utilizes a POMDP algorithm to eliminate redundant vectors in constructing the value function for a given leader action; Section 7 presents a geometric approach and a mixed integer program to determine the optimal value function; and Section 8 approximates the resulting value function by a piecewise linear and concave function. The approximate solution is used in the next iteration of the recursive algorithm. We also analyze the error bound of this approach and show that our algorithm can be used to construct a lower bound for the leader’s infinite-horizon optimal value function. Section 9 illustrates the use of the developed approach in a security application. Finally, Section 10 summarizes research results and discusses future research directions.
2 Literature Review
This section briefly reviews stochastic games with incomplete information, partially observable stochastic games, and the POMDP with imprecise parameters.
2.1 Stochastic Games with Incomplete Information
Stochastic games were first developed in Shapley (1953). Afterwards, many extensions were developed to consider the incomplete information case where the reward structure and/or transition probabilities are imprecise. For instance, stochastic games with a single non-absorbing state where the payoff follows a given probability distribution were examined by Sorin (1984, 1985). Najim et al. (2001) studied the optimization of the limiting average payoff of a zero-sum stochastic game with unknown transition probabilities and average payoffs. Luque-Vasquez and Minjarez-Sosa (2013) and Minjarez-Sosa and Vega-Amaya (2009) considered a zero-sum stochastic game where the payoffs are possibly unbounded. Cheng et al. (2016) developed two approximation methods to solve a two-person zero-sum stochastic game where the payoff matrix entries are independent and normally distributed. Kardes et al. (2011), Kardes (2014) and Rosenberg et al. (2004) analyzed equilibrium points for stochastic games with uncertain payoffs and/or transition probabilities. In all these studies, the state of the system is perfectly observable to each agent.
2.2 Partially Observable Stochastic Games
POSGs are stochastic games with imperfect information where the state of the system is partially observed. A POSG can be transformed to a normal-form game and theoretically solved by iterated elimination of dominated strategies; however, this representation is often too large and not computationally feasible (Hansen et al. 2004). Also, it is impossible to transform POSGs into completely observable stochastic games over belief states, analogous to how a POMDP is solved by transforming it into an MDP over belief states (Hansen et al. 2004). At each stage, each agent will receive a unique observation, leading to various different, possibly conflicting belief states (Emery-Montemerlo et al. 2004). Moreover, in a multi-agent system, each agent must also consider other agents’ beliefs to reason about their actions. Even worse, each agent must reason about the beliefs that other agents hold about each other’s beliefs, leading to infinitely nested beliefs. As a result, POSGs suffer significant computational challenges: finding an optimal solution or even computing solutions with absolutely bounded error to common-payoff POSGs is NEXP-complete (Bernstein et al. 2002; Rabinovich et al. 2012); Goldsmith and Mundhenk (2008) further showed that the complexity of competitive POSGs can rise to NEXP^NP (problems solvable by a NEXP machine using an NP set as an oracle).
There are no known tractable algorithms for computing optimal (or even reasonable) policies for general-sum POSGs. Nevertheless, the AI community has made tremendous progress for special classes of POSGs. For example, for two-agent zero-sum POSGs, Ghosh et al. (2004) and Saha (2014) showed that the POSG can be transformed to a stochastic game if both agents share a single observation process. Wiggers et al. (2016) analyzed the structural properties of value functions for zero-sum POSGs. However, algorithms for determining optimal policies for zero-sum POSGs are still rare. For common-payoff POSGs, also called decentralized POMDPs, numerous exact and approximation algorithms have been developed (see reviews in Seuken and Zilberstein 2005; Oliehoek 2012; Oliehoek and Amato 2016).
The leader-follower POSG in Chang et al. (2015a, 2015b) is a general-sum POSG where the leader-follower relationship makes the POSG both theoretically and computationally attractive. Assuming each agent uses finite-memory policies and has complete knowledge of the game setup, there is a finite-dimensional sufficient statistic (the belief state) that consolidates all available information history for each agent to make a decision. The belief state is not defined on the state space; rather, it is an agent’s belief over the set of all possible finite information histories of the other agent. Consequently, an agent can infer via its belief state both the system’s state and the other agent’s action. The existence of the sufficient statistics further allows for transforming the general-sum POSG to special structured POMDPs. This model is applied to assess the value of misinformation and disinformation in modern warfare (Chang et al. 2019). While the existing work assumes that each agent has complete knowledge of the game setup (e.g., reward structure, perfect rationality), this paper examines the case where the leader has limited knowledge of the follower (i.e., POSG with incomplete information).
Another related model is the Interactive POMDP (I-POMDP) of Gmytrasiewicz and Doshi (2004), another generalization of POMDPs to multi-agent systems. In this framework, agents are described by a class of possible models, and these agent models are included in the definition of “interactive states”. An agent’s belief over these interactive states is a sufficient statistic. However, an agent’s belief is also a component of the other agents’ models. Thus, these beliefs are infinitely nested, making the problem computationally complex. Existing approximation solution techniques include policy iteration (Sonu and Doshi 2012), point based value iteration (Doshi and Perez 2008), and interactive particle filtering (Doshi and Gmytrasiewicz 2009).
2.3 Partially Observable Markov Decision Processes with Imprecise Parameters
POMDPs with imprecise parameters were analyzed in Itoh and Nakamura (2007) under the notion of “second-order beliefs”, which are beliefs in the imprecisely specified transition probabilities and observation probabilities. The authors determined a set of optimal policies, each of which is optimal for at least one such second-order belief. Saghafian (2018) examined the structural results of ambiguous POMDPs using “α-maxmin” expected utility. Within the “maxmin” criterion, Osogami (2015) proved that the value function can still be convex using the Loomis’ Minimax Theorem, given the uncertainty set of the POMDP model parameters is convex. Under the assumption of S-rectangularity, Rasouli and Saghafian (2018) defined a robust POMDP, developed dynamic programming equations for it, and showed a zero-sum POSG can be transformed to a robust POMDP. However, to the best of our knowledge, there are no general algorithms for POMDPs with imprecise parameters yet (under the criterion of “maxmin”). We further remark that none of these assumptions necessarily holds in our problem (especially when facing unknown adversaries in a security context) and we study general-sum leader-follower POSGs.
3 The Worst-Case Analysis Modeling
We consider a partially observable stochastic game with a leader ($L$) and a follower ($F$). At each stage, the leader first selects an action, and then the follower determines its action. Once the two actions are selected, the reward for each agent is realized and the system transitions to a new state according to a given probability. However, the leader’s information on the adversarial follower is very limited. Specifically,
- (i) Decision Horizon: The decision epochs are where .
- (ii) State Space: the leader’s state and the follower’s state . The state spaces and are assumed to be finite. Each agent knows its own state but may have only inaccurate observations of the other agent’s state.
- (iii) Action Space: at each stage, the leader selects and the follower selects its (true) action , assuming the action spaces and are finite. How the follower selects its action is unknown to the leader, and the follower is possibly irrational.
- (iv) Observation Space: at each stage, the leader receives noisy observation of the follower’s state , where the leader’s observation space is finite. Which observations the follower may collect is unclear to the leader.
- (v) The State Transition and Observation Probabilities: The conditional probability for the leader is assumed given. The follower’s observation probabilities are unknown.
- (vi) Reward Structure: the leader’s scalar reward is , given the state pair and action pair ; the follower’s reward is unknown.
The problem objective is to determine the best worst-case value function for the leader and its associated policy, given the above assumptions.
Because the follower only chooses after the selection of the leader’s action and may not be observable, the worst-case analysis assumes that at each stage , the leader
- (i) first predicts the follower’s worst response action for each (note, );
- (ii) determines its best leader’s action (see Figure 1).
Thus, at time , the leader’s information history is , where is the leader’s prior probability mass vector over and . Note that the “predicted” follower’s action is also a decision variable of the leader, same as , and .
The criterion we consider is the expected total discounted reward accrued over horizon . Namely, for the finite horizon case and for the infinite horizon case, where is the expectation operator conditioned on , and is the discount factor. We assume for the infinite horizon case in order to ensure that is well defined. The problem objective is to determine a policy pair for the leader such that
[TABLE]
where is the policy space of agent .
The POSG under the worst-case scenario results in a single-agent dynamic decision making problem: the leader determines both and the worst-case action on the basis of . Thus, this model is fundamentally different from a zero-sum POSG (which itself is a challenging problem; see Section 2). In zero-sum POSGs, agent makes its decision based on its own private information history . That is, assuming the follower is rational in the zero-sum POSG, the follower will select its (true) action on the basis of , whereas in the worst-case analysis, the leader has no knowledge of and “predicts” the worst-case action based on the leader’s knowledge . Secondly, the follower may not be perfectly rational in our worst-case model.
Furthermore, knowing the exact information history that an adversary has is hard: can be or other possibly unknown forms. It is also unclear how the follower will utilize to make its decision (e.g., the follower is myopic or irrational). The worst-case analysis requires no knowledge of (i) the follower’s private information history , and (ii) how the follower selects its action . Instead, the worst-case modeling analyzes from the leader’s perspective; namely, at epoch , the leader predicts the follower’s worst response action and selects its best action all based on its own information history . As a result, Eq. (1) is the baseline performance of the leader; i.e., for each given , which is exactly the objective of the “worst-case analysis” of this paper.
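The two-step reasoning described in this section (predict the follower's worst response for each leader action, then pick the best leader action) can be sketched for a single stage as follows. This is only a one-period illustration, not the paper's recursive procedure: the leader's own state is held fixed, and the function name, the reward table `reward[a_l][a_f][s_f]`, and all numbers are hypothetical.

```python
def worst_case_choice(belief, reward):
    """One-stage max-min choice for the leader.

    belief[s_f]           : leader's probability that the follower is in state s_f
    reward[a_l][a_f][s_f] : leader's reward (hypothetical illustrative numbers)
    Returns (value, leader_action, predicted_follower_action).
    """
    best = None
    for a_l, rows in enumerate(reward):
        # Expected leader reward for each follower action under the belief.
        expected = [sum(p * r for p, r in zip(belief, row)) for row in rows]
        # Step (i): predict the follower action that hurts the leader most.
        a_f = min(range(len(expected)), key=expected.__getitem__)
        # Step (ii): keep the leader action with the best worst case.
        if best is None or expected[a_f] > best[0]:
            best = (expected[a_f], a_l, a_f)
    return best


reward = [
    [[4.0, 0.0], [1.0, 3.0]],  # leader action 0
    [[2.0, 2.0], [3.0, 1.0]],  # leader action 1
]
value, a_l, a_f = worst_case_choice([0.75, 0.25], reward)
```

Note that both the leader action and the predicted follower action are computed from the leader's information alone, which is the sense in which the follower's "predicted" action is a decision variable of the leader.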
4 Structural Results
Let be the maximal value of the worst-case expected total discounted reward to be accrued from epoch until , given information history , then can be described recursively by
[TABLE]
Let , where . Thus, is a “belief” array indicating the leader’s inference about the follower’s state . Furthermore, the leader’s belief process is a controlled Markov process as there is a function depending on and such that
[TABLE]
where
[TABLE]
Denote the bottom of Eq. (4) by
[TABLE]
assuming it is non-zero. Let be the set of all bounded, real-valued functions on having supremum norm , where . Then is a Banach space. We say a real-valued function for a fixed is piecewise linear on if there exists a set depending on , such that: , there is a satisfying , where . Now, define the operator as
[TABLE]
where .
Proposition 1.
[TABLE]
Thus, is dependent on only through , and is a sufficient statistic for . Furthermore, if is piecewise linear in , then is also piecewise linear in .
Proof.
The first part follows by mathematical induction and the fact that both and are functions of . For the second part, assume is piecewise linear. Equivalently, there exists a finite set such that
[TABLE]
where the function defines the index of the vector corresponding to . We say , if is of the form
[TABLE]
where . Note,
[TABLE]
where . Hence, is piecewise linear in for . ∎
We remark that this sufficient statistic is more computationally tractable than other sufficient statistics presented in the POSG literature. In a multi-agent system, not only do agents have to infer the underlying system state, but also must consider other agents’ beliefs in order to infer the other agents’ actions. As a result, the existing sufficient statistics are often complex and in high dimensions: they usually include a belief over the other agent’s (finite) information history where under the finite-memory assumption (Chang et al. 2015a), or a belief over the other agent’s complete model (Gmytrasiewicz and Doshi 2004). The worst-case modeling requires no such assumptions and the leader “predicts” the follower’s worst-case action based on its own knowledge, resulting in a much simpler sufficient statistic and a more computationally attractive problem.
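To make the role of the belief as a sufficient statistic concrete, the sketch below performs a generic Bayesian belief update over the follower's state. It is an assumption-laden illustration rather than the paper's Eq. (4): the argument `kernel[s_f][s_f_next]` is a hypothetical table holding the joint probability of the realized leader observation, the leader's next state, and the follower's next state, given the current states and the action pair. The returned `sigma` plays the role of the normalizing denominator, which must be non-zero.

```python
def belief_update(belief, kernel):
    """Generic Bayes update of the leader's belief over the follower's state.

    belief[s_f]             : current belief
    kernel[s_f][s_f_next]   : hypothetical joint probability of the realized
                              observation/next leader state and s_f_next,
                              given s_f and the chosen action pair
    Returns (new_belief, sigma), where sigma is the normalizing constant.
    """
    n_next = len(kernel[0])
    unnormalized = [
        sum(belief[s] * kernel[s][t] for s in range(len(belief)))
        for t in range(n_next)
    ]
    sigma = sum(unnormalized)
    if sigma <= 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / sigma for u in unnormalized], sigma


new_belief, sigma = belief_update([0.5, 0.5], [[0.2, 0.1], [0.1, 0.4]])
```

The update uses only quantities available to the leader, so no belief over the follower's information history or model is required.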
The worst-case analysis transforms a general-sum POSG to a single-agent problem, and hence it is relatively simple compared to multi-agent problems. Further, Eq. (6) is very similar to a POMDP with the newly defined and , suggesting that solution procedures for the POMDP may be relevant to our problem. Unfortunately, existing POMDP algorithms rest on the fact that the value function of a POMDP is both piecewise linear and convex (Sondik 1971; Cheng 1988; Pineau et al. 2003; Shani et al. 2013). While Proposition 1 guarantees that the operator in Eq. (6) preserves piecewise linearity, convexity will not be preserved, as illustrated by the following example. The non-convexity issue significantly limits the usefulness of solution procedures for the POMDP, and also makes our problem distinct from the POMDP.
Consider and let . Fix a , then where (black line), (blue line), and is the leader’s belief vector over the follower’s state, . Clearly, is non-convex in (red line). See Fig. 2.
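A numerical check of this phenomenon, using hypothetical vectors rather than the ones in the example above, is sketched below for a two-state belief. Each leader action contributes a lower envelope (a minimum of linear functions, hence concave), and the maximum over leader actions of such envelopes is in general neither convex nor concave.

```python
def lower_envelope(x, vectors):
    """min over vectors of x*g0 + (1-x)*g1, where x = P(follower state 0)."""
    return min(x * g0 + (1.0 - x) * g1 for g0, g1 in vectors)


def max_min_value(x, sets):
    """max over leader actions of the lower envelope of that action's vectors."""
    return max(lower_envelope(x, vecs) for vecs in sets)


# Hypothetical vector sets, one per leader action.
sets = [
    [(0.0, 4.0), (4.0, 0.0)],  # leader action 0: a "tent" peaking at x = 0.5
    [(1.0, 1.0)],              # leader action 1: constant value 1
]

v = lambda x: max_min_value(x, sets)

# Convexity would require v(0.5) <= (v(0.25) + v(0.75)) / 2.
violates_convexity = v(0.5) > 0.5 * (v(0.25) + v(0.75))
# Concavity would require v(0.25) >= (v(0.0) + v(0.5)) / 2.
violates_concavity = v(0.25) < 0.5 * (v(0.0) + v(0.5))
```

Both flags evaluate to true for these numbers, so neither convex (POMDP-style) nor concave representations apply directly to the max-min value function.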
For the infinite planning horizon, it is easy to show that
Proposition 2.
, the operator is a contraction mapping on having modulus .
Proof.
Let , fix and assume , then
[TABLE]
For any , let $a^{F,*}\in\arg\min\bigg[x^{L}r^{L}(s^{L},a)+\beta\sum_{z^{L'}}\sum_{s^{L'}}\sigma^{L}(z^{L'},s^{L'},s^{L},x^{L},a)\,v(s^{L'},\lambda^{L}(z^{L'},s^{L'},s^{L},x^{L},a))\bigg]$; then
[TABLE]
Thus, . Repeating this argument in the case that shows that . Taking the supremum over and gives ∎
As a result, there is a unique fixed point such that . Let the sequence be defined as , then , given . Moreover, is continuous. However, it is not guaranteed that is convex or concave.
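The contraction property and the resulting fixed-point iteration can be illustrated on a deliberately simplified stand-in. The sketch below replaces the belief space with a small, fully observed state set (an assumption made only to keep the example short; it is not the paper's operator), and applies a max-min Bellman operator with discount factor `BETA`. All rewards and transition probabilities are hypothetical.

```python
BETA = 0.9

# Hypothetical data: reward[s][a_l][a_f] and trans[s][a_l][a_f][s_next].
reward = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 2.0], [1.0, 0.0]],
]
trans = [
    [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]],
    [[[0.3, 0.7], [0.6, 0.4]], [[0.1, 0.9], [0.8, 0.2]]],
]


def apply_operator(v):
    """Max over leader actions of min over follower actions (toy, fully observed)."""
    out = []
    for s in range(len(v)):
        out.append(max(
            min(
                reward[s][a_l][a_f]
                + BETA * sum(p * vn for p, vn in zip(trans[s][a_l][a_f], v))
                for a_f in range(2)
            )
            for a_l in range(2)
        ))
    return out


def sup_dist(u, w):
    """Supremum-norm distance between two value functions."""
    return max(abs(a - b) for a, b in zip(u, w))


# Successive approximation: v_{n+1} = H v_n, stopped when changes are tiny.
v = [0.0, 0.0]
for _ in range(500):
    nxt = apply_operator(v)
    if sup_dist(nxt, v) < 1e-10:
        break
    v = nxt
```

For any two value functions `u` and `w`, `sup_dist(apply_operator(u), apply_operator(w))` is at most `BETA * sup_dist(u, w)`, which is the modulus-`BETA` contraction used to obtain a unique fixed point.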
We remark that the worst-case model can also be viewed as a POMDP with imprecise parameters, where the leader’s reward and dynamics are imprecise due to the unknown . However, to our best knowledge, general algorithms do not exist for such problems with the “maxmin” criterion (the existing literature focuses on imposing conditions to avoid the non-convexity issue; see Section 2). In this paper, we propose a novel computationally attractive solution procedure without these conditions that could generate a lower bound for the finite-horizon value function with quantified error bound, and determine its associated control policy. Research on the infinite horizon case is a topic for future research.
5 Lower Bound Solution Approach and Policy Determination
We motivate the lower bound solution approach by rewriting Eq. (6) as two steps: the “min” step and the “max” step. Specifically, , let the operator be
[TABLE]
then,
[TABLE]
For the “min” step, assume is a piecewise linear and concave approximation of satisfying (i.e., ). Namely, there is a finite set such that . Then, pick any ,
[TABLE]
where if . Thus, is piecewise linear and concave, and the “min” step is a POMDP step.
For the “max” step, since , we have the following:
Theorem 1.
There is a finite set of arrays, depending only on , such that
[TABLE]
Proof.
See the paragraphs above the theorem. ∎
Policy Determination: Each element of is associated with a pair of actions . Denote . Each set corresponds to a leader’s action , and each vector is associated with a follower’s worst-case action . The best worst-case policy for can thus be determined by the following steps:
- (i) determine ;
- (ii) for each , find in , and let ;
- (iii) determine in ;
- (iv) select the action pair associated with .
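The policy determination steps above can be sketched as a lookup over tagged vectors. The data layout is an assumption made for illustration: `vector_sets` maps each leader action to a list of `(vector, follower_action)` pairs, and the action labels and numbers are hypothetical.

```python
def best_worst_case_action(belief, vector_sets):
    """Pick the action pair attaining max over leader actions of min over vectors.

    belief                : leader's belief over the follower's state
    vector_sets[a_l]      : list of (vector, a_f) pairs for leader action a_l
    Returns (value, leader_action, worst_case_follower_action).
    """
    best = None
    for a_l, tagged in vector_sets.items():
        # For this leader action, find the vector with the smallest inner
        # product with the belief; its tag is the worst-case follower action.
        scored = [
            (sum(p * g for p, g in zip(belief, vec)), a_f)
            for vec, a_f in tagged
        ]
        worst_value, worst_a_f = min(scored, key=lambda t: t[0])
        # Across leader actions, keep the largest of these minima.
        if best is None or worst_value > best[0]:
            best = (worst_value, a_l, worst_a_f)
    return best


vector_sets = {
    "aL0": [((0.0, 4.0), "aF0"), ((4.0, 0.0), "aF1")],
    "aL1": [((1.5, 1.5), "aF2")],
}
choice_a = best_worst_case_action([0.25, 0.75], vector_sets)
choice_b = best_worst_case_action([0.4, 0.6], vector_sets)
```

Because the selected action pair can change with the belief, as it does between the two calls here, the resulting policy is a function of the leader's current belief rather than a fixed action.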
In a nutshell, at time t, if there is a being a piecewise linear concave approximation of , , we can determine and by POMDP algorithms and Theorem 1, respectively. Moreover,
Proposition 3.
If , then .
Proof.
This is due to the definition of and . ∎
As is again non-convex, we then need to find another to best approximate and repeat the process for time . Fig. 3 presents a three-step procedure to implement this idea for the finite planning horizon problem.
Specifically, given being a piecewise linear concave approximation of , is a POMDP step for . As in all POMDP problems, the set may contain many redundant -vectors never used in determining . The PURGE-step is to remove all redundant -vectors in each , which can be accomplished by the PURGE operator in the POMDP literature. According to Theorem 1 and , we could set where ; however, the resulting set can be very large. The DOMINANCE-step is to determine the set that has the smallest cardinality and satisfies for computational advantage. The DOMINANCE operator is designed via computational geometry and mixed integer programs (MIPs). Because the resulting is again non-convex, the APPROXIMATION-step approximates with quantified error by a piecewise linear and concave function satisfying for the next iteration.
Performing all required operations and approximation, we have developed a backward recursive algorithm for determining a lower bound of the leader’s best worst-case value function and its associated control policy for the finite-horizon POSG. The pseudocode of the entire procedure is summarized in Algorithm 1. The rest of the paper presents each step in more detail.
6 PURGE Operation
Assume a piecewise linear and concave function satisfying is given. That is, there is a set for each where . For any , we have shown that where
[TABLE]
A large number of -vectors could be generated in this step; however, only a small number of these vectors define . A is redundant if and only if for all , there is a such that ; a is referred to as a defining vector for if there is a belief point such that , and such belief points are called witness points (Lin et al. 2004). The PURGE operators of Cassandra et al. (1997), Lin et al. (2004), and many others in the POMDP literature can be employed to efficiently remove redundant vectors from . We adopt the PURGE operator of Lin et al. (2004) in our illustrative example.
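The idea of keeping only defining vectors can be illustrated in the special case of two states, where a belief is a single number p in [0, 1] and each vector is a line. The sketch below is not the PURGE operator of Lin et al. (2004), which relies on linear programs for general belief spaces. It assumes the lower envelope (pointwise minimum) is the function of interest, and the helper names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]


def value(gamma: Vec, p: float) -> float:
    """Value of a vector at belief (1 - p, p)."""
    return gamma[0] * (1.0 - p) + gamma[1] * p


def breakpoints(vectors: Sequence[Vec]) -> List[float]:
    """Endpoints of [0, 1] plus all interior pairwise intersections."""
    pts = {0.0, 1.0}
    for g, h in combinations(vectors, 2):
        denom = (g[1] - g[0]) - (h[1] - h[0])
        if abs(denom) > 1e-12:  # skip parallel lines
            p = (h[0] - g[0]) / denom
            if 0.0 < p < 1.0:
                pts.add(p)
    return sorted(pts)


def purge(vectors: Sequence[Vec]) -> List[Vec]:
    """Keep vectors that attain the lower envelope on some interval."""
    pts = breakpoints(vectors)
    keep: List[Vec] = []
    for left, right in zip(pts, pts[1:]):
        mid = 0.5 * (left + right)  # a witness point for this interval
        best = min(vectors, key=lambda g: value(g, mid))
        if best not in keep:
            keep.append(best)
    return keep


print(purge([(0.0, 2.0), (2.0, 0.0), (1.5, 1.5), (0.5, 0.5)]))
```

Between two consecutive breakpoints no pair of lines crosses, so the minimiser at the midpoint is the minimiser on the whole interval; the midpoints therefore serve as witness points.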
7 DOMINANCE Operation
We now determine where by extending the notion of redundancy of a -vector to a set. For a given , a set is dominated by on if and only if , there is always a set in such that where ; a set is referred to as supporting if there is at least one belief point such that . For example, both sets and in Fig. 4 are dominated sets, while sets and are supporting for . We seek to remove all dominated sets in order to define efficiently.
Let the DOMINANCE operator be that only contains supporting sets of . We present a two-step procedure for the DOMINANCE operator. We say a set is pair-wise dominated by a set if and only if , . For example, set is pair-wise dominated by the set in Fig. 4.
The first step is to build a superset , where for any in , there is no in pairwise dominating . We show that the pairwise dominance relationship between two sets in can be determined efficiently by a sequence of linear programs (LPs), employing a dual relationship between hyperplanes and points. However, a set in could still be a dominated set, such as in Fig. 4. The second step employs a MIP to further remove all dominated sets in . The purpose of the first step is to substantially decrease the number and the size of the MIPs encountered in the second step.
7.1 Determine the Superset
In computational geometry, the dual of a hyperplane in the primal space is the point , and the dual of a point is a hyperplane . The lower envelope of a given set of hyperplanes is the piecewise linear and concave function , whereas the upper envelope of the given set of hyperplanes is the piecewise linear and convex function . Each of the hyperplanes on the upper envelope in the primal space corresponds to a vertex of the upper convex hull (with respect to the -axis) in the dual space (de Berg et al. 2008; Zhang 2010). We need the following definitions in Zhang (2010) to proceed.
Given a set of points, the convex hull is the set . The surface of the convex hull with negative outernormal directions, the negative convex hull (), is the set , where , and is the closure of . Then,
Lemma 1.
Suppose that is closed and bounded, hence compact. For any given and , the piecewise linear and concave function , is dual to the set . Namely, for any , there exists a such that , and conversely, for any , there is a such that .
Proof.
The result follows from the proof of Lemma 1 in Zhang (2010). ∎
We now determine whether a set is pairwise dominated by a set based on the geometric relationship. Without loss of generality, assume is given and there is such that . Pick any and define the set . Proposition 4 shows that determining the dominance relationship is equivalent to checking whether is empty for every . The detailed pseudocode is summarized in Algorithm 2.
Proposition 4.
 is dominated by if and only if for each , the set is non-empty.
Proof.
Assume for each , is non-empty. Equivalently, , there is such that . Pick and let . Thus, . Lemma 1 guarantees that there is a satisfying . Thus, and the set is dominated by the set .
Conversely, let be the -vector such that . Equivalently, . As every -vector in is a defining vector for , there is such that . Lemma 1 further guarantees that there is a satisfying . Thus, . Since by assumption, both functions are continuous, and is connected, the two functions intersect over . ∎
We remark that we can also determine the dominance on primal space by checking if in the dual space, given . See Proposition 5. Conversely, however, if and intersect on , could be empty in the dual space. A counterexample is given in Fig. 5.
Proposition 5.
, if , then and intersect on .
Proof.
Pick . Let such that per Lemma 1. The definition of and Lemma 1 guarantee . The result follows by the assumption that . ∎
Determining the superset on the basis of pairwise dominance needs to consider each pair of actions (the pseudocode is summarized in Algorithm 3). For each , the program initializes with the set , where and is randomly generated. Let , be the number of sets in , and be the number of -vectors in . The algorithm updates , and as follows. For each candidate set , the algorithm compares it with the existing sets in . If is pairwise dominated by an existing set , then will not be considered; otherwise, will be included in and any existing sets that are dominated by will be eliminated from . Meanwhile, and are updated accordingly.
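The incremental update described above can be sketched as follows. The pairwise dominance test is passed in as a predicate; here it is approximated by comparing lower envelopes on a finite grid of belief points, which is a stand-in chosen for brevity and can wrongly report dominance between grid points. The paper's exact test uses the LP-based procedure of Algorithm 2. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, ...]
GammaSet = Tuple[Vec, ...]


def envelope(gset: GammaSet, x: Sequence[float]) -> float:
    """Lower envelope of a set of vectors at belief point x."""
    return min(sum(xi * gi for xi, gi in zip(x, g)) for g in gset)


def dominated_on(points: Sequence[Sequence[float]]) -> Callable[[GammaSet, GammaSet], bool]:
    """Build an approximate test: a is dominated by b on the given points."""
    def dominated(a: GammaSet, b: GammaSet) -> bool:
        return all(envelope(a, x) <= envelope(b, x) + 1e-12 for x in points)
    return dominated


def build_superset(
    candidates: Sequence[GammaSet],
    dominated: Callable[[GammaSet, GammaSet], bool],
) -> List[GammaSet]:
    """Keep only sets that are not pairwise dominated by another kept set."""
    kept: List[GammaSet] = []
    for cand in candidates:
        # Skip a candidate that an existing set already dominates.
        if any(dominated(cand, existing) for existing in kept):
            continue
        # Otherwise add it and evict existing sets that it dominates.
        kept = [e for e in kept if not dominated(e, cand)]
        kept.append(cand)
    return kept


grid = [(1 - k / 10, k / 10) for k in range(11)]
A = ((1.0, 1.0),)
B = ((2.0, 2.0),)
C = ((3.0, 0.0), (0.0, 3.0))
D = ((5.0, 0.0), (5.0, 5.0))
print(build_superset([A, B, C, D], dominated_on(grid)))
```

In this toy instance B evicts A, C is rejected because B dominates it, and D survives alongside B because neither dominates the other.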
7.2 Determine the Set
We now determine the set by further removing the dominated sets from . Assume is given. , let be the function value attained by the superset , , and be the value of function attained by the set , . Let DOMINANCE_MIP() determine whether the set is a dominated set, which can be evaluated via the following mixed integer program (1):
[TABLE]
where is a large positive number.
The objective function is to find the minimal gap between the two functions, and . As is a piecewise linear and concave function on , it can be easily determined by the first constraint. The second and the third constraints define . For the purpose of explanation, , we further introduce a variable as the minimum value attained by set , i.e., . Then, (i) and the multiple-choice constraint on s ensure that there is exactly one selected to define ; (ii) , by the definition of . Hence, the combination of (i) and (ii) leads to the second constraint and variables s can be omitted. The last equation ensures that the belief states are in a nonnegative simplex.
Clearly, if the objective value , then is not a supporting set for and should be eliminated. We need to solve number of MIPs to finalize . The pseudocode for determining from its superset is presented in Algorithm 4.
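The question answered by this step can be illustrated without a MIP solver in the two-state case, where the gap function is piecewise linear in a single belief coordinate and its minimum is attained at an endpoint or at an intersection of two lines. The sketch below assumes the gap is the best value attained by the other sets minus the value of the set under test, and treats a set as supporting only if the gap is strictly negative somewhere; both the sign convention and the names are assumptions, and the code is not the paper's MIP (1).

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]


def line(g: Vec, p: float) -> float:
    """Value of a vector at belief (1 - p, p)."""
    return g[0] * (1.0 - p) + g[1] * p


def candidate_points(vectors: Sequence[Vec]) -> List[float]:
    """Endpoints of [0, 1] plus all interior pairwise intersections."""
    pts = {0.0, 1.0}
    for g, h in combinations(vectors, 2):
        denom = (g[1] - g[0]) - (h[1] - h[0])
        if abs(denom) > 1e-12:
            p = (h[0] - g[0]) / denom
            if 0.0 < p < 1.0:
                pts.add(p)
    return sorted(pts)


def envelope(gset: Sequence[Vec], p: float) -> float:
    return min(line(g, p) for g in gset)


def min_gap(target: Sequence[Vec], others: Sequence[Sequence[Vec]]) -> float:
    """Minimum over beliefs of (best other set value) - (target set value)."""
    vectors = list(target) + [g for s in others for g in s]
    return min(
        max(envelope(s, p) for s in others) - envelope(target, p)
        for p in candidate_points(vectors)
    )


def is_supporting(target, others, tol: float = 1e-9) -> bool:
    return min_gap(target, others) < -tol


S1, S2, S3 = [(4.0, 0.0)], [(0.0, 4.0)], [(1.5, 1.5)]
print(is_supporting(S3, [S1, S2]), is_supporting(S1, [S2, S3]))
```

Here S3 is not pairwise dominated by S1 or by S2 alone, yet it is dominated by the two together, which is the situation the second step is designed to detect.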
8 Piecewise Linear Concave Approximation
Given a set of -vectors , the value function is piecewise linear but not concave. The iterative algorithm we developed requires a piecewise linear and concave function for the next iteration. We thus approximate by a function satisfying the following conditions:
- (i) is piecewise linear and concave on ;
- (ii) ;
- (iii) the distance between and is as small as possible, where we define the distance between two bounded functions as
[TABLE]
Equivalently, for each , we want to determine a set so that satisfies conditions (ii) and (iii). For computational efficiency, we consider the case where in this paper. We do acknowledge that may be further improved by constructing for some instances. Determining a general procedure for finding the best piecewise linear and concave approximation of an arbitrary piecewise linear function is an interesting research topic for the future. Furthermore, an advantage of selecting is that each is still associated with an action pair . Thus, it is easy to explain and implement the policy associated with the lower bound .
We remark that the maximal gap between two functions for a given must occur at (i) where two segments of (or ) intersect, or (ii) extreme points of . Thus, , we could determine the set satisfying conditions (ii) and (iii) by a finite set of belief points . Given , the pseudocode of determining is outlined in Algorithm 5.
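The remark on where the maximal gap occurs can be checked numerically in the two-state case. The sketch below evaluates the gap between a max-of-min function and a concave lower bound only at the endpoints and at pairwise intersections of the underlying lines. It assumes the approximation is the lower envelope of a single set of vectors; the function names and numbers are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]


def line(g: Vec, p: float) -> float:
    """Value of a vector at belief (1 - p, p)."""
    return g[0] * (1.0 - p) + g[1] * p


def candidate_points(vectors: Sequence[Vec]) -> List[float]:
    """Extreme points of [0, 1] and interior intersections of two segments."""
    pts = {0.0, 1.0}
    for g, h in combinations(vectors, 2):
        denom = (g[1] - g[0]) - (h[1] - h[0])
        if abs(denom) > 1e-12:
            p = (h[0] - g[0]) / denom
            if 0.0 < p < 1.0:
                pts.add(p)
    return sorted(pts)


def max_gap(
    sets: Sequence[Sequence[Vec]], approx: Sequence[Vec]
) -> Tuple[float, float]:
    """Return (largest gap, belief coordinate where it is attained)."""
    vectors = [g for s in sets for g in s] + list(approx)

    def gap(p: float) -> float:
        v = max(min(line(g, p) for g in s) for s in sets)  # non-concave function
        v_tilde = min(line(g, p) for g in approx)          # concave lower bound
        return v - v_tilde

    best_p = max(candidate_points(vectors), key=gap)
    return gap(best_p), best_p


sets = [[(4.0, 0.0), (3.0, 3.0)], [(0.0, 4.0), (3.0, 3.0)]]
print(max_gap(sets, sets[0]))
print(max_gap(sets, [(1.5, 1.5)]))
```

Comparing the two calls shows how different concave lower bounds can be ranked by their worst-case gap before any optimization model is solved.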
We initialize the set in Step 1 by including the extreme points of the belief space and at least one witness point for each -vector in . These witness points are generated by the PURGE operation discussed in Section 6. We develop a concave approximation MIP in Step 2 to construct an initial set based on . As condition (ii) is only enforced on the set in Step 2, Step 3 further determines whether condition (ii) is violated on . If there is an at which , we update by including . Step 4 determines , the maximal distance between the and . To improve the approximation quality and reduce the gap between and , we also add the belief point at which the maximal distance is attained to the new set . The program continues to update based on . The entire procedure stops when no further improvement is identified. When it stops, condition (ii) is guaranteed on and the maximal distance between and its approximate value is bounded above by . We now detail each step in the following subsections.
8.1 Concave Approximation on
Assume is given. Let be a set of belief points in , , be the maximum function values attained at by the set , and be the function values attained at by the set , i.e., , and .
For each , define a binary variable if and if . Thus, . Let be the distance between and on the set , that is, . With the aid of additional binary variables for evaluating , we seek the set by the following mixed integer program (2):
[TABLE]
where is a large positive number.
Minimizing the distance between and (on ) is equivalent to minimizing the maximal gap . The expression is added to the objective function in order to close the gap on . The multiplier on is to ensure that the two quantities are of the same order of magnitude. The first to the fourth constraints compute . Specifically, the first constraint ensures that is bounded above by the approximation function constructed by . Each binary variable is associated with a -vector in and a belief point . The second and the third constraints are necessary to guarantee that , there exists one and only one defining vector such that . The fourth constraint implies that if satisfies , then . The fifth constraint is based on the observation that , hence, . The second-to-last constraint guarantees that on , and the last constraint determines the maximal gap between and on .
We can enhance the performance of the MIP (2) by providing a good feasible solution that exploits the structural results of . Note that for any given , computed by the set is a lower bound to and satisfies all three conditions. Pick any . Let and . Then is a feasible solution to the MIP (2). Determining such initial solutions is straightforward and computationally inexpensive.
8.2 Verification on
The concave approximation MIP (2) only ensures that the condition (ii) is satisfied on . The following mixed integer program (3) further checks whether the condition is satisfied on :
[TABLE]
where is a large positive number.
The objective function is to minimize the difference between the two functions for a given : and its approximation . Thus, MIP (3) is the same as MIP (1) where: (i) the value is determined by the first constraint, and (ii) the second and the third constraints and the binary variable associated with each vector determine .
If at the belief state , then should be added to , and both of the MIPs (2) and (3) should be re-solved. The process should continue until .
8.3 Approximation Error
We now determine , the maximal difference between based on and its approximation based on , by the following MIP (4):
[TABLE]
where is a large positive number.
The objective function is to find the maximal gap between and . The first two constraints compute on the basis of . As for a given , the binary variable for each and the multiple-choice constraint on ensure that there is exactly one selected to compute . Meanwhile, for the selected . Similarly, the third and the fourth constraints compute . The binary variable associated with each -vector in and its multiple-choice constraint guarantee that there exists one and only one defining .
The approximation error is bounded above by the objective value , assuming at point . To improve the approximation quality, we also include to update and . Let be the second best leader’s action for . The following Proposition shows that this procedure guarantees . That is, the approximation function at any belief point is no worse than the performance induced by the leader’s second best action (on W). Moreover, the approximation error of the proposed approach could be zero when there is a dominant action of the leader.
Proposition 6.
. Furthermore, if there is a leader’s action such that pairwise dominates , then .
Proof.
The first result follows from the facts that (i) is a feasible solution to the MIP (2); (ii) it is a second best minimizer of ; and (iii) the construction of the set (Step 4 in Algorithm 5). The second result follows as the set satisfies the conditions (i)-(iii) and the pairwise dominance assumption implies . ∎
Furthermore, let be the (nonlinear) operator such that , is the approximation of satisfying the conditions (i)-(iii), and . At each iteration, Algorithm 1 evaluates and approximates by . Thus,
Proposition 7.
 and .
Proof.
 is obvious by the definition of and the fact that if , then . For the second part, note if , then . By the definition of , . Similarly, we also have . Thus, . Now, . ∎
Thus, if we solve the finite horizon problem for a larger and larger , the lower bound function will also converge (and will be a lower bound of the fixed point ). Although we only focus on the finite horizon case in this paper, this result shows that the developed algorithm can also be used to obtain, at least approximately, a lower bound of the leader’s value function in the infinite planning horizon. We remark that there is no algorithm for the infinite-horizon general-sum POSG in the literature yet.
9 A Security Application
In this section, we describe a class of dynamic resource allocation problems in a security context, where the developed model and solution procedure can be employed.
Consider a defender with limited defensive resources protecting a set of critical nodes against an intelligent adversary over time. A node could be a manufacturing/factory site, a computer on a network, or a security checkpoint in an airport. At any time, an adversary can choose to breach the security of any node at certain levels. The objective of the defender is to minimize the number of breaches and losses generated by these violations. However, it can be difficult to quantify the reward structure of the adversary, its value over each node, and its rationality. Moreover, due to limited resources and capabilities, the defender may not be fully aware of the attacker state (e.g., exact locations, attack capabilities); but the defender may infer the attacker’s state through reported locations, historical attack records, screening, sensors and detectors, and unstructured text data from social media. On the other hand, the defender’s state is also only partially observable to the adversary in many realistic scenarios (e.g., some defensive resources can be camouflaged). While attacking the system, the adversary may adjust its behavior and target based on its updated information on the defensive resource allocation. In order to provide decision support to the defender, we could model such problems as a leader-follower POSG under the worst-case scenario, where the defender is the leader and the adversary is the follower. Furthermore, our solution procedure can determine dynamic defense policies that adjust resource allocation in a timely manner on the basis of all available real-time information in order to protect the system.
Dynamic defense policies are a class of defense policies aligned with the trending philosophy of “Moving Target Defense” (MTD) (Miehling et al. 2015). Currently, the static configuration and operation of a system has presented adversaries with an incredible advantage as adversaries can take their time to study the system and plan attacks (Department of Homeland Security 2015). The focus of MTD is to dynamically change a system in order to shift or reduce the attack surface that can be exploited by adversaries to attack the system (Zhuang et al. 2014).
Many concrete examples in the dynamic risk assessment and security game literature are in this application area, including cybersecurity and traveling inspection (Ahmadi et al. 2018; Bakir and Kardes 2009; Haskell et al. 2014; Kardes 2014; Lopez et al. 2013; Poolsappasit et al. 2012; Shameli-Sendi et al. 2012; Yang et al. 2014; Wang et al. 2019). Here, we use the liquid egg production problem presented in Zhang (2013) for illustration. For completeness, we briefly restate the security problem in Section 9.1.
9.1 Problem Description
Liquid egg products are widely used by the food service industry and as ingredients in other food products such as mayonnaise and ice cream (United States Department of Agriculture Food Safety and Inspection Service 2015). A deliberate contamination of the liquid egg products by an adversary will breach food safety, leading to excessive morbidity and mortality. Zhang (2013) identified the critical components of a liquid egg production process, including collecting vats, raw material tanks, pasteurization, and finished product tanks (Fig. 6). An unknown adversary may use this system as a toxin delivery vehicle by inserting a toxin (e.g., botulinum) at these components (“targets”). The consequence of such an attack at each component is defined as the number of contaminated packages, and the numerical values of the consequence have been analyzed in the literature.
We now illustrate how to use the developed method to support the production manager with limited resources in selecting a sequence of actions for protecting the system against an unknown adversary, in order to maximize the long-run productivity of the production facility. We allow for multiple attacks, and each attack can be successful or unsuccessful. An unsuccessful attack occurs when the adversary launches an attack but fails to insert any toxin into the system (e.g., the adversary is caught by the manager during the attack). In that case, the production process will not be affected, and the manager needs to prepare for the next possible attacks. After a successful attack, however, the manager has to stop the production process to remove the inserted toxin and clean up the system. Thus, the game stops whenever a successful attack occurs. As the pasteurization process can significantly reduce the effectiveness of the botulinum toxin, we assume the manager needs to protect three targets: Collecting Vat (Target 1), Raw Production Tank (Target 2), and the Finished Product Tank (Target 3).
State spaces, action spaces, and system dynamics: We assume the manager can only protect one target each time (e.g., visit and inspect one critical component each time). Thus, the state of the manager is the target under protection. The state space of the manager is , where the “Attacked” state indicates toxin has been successfully inserted to the system. The manager decides which target to protect dynamically based on its own information data.
The state of the adversary is the location of the adversary. Hence, = {Target 1, Target 2, Target 3, Attacked}, and . At Target , the adversary can either attack the target or switch to another target. Thus, there are 3 actions for each agent (9 action pairs) in each state.
The system transitions to a new state once both agents have selected actions.
Observation space: The manager’s observations of the adversary are the possible locations of the adversary. Thus, , . We assume that the manager has the ability to detect an attack (e.g., by testing) if the attack has successfully occurred. Specifically, the observation matrix is , where and .
Reward structure, criterion, and objective: The system can produce number of qualified packages under normal operations. A successful attack with 2000 grams of botulinum at location can result in number of contaminated packages, {Target 1, Target 2, Target 3}. The numerical values of and are estimated by the simulation model developed in Zhang (2013). If there is no attack, the reward of the manager is its normal productivity . If a successful attack has been detected, no package will be produced because the manager has to stop the production and clean up the system. We assume the manager will receive an additional bonus for successfully preventing an attack. Let () be the probability of having a successful attack at a protected (unprotected) target. Assume . The cleanup cost is for the manager to remove toxin from the system after a successful attack. Thus,
[TABLE]
The criterion of the manager is the expected finite horizon total discounted reward , where we assume and for illustrative purpose. The objective of the manager is to maximize the value of criterion under the worst-case scenario.
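The discounted criterion can be illustrated for a single sample path. The sketch below assumes the standard form in which the reward received at stage t is weighted by the t-th power of the discount factor, and that the path simply ends when the game stops. The reward sequence and the discount value are hypothetical and are not the parameters used in the paper.

```python
from typing import Sequence


def discounted_total(rewards: Sequence[float], beta: float) -> float:
    """Total discounted reward of one sample path (stage 0 is undiscounted)."""
    return sum((beta ** t) * r for t, r in enumerate(rewards))


# Hypothetical path: three stages of normal production, then the game stops.
print(discounted_total([1.0, 1.0, 1.0], 0.5))
```

Averaging this quantity over many simulated paths gives a Monte Carlo estimate of the expected criterion under a fixed pair of policies.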
9.2 Numerical Results
We first use to illustrate the procedure in Algorithm 1. Table 1 summarizes the -vectors after the PURGE and DOMINANCE operations for “Attacked” at . Thus, . Fig. 1 shows the graph of the true value function and its approximation projected on the non-absorbing states of the follower (i.e., “Attacked”). Clearly, (in blue) is not a concave function and (in red) is indeed the best piecewise linear concave approximation function of . Let be the region where the approximation is accurate, i.e., . Then (in terms of the Lebesgue measure), and the maximal approximation error () occurs around the extreme point .
Fig. 7 shows the convergence result of the overall procedure. The maximum deviation of the value function from
[TABLE]
declines as the algorithm proceeds, and the value function converges to after 27 iterations.
The entire solution procedure was performed on an Intel 3.10 GHz processor with 6.00 GB of memory. The total computation time for was 156.31 seconds, where the PURGE, DOMINANCE, and APPROXIMATION operations accounted for 4.78%, 11.86%, and 83.45%, respectively. As at least one witness point was associated with each -vector in the concave approximation MIP (2), the sizes of MIPs in the APPROXIMATION step are significantly larger than those of the MIPs in the PURGE step and DOMINANCE step (could be 1020 times larger). We remark that the dynamic programming approach for POSGs in Hansen et al. (2004) quickly ran out of memory after horizon for a small two-agent problem (the sizes of state, observation, and action spaces for each agent are all two).
To illustrate our results and support the validation of the solution procedure and computational results, we consider the following three scenarios:
- (i) Scenario 1: The policy for the leader is constructed according to Algorithm 1, and the follower selects the action that minimizes the leader’s criterion value;
- (ii) Scenario 2: The policy for the leader is constructed according to Algorithm 1, and the follower randomly selects its action;
- (iii) Scenario 3: The leader protects the most vulnerable target (which is the target that generates the largest negative consequence if attacked), and the follower selects the action that minimizes the leader’s criterion value.
We assume the measure of performance is the leader’s total discounted reward for each sample path and present in Fig. 8 the distribution of the performance measure for each scenario, based on 1000 simulations. We note the following from Fig. 8:
- (i) Scenario 1 has a lower mean than Scenario 2, which illustrates that if the policy for the leader is constructed according to Algorithm 1 and the follower uses policies other than selecting actions to minimize the leader’s criterion value (e.g., irrational), then the performance of the leader will only improve. That is, the proposed algorithm protects the leader against all possible decision making processes of the adversary.
- (ii) Scenario 1 has a higher mean than Scenario 3, which illustrates that if the leader uses policies other than the policy constructed according to Algorithm 1 and the follower selects actions that minimize the leader’s criterion, the performance of the leader will degrade. It also shows that constantly protecting the most vulnerable target is not always the best option for the leader in the presence of an intelligent adversary. This is because in a dynamic environment, the adversary will learn the leader’s strategy and thus attack other targets accordingly. It further illustrates the importance of developing dynamic defensive strategies and defensible systems for protecting against intelligent and adaptive adversaries.
10 Conclusions
This paper introduced a worst-case analysis to a general-sum partially observable stochastic game with two non-cooperative agents, a leader and a follower, where the leader has little knowledge of the follower. A general-sum partially observable stochastic game can be transformed to a simpler single-agent problem under the worst-case scenario; however, it cannot be transformed to a standard POMDP. While the worst-case modeling can be viewed as a POMDP with imprecise parameters, there are as yet no established algorithms for these problems in the literature. In order to determine a baseline performance for the leader, we showed the optimal value function of the leader can be non-convex and developed a recursive algorithm to construct a lower bound of the leader’s value function in the finite horizon and its associated policy. We further analyzed the quality of the lower bound and showed that the proposed procedure can be used to approximately evaluate the leader’s performance in the infinite horizon case. The use of the model and the solution procedure was illustrated by a liquid egg production example in a security context.
The lower bound was constructed by the sets . Future research should further improve the lower bound by efficiently searching . Another research direction would be a detailed discussion for the infinite planning horizon problem. Moreover, the developed analysis provided a benchmark result for an agent in a dynamic, multi-agent partially observable stochastic environment. Thus, the follow-up research may include investigating the value of improved understanding of the adversarial behaviors by comparing with these benchmark results.
References
- [1] Aghassi M, Bertsimas D (2006), Robust game theory. Math. Program. 107(1), 231-273.
- [2] Ahmadi M, Cubuktepe M, Jansen N, Junges S, Katoen J, Topcu U (2018), The partially observable games we play for cyber deception. Retrieved from https://arxiv.org/pdf/1810.00092.pdf.
- [3] An B, Shieh E, Yang R, Tambe M, Baldwin C, Di Renzo J, Maule B, Meyer G (2014), Protect-a deployed game theoretic system for strategic security allocation for the United States coast guard. AI Mag. 33(4), 96-110.
- [4] Bakir NO, Kardes E (2009), A stochastic game model on overseas cargo container security. Technical Report, May 3.
- [5] De Berg M, Cheong O, van Kreveld M, Overmars M (2008), Computational Geometry: Algorithms and Applications. Springer-Verlag, Berlin.
- [6] Bernstein DS, Givan R, Immerman N, Zilberstein S (2002), The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 27(4), 819-840.
- [7] Bier VM, Oliveros S, Samuelson L (2007), Choosing what to protect. J. Public Econ. Theory 9(4), 563-587.
- [8] Camerer C (2011), Behavioral game theory: Experiments in strategic interaction. Princeton University Press.
