Wasserstein Distributionally Robust Stochastic Control: A Data-Driven Approach

Insoon Yang

arXiv:1812.09808·math.OC·October 13, 2021

TL;DR

This paper develops a data-driven approach for designing control policies that are robust against distribution errors, using Wasserstein metrics and dynamic programming, with theoretical guarantees and explicit solutions for linear-quadratic cases.

Contribution

It introduces computational algorithms for Wasserstein distributionally robust control, extending performance guarantees from single-stage to multi-stage problems without loss of confidence.

Findings

1. Proposes tractable value and policy iteration algorithms.

2. Provides explicit forms for optimal policies in linear-quadratic problems.

3. Establishes out-of-sample performance guarantees using measure concentration.

Tables

Table 1: Computation time (in seconds) for the investment-consumption problem with different grid sizes

# of states	36	71	141	281
Time (sec)	288.69	854.61	2086.15	9350.04

Table 2: The amount of time (in seconds) required to decrease and maintain the mean frequency deviation below 1%

Bus	1	2	3	4	5	6	7	8	9	10
$\pi_{\hat{w}}^{LQG}$	73.5	70.3	59.3	21.5	21.5	24.2	21.3	62.5	36.5	27.7
$\pi_{\hat{w}}^{\prime}$	25.0	24.2	19.8	12.4	12.3	11.6	12.2	20.8	14.3	14.3

Equations

x_{t+1}=f(x_{t},u_{t},w_{t}),

\text{(SAA-control)}\quad\inf_{\pi\in\Pi}\;\mathbb{E}^{\pi}_{w_{t}\sim\nu_{N}}\bigg[\sum_{t=0}^{\infty}\alpha^{t}c(x_{t},u_{t})\mid x_{0}=\bm{x}\bigg],

\nu_{N}:=\frac{1}{N}\sum_{i=1}^{N}\delta_{\hat{w}^{(i)}}

J_{\bm{x}}(\pi,\gamma):=\mathbb{E}^{\pi,\gamma}\bigg[\sum_{t=0}^{\infty}\alpha^{t}c(x_{t},u_{t})\mid x_{0}=\bm{x}\bigg],

|c(\bm{x},\bm{u})|\leq b\,\xi(\bm{x})\quad\forall(\bm{x},\bm{u})\in\mathbb{K},

\sup_{\gamma\in\Gamma}J_{\bm{x}}(\pi^{\star},\gamma)\leq\sup_{\gamma'\in\Gamma}J_{\bm{x}}(\pi,\gamma')\quad\forall\pi\in\Pi.

\text{(DR-control)}\quad\inf_{\pi\in\Pi}\sup_{\gamma\in\Gamma}J_{\bm{x}}(\pi,\gamma),

\mathcal{D}:=\{\mu\in\mathcal{P}(\mathcal{W})\mid W_{p}(\mu,\nu_{N})\leq\theta\}.

W_{p}(\mu,\nu_{N}):=\min_{\kappa\in\mathcal{P}(\mathcal{W}^{2})}\bigg\{\bigg[\int_{\mathcal{W}^{2}}d(w,w')^{p}\,\kappa(\mathrm{d}w,\mathrm{d}w')\bigg]^{\frac{1}{p}}\mid\Pi^{1}\kappa=\mu,\ \Pi^{2}\kappa=\nu_{N}\bigg\},

W_{p}(\mu,\nu_{N})^{p}=\sup_{\varphi,\psi\in\Phi}\bigg[\int_{\mathcal{W}}\varphi(w)\,\mu(\mathrm{d}w)+\int_{\mathcal{W}}\psi(w')\,\nu_{N}(\mathrm{d}w')\bigg],

\mathcal{D}=\bigg\{\mu\in\mathcal{P}(\mathcal{W})\mid\int_{\mathcal{W}}\varphi(w)\,\mu(\mathrm{d}w)+\frac{1}{N}\sum_{i=1}^{N}\inf_{w\in\mathcal{W}}[d(w,\hat{w}^{(i)})^{p}-\varphi(w)]\leq\theta^{p}\;\;\forall\varphi\in L^{1}(\mathrm{d}\mu)\bigg\}.

(Tv)(\bm{x}):=\inf_{\bm{u}\in\mathcal{U}(\bm{x})}\sup_{\bm{\mu}\in\mathcal{D}}\bigg[c(\bm{x},\bm{u})+\alpha\int_{\mathcal{W}}v(f(\bm{x},\bm{u},w))\,\bm{\mu}(\mathrm{d}w)\bigg]

\|v\|_{\xi}:=\sup_{\bm{x}\in\mathcal{X}}\frac{|v(\bm{x})|}{\xi(\bm{x})}.

\|Tv-Tv'\|_{\xi}\leq\tau\|v-v'\|_{\xi}\quad\forall v,v'\in\mathbb{B}_{lsc}(\mathcal{X}).

Tv\leq Tv'\quad\forall v,v'\in\mathbb{B}_{\xi}(\mathcal{X})\ \text{s.t.}\ v\leq v'.

v=Tv;

v^{\star}(\bm{x})=\sup_{\bm{\mu}\in\mathcal{D}}\bigg[c(\bm{x},\pi^{\star}(\bm{x}))+\alpha\int_{\mathcal{W}}v^{\star}(f(\bm{x},\pi^{\star}(\bm{x}),w))\,\bm{\mu}(\mathrm{d}w)\bigg]

v^{\star}(\bm{x})=\inf_{\pi\in\Pi}\sup_{\gamma\in\Gamma}J_{\bm{x}}(\pi,\gamma)=\sup_{\gamma\in\Gamma}J_{\bm{x}}(\pi^{\star},\gamma)\quad\forall\bm{x}\in\mathcal{X}.

\begin{split}(Tv)(\bm{x})=\inf_{\bm{u},\lambda,\ell}\;&\bigg[\lambda\theta^{p}+c(\bm{x},\bm{u})+\frac{1}{N}\sum_{i=1}^{N}\ell_{i}\bigg]\\ \text{s.t.}\;&\alpha v(f(\bm{x},\bm{u},w))-\lambda d(w,\hat{w}^{(i)})^{p}\leq\ell_{i}\;\;\forall w\in\mathcal{W}\\ &\bm{u}\in\mathcal{U}(\bm{x}),\ \lambda\geq 0,\ \ell\in\mathbb{R}^{N}\end{split}

(Tv)(\bm{x})=\inf_{\bm{u}\in\mathcal{U}(\bm{x}),\lambda\geq 0}\bigg[\lambda\theta^{p}+\int_{\mathcal{W}}\sup_{w\in\mathcal{W}}\big[c(\bm{x},\bm{u})+\alpha v(f(\bm{x},\bm{u},w))-\lambda d(w,w')^{p}\big]\,\nu_{N}(\mathrm{d}w')\bigg].

\|v^{\pi_{\epsilon}}-v^{\star}\|_{\xi}<\epsilon

v^{\pi}(\bm{x}):=\sup_{\gamma\in\Gamma}J_{\bm{x}}(\pi,\gamma).

v_{k+1}(\bm{x}):=(Tv_{k})(\bm{x})

\hat{\pi}(\bm{x}):=\hat{\bm{u}},

(T^{\pi}v)(\bm{x}):=\sup_{\bm{\mu}\in\mathcal{D}}\bigg[c(\bm{x},\pi(\bm{x}))+\alpha\int_{\mathcal{W}}v(f(\bm{x},\pi(\bm{x}),w))\,\bm{\mu}(\mathrm{d}w)\bigg]

\|T^{\pi}v-T^{\pi}v'\|_{\xi}\leq\tau\|v-v'\|_{\xi}\quad\forall v,v'\in\mathbb{B}_{\xi}(\mathcal{X}),

T^{\pi}v\leq T^{\pi}v'\quad\forall v,v'\in\mathbb{B}_{\xi}(\mathcal{X})\ \text{s.t.}\ v\leq v'.

(T^{\pi}v)(\bm{x})-\epsilon<c(\bm{x},\pi(\bm{x}))+\alpha\int_{\mathcal{W}}v(f(\bm{x},\pi(\bm{x}),w))\,\hat{\bm{\mu}}(\mathrm{d}w).

(T^{\pi}v)(\bm{x})-(T^{\pi}v')(\bm{x})-\epsilon<\alpha\int_{\mathcal{W}}[v(f(\bm{x},\pi(\bm{x}),w))-v'(f(\bm{x},\pi(\bm{x}),w))]\,\hat{\bm{\mu}}(\mathrm{d}w)\leq\alpha\int_{\mathcal{W}}\|v-v'\|_{\xi}\,\xi(f(\bm{x},\pi(\bm{x}),w))\,\hat{\bm{\mu}}(\mathrm{d}w)\leq\alpha\|v-v'\|_{\xi}\,\beta\,\xi(\bm{x}),

k>\frac{\log[(1-\tau)^{2}\epsilon]-\log(2b\tau)}{\log\tau},

Full text

Wasserstein Distributionally Robust Stochastic Control:

A Data-Driven Approach

Insoon Yang

Department of Electrical and Computer Engineering, Automation and Systems Research Institute, Seoul National University ([email protected]). Supported in part by NSF under ECCS-1708906 and CNS-1657100, Research Resettlement Fund for the new faculty of Seoul National University (SNU), the Creative-Pioneering Researchers Program through SNU, the Basic Research Lab Program through the National Research Foundation of Korea funded by the MSIT (2018R1A4A1059976), and Samsung Electronics.

Abstract

Standard stochastic control methods assume that the probability distribution of uncertain variables is available. Unfortunately, in practice, obtaining accurate distribution information is a challenging task. To resolve this issue, we investigate the problem of designing a control policy that is robust against errors in the empirical distribution obtained from data. This problem can be formulated as a two-player zero-sum dynamic game problem, where the action space of the adversarial player is a Wasserstein ball centered at the empirical distribution. We propose computationally tractable value and policy iteration algorithms with explicit estimates of the number of iterations required for constructing an $\epsilon$ -optimal policy. We show that the contraction property of associated Bellman operators extends a single-stage out-of-sample performance guarantee, obtained using a measure concentration inequality, to the corresponding multi-stage guarantee without any degradation in the confidence level. In addition, we characterize an explicit form of the optimal distributionally robust control policy and the worst-case distribution policy for linear-quadratic problems with Wasserstein penalty. Our study indicates that dynamic programming and Kantorovich duality play a critical role in solving and analyzing the Wasserstein distributionally robust stochastic control problems.

1 Introduction

The theory of stochastic optimal control is based on the assumption that the probability distribution of uncertain variables (e.g., disturbances) is fully known. However, this assumption is often restrictive in practice, because estimating an accurate distribution requires large-scale high-resolution sensor measurements over a long training period or multiple periods. Situations in which uncertain variables are not directly observed are much more challenging; computational methods, such as filtering or statistical learning techniques, are often used to obtain the (posterior) distribution of the uncertain variables given limited observations. The accuracy of the obtained distribution is often unsatisfactory, as it is subject to the quality of the collected data, computational methods, and prior knowledge regarding the variables. If poor distributional information is employed in constructing a stochastic optimal controller, it does not guarantee optimality and can even cause catastrophic system behaviors (e.g., [1, 2]).

To overcome this issue of limited distribution information in stochastic control, we investigate a distributionally robust control approach. This emerging minimax stochastic control method minimizes a cost function of interest, assuming that the distribution of uncertain variables is not completely known, but is contained in a pre-specified ambiguity set of probability distributions. In this paper, we model the ambiguity set as a statistical ball centered at an empirical distribution with a radius measured by the Wasserstein metric. This modeling approach provides a straightforward means to incorporate data samples into distributionally robust control problems. Our focus is to show that the resulting stochastic control problems have several salient features in terms of computational tractability and out-of-sample performance guarantee.

Due to its superior statistical properties, the Wasserstein ambiguity set has recently received a great deal of attention in distributionally robust optimization (e.g., [3, 4, 5, 6]), learning (e.g., [7, 8]) and filtering [9]. Specifically, the Wasserstein ball contains both continuous and discrete distributions, whereas statistical balls defined by a $\phi$-divergence (such as the Kullback-Leibler divergence) and centered at a discrete empirical distribution are not sufficiently rich to contain relevant continuous distributions. Furthermore, the Wasserstein metric accounts for the closeness between two points in the support, unlike the $\phi$-divergence. Because the $\phi$-divergence cannot take into account the distance between two support elements, the associated ambiguity set may contain irrelevant distributions [5]. For these reasons, we chose the Wasserstein metric to handle distribution ambiguity, although several other types of ambiguity sets have been proposed in the context of single-stage optimization by using moment constraints (e.g., [10, 11, 12]), confidence sets (e.g., [13]), and $\phi$-divergences (e.g., [14, 15]).

1.1 Related Work

Distributionally robust sequential decision-making problems have been studied in the context of finite Markov decision processes (MDPs) and continuous-state stochastic control. In the finite MDP setting, dynamic programming (DP) approaches have been proposed [16, 17, 18]. In [16], moment-based ambiguity sets are used to impose constraints on the moments of distributions, such as mean and covariance. This approach is further extended to handle more types of constraints, such as confidence sets and mean absolute deviation [17], by using the lifting technique given in [13]. Distributionally robust MDPs with Wasserstein balls are studied in [18], which provides computationally tractable reformulations and useful analytical properties.

Continuous-state distributionally robust control problems can be considered as a class of minimax stochastic control on Borel spaces [19]. In the case of linear dynamics and quadratic cost functions, [20] focuses on linear policies and proposes a tractable semidefinite program formulation when moment constraints are imposed. A DP method is also proposed for moment-based ambiguity sets and applied to probabilistic safety specification problems [21]. On the other hand, [22] uses a total variation ball to model distribution ambiguity and proposes a modified version of the classical policy iteration algorithm. Furthermore, a Riccati equation-based approach is developed in the linear-quadratic regulator setting with the total variation ambiguity set [23] and the relative entropy constraint [24].

1.2 Contributions

Departing from the aforementioned control approaches that indirectly use data samples, we consider continuous-state distributionally robust control problems with Wasserstein ambiguity sets and develop a dynamic programming method to solve and analyze problems by directly using the data. The following is a summary of the main contributions of this work. First, we propose computationally tractable value and policy iteration algorithms with explicit estimates of the number of iterations necessary for obtaining an $\epsilon$-optimal policy. The original Bellman equation involves an infinite-dimensional minimax optimization problem, where the inner maximization problem is over probability measures in the Wasserstein ball. To alleviate the computational issue without sacrificing optimality, we reformulate Bellman operators by using modern distributionally robust optimization (DRO) based on Kantorovich duality [3, 5]. Second, we show that the resulting distributionally robust policy $\pi^{\star}$ has a probabilistic out-of-sample performance guarantee by using the contraction property of associated Bellman operators and a measure concentration inequality. In other words, when $\pi^{\star}$ is used, a probabilistic bound holds on the closed-loop performance evaluated under a new set of samples that are selected independently of the training data. We observe that the contraction property of the Bellman operator seamlessly connects a single-stage performance guarantee to its multi-stage counterpart in a manner that is independent of the number of stages. Third, we consider a Wasserstein penalty problem and derive an explicit expression of the optimal control policy and the worst-case distribution policy, along with a Riccati-type equation in the linear-quadratic setting. We also show that the resulting control policy converges to the optimal policy of the corresponding linear-quadratic-Gaussian (LQG) problem as the penalty parameter tends to $+\infty$.
The performance and utility of the proposed method are demonstrated through an investment-consumption problem and a power system frequency control problem.

This paper is significantly extended from its preliminary version [25], which models distribution ambiguity by using confidence sets. Specifically, we consider Wasserstein ambiguity sets and investigate new salient features of the corresponding distributionally robust control framework such as $(i)$ a characterization of the worst-case distribution policy, $(ii)$ an out-of-sample performance guarantee, and $(iii)$ an explicit expression of the solution to linear-quadratic problems.

1.3 Organization

In Section 2, we define optimal distributionally robust policies under ambiguous uncertainty and formulate the corresponding distributionally robust stochastic control problem as a dynamic game. In Section 3, we develop a tractable semi-infinite program formulation of the Bellman equation and characterize one of the worst-case distribution policies by using Kantorovich duality. In Section 4, we examine a probabilistic out-of-sample performance guarantee of the distributionally robust policy. In Section 5, we present the Wasserstein penalty problem and its explicit solution obtained from a Riccati-type equation. Finally, in Section 6, we provide the results of our numerical experiments.

1.4 Notation

Given a Borel space $X$, we denote by $\mathcal{P}(X)$ the set of Borel probability measures on $X$. In addition, $\mathbb{B}_{\xi}(X)$ denotes the Banach space of measurable functions $v$ on $X$ with a finite weighted sup-norm, i.e., $\|v\|_{\xi}:=\sup_{\bm{x}\in X}(|v(\bm{x})|/\xi(\bm{x}))<\infty$, given a measurable weight function $\xi:X\to\mathbb{R}$. Let $\mathbb{B}_{lsc}(X)$ be the set of lower semicontinuous functions in $\mathbb{B}_{\xi}(X)$.

2 Distributionally Robust Control of Stochastic Systems

2.1 Ambiguity in Stochastic Systems

Consider a discrete-time stochastic system of the form

$$x_{t+1}=f(x_{t},u_{t},w_{t}),$$

where $x_{t}\in\mathcal{X}\subseteq\mathbb{R}^{n}$ and $u_{t}\in\mathcal{U}\subseteq\mathbb{R}^{m}$ denote the system state and control input, respectively. Here, $w_{t}\in\mathcal{W}\subseteq\mathbb{R}^{l}$ is a random disturbance. The probability distribution of $w_{t}$ is denoted by $\mu_{t}$ . However, in practice, the probability distribution is not fully known and is difficult to estimate accurately. We assume that $\mathcal{X}$ , $\mathcal{U}$ and $\mathcal{W}$ are Borel subsets of $\mathbb{R}^{n}$ , $\mathbb{R}^{m}$ and $\mathbb{R}^{l}$ , respectively.

Suppose that $w_{t}$ ’s are i.i.d. and that we have access to the sample $\{\hat{w}^{(1)},\ldots,\hat{w}^{(N)}\}$ of $w_{t}$ . One of the most straightforward approaches is to use the sample average approximation (SAA) method and solve the corresponding optimal control problem with the empirical distribution. This SAA-control problem can be formulated as

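$$\text{(SAA-control)}\quad\inf_{\pi\in\Pi}\;\mathbb{E}^{\pi}_{w_{t}\sim\nu_{N}}\bigg[\sum_{t=0}^{\infty}\alpha^{t}c(x_{t},u_{t})\mid x_{0}=\bm{x}\bigg],$$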

where $\nu_{N}$ denotes the empirical distribution constructed from the $N$ -samples:

$$\nu_{N}:=\frac{1}{N}\sum_{i=1}^{N}\delta_{\hat{w}^{(i)}}$$

with the Dirac delta measure $\delta_{\hat{w}^{(i)}}$ concentrated at $\hat{w}^{(i)}$. Here, $\alpha\in(0,1)$ is a discount factor, $c:\mathcal{X}\times\mathcal{U}\to\mathbb{R}$ is a stage-wise cost function of interest, and $\mathbb{E}_{w_{t}\sim\nu_{N}}^{\pi}$ denotes the expected value taken with respect to the probability measure induced by the control policy $\pi$ and the empirical distribution $\nu_{N}$. As the number of samples, $N$, tends to infinity, the empirical distribution $\nu_{N}$ approximates the true distribution $\mu$ well; thus, an optimal policy of the SAA-control problem achieves near-optimal performance.
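As a rough illustration of what sampling from the empirical distribution means in the SAA-control objective, the following Python sketch estimates the (truncated) discounted cost of a fixed policy by drawing each $w_{t}$ uniformly from the stored samples. The function names, the finite `horizon` truncation, and the Monte Carlo rollout scheme are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def empirical_expectation(g, samples):
    """E_{w ~ nu_N}[g(w)] is the plain sample mean (1/N) * sum_i g(w_hat_i)."""
    return float(np.mean([g(w) for w in samples]))

def saa_discounted_cost(f, c, policy, x0, samples, alpha, horizon, n_rollouts, rng):
    """Monte Carlo estimate of the discounted cost, truncated at `horizon` stages,
    when every disturbance is drawn i.i.d. from the empirical distribution."""
    total = 0.0
    for _ in range(n_rollouts):
        x, discount, cost = x0, 1.0, 0.0
        for _ in range(horizon):
            u = policy(x)
            cost += discount * c(x, u)
            # Sampling from nu_N: pick one stored sample uniformly at random.
            w = samples[rng.integers(len(samples))]
            x = f(x, u, w)
            discount *= alpha
        total += cost
    return total / n_rollouts
```

The truncation error of such an estimate is at most $\alpha^{T}/(1-\alpha)$ times a bound on the stage cost when the cost is bounded and the horizon is $T$.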

Unfortunately, it takes a long simulation period or multiple episodes to obtain a large number of samples. Furthermore, in practice, it is likely that the sample data do not reflect the true distribution due to inaccurate sensor measurements or data corruption by malicious attackers (e.g., hackers). To resolve these issues in data-driven stochastic control, we propose an optimization method to construct a policy that is robust against errors in the empirical distribution (2.3). More specifically, our policy minimizes the worst-case total cost that is calculated under a probability distribution contained in a given set $\mathcal{D}\subset\mathcal{P}(\mathcal{W})$ , which is called the ambiguity set of probability distributions. The ambiguity set can be designed to adequately characterize errors in the empirical distribution.

2.2 Distributionally Robust Policy

To formulate a concrete distributionally robust control problem, we consider a Markov (or stochastic) game with complete information (e.g., [26, 19]), which is a class of two-player zero-sum dynamic games: Player I (controller) determines a policy to minimize the total cost while Player II (adversary) selects the disturbance distribution $\mu_{t}$ of $w_{t}$ from the ambiguity set $\mathcal{D}$ to maximize the same cost value. Let $H_{t}$ be the set of histories up to stage $t$, whose element is of the form $h_{t}:=(x_{0},u_{0},\cdots,x_{t-1},u_{t-1},x_{t})$ (see Footnote 1). The set of admissible control strategies (for Player I) is given by $\Pi:=\{\pi:=(\pi_{0},\pi_{1},\ldots)\>|\>\pi_{t}(\mathcal{U}(x_{t})|h_{t})=1\;\forall h_{t}\in H_{t}\}$, where $\pi_{t}$ is a stochastic kernel from $H_{t}$ to $\mathbb{R}^{m}$ and $\mathcal{U}(x_{t})\subseteq\mathcal{U}$ is the set of admissible control actions (given that the system state is $x_{t}$ at stage $t$). Similarly, the set of Player II's admissible strategies is defined by $\Gamma:=\{\gamma:=(\gamma_{0},\gamma_{1},\ldots)\>|\>\gamma_{t}(\mathcal{D}|h_{t}^{e})=1\;\forall h_{t}^{e}\in H_{t}^{e}\}$, where $H_{t}^{e}$ is the set of extended histories up to stage $t$, whose element is of the form $h_{t}^{e}:=(x_{0},u_{0},\mu_{0},\cdots,x_{t-1},u_{t-1},\mu_{t-1},x_{t},u_{t})$, and $\gamma_{t}$ is a stochastic kernel from $H_{t}^{e}$ to $\mathcal{P}(\mathcal{W})$. Note that the ambiguity set $\mathcal{D}$ is the action space of Player II. Here, we allow Player II to change the distribution of $w_{t}$ over time. Thus, the strategy space for Player II is larger than necessary, and this gives an advantage to the adversary. However, we will show later that an optimal policy of Player II is stationary under some assumption (see Proposition 5).

Footnote 1: All the results in this paper are valid with histories of the form $\tilde{h}_{t}:=(x_{0},u_{0},w_{0},\mu_{0},\cdots,x_{t-1},u_{t-1},w_{t-1},\mu_{t-1},x_{t})$ that also contain Player II's actions $(\mu_{0},\cdots,\mu_{t-1})$; that is because under Assumption 1, without loss of optimality, it suffices to focus on stationary policies that depend only on current state information. We intentionally use the reduced version of histories, as the realized distributions may not be observable in practice.

We consider the following infinite-horizon discounted cost function:

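$$J_{\bm{x}}(\pi,\gamma):=\mathbb{E}^{\pi,\gamma}\bigg[\sum_{t=0}^{\infty}\alpha^{t}c(x_{t},u_{t})\mid x_{0}=\bm{x}\bigg],$$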

where $\mathbb{E}^{\pi,\gamma}$ denotes expectation with respect to the probability measure induced by the strategy pair $(\pi,\gamma)\in\Pi\times\Gamma$ .

Before defining a concrete stochastic control problem, we impose the following standard assumption for measurable selection in semicontinuous models [19]:

Assumption 1.

Let $\mathbb{K}:=\{(\bm{x},\bm{u})\in\mathcal{X}\times\mathcal{U}\mid\bm{u}\in\mathcal{U}(\bm{x})\}$.

1. The function $c$ is lower semicontinuous on $\mathbb{K}$, and

$$|c(\bm{x},\bm{u})|\leq b\,\xi(\bm{x})\quad\forall(\bm{x},\bm{u})\in\mathbb{K},$$

for some constant $b\geq 0$ and continuous function $\xi:\mathcal{X}\to[1,\infty)$ such that ${\xi}^{\prime}(\bm{x},\bm{u}):=\int_{\mathcal{W}}\xi(f(\bm{x},\bm{u},w))\bm{\mu}(\mathrm{d}w)$ is continuous on $\mathbb{K}$ for any $\bm{\mu}\in\mathcal{D}$. In addition, there exists a constant $\beta\in[1,1/\alpha)$ such that ${\xi}^{\prime}(\bm{x},\bm{u})\leq\beta\xi(\bm{x})$ for all $(\bm{x},\bm{u})\in\mathbb{K}$;

2. For each continuous bounded function $\chi:\mathcal{X}\to\mathbb{R}$, the function ${\chi}^{\prime}(\bm{x},\bm{u}):=\int_{\mathcal{W}}\chi(f(\bm{x},\bm{u},w))\bm{\mu}(\mathrm{d}w)$ is continuous on $\mathbb{K}$ for any $\bm{\mu}\in\mathcal{D}$;

3. The set $\mathcal{U}(\bm{x})$ is compact for every $\bm{x}\in\mathcal{X}$, and the set-valued mapping $\bm{x}\mapsto\mathcal{U}(\bm{x})$ is upper semicontinuous.

The first condition trivially holds when $c$ is bounded. In fact, $\xi$ is a weight function introduced to relax the boundedness assumption. Assumption 1 ensures the existence of an optimal policy $\pi^{\star}$ , which is deterministic and stationary, of a minimax control problem with the cost function (2.6) [19, Theorem 4.1]. Furthermore, the corresponding optimal value function lies in $\mathbb{B}_{lsc}(\mathcal{X})$ as discussed later.

We now define the optimal distributionally robust policies as follows:

Definition 1.

A control policy $\pi^{\star}\in\Pi$ is said to be an optimal distributionally robust policy if it satisfies

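$$\sup_{\gamma\in\Gamma}J_{\bm{x}}(\pi^{\star},\gamma)\leq\sup_{\gamma'\in\Gamma}J_{\bm{x}}(\pi,\gamma')\quad\forall\pi\in\Pi.$$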

In words, an optimal distributionally robust policy achieves the minimal cost under the most adverse policies that select disturbance distributions in the ambiguity set $\mathcal{D}$ . Such a desirable policy can be obtained by solving the following problem:

$$\text{(DR-control)}\quad\inf_{\pi\in\Pi}\sup_{\gamma\in\Gamma}J_{\bm{x}}(\pi,\gamma),$$

which we call the distributionally robust control (DR-control) problem. The existence of an optimal policy under Assumption 1 will be formalized in Theorem 1 in Section 3.1.

The most important part of this formulation is the inner maximization problem over all disturbance distribution policies in $\Gamma$ , which encodes distributional uncertainty through $\mathcal{D}$ . An optimal policy $\pi^{\star}$ has a performance guarantee in the form of an upper-bound, $\sup_{\gamma\in\Gamma}J_{\bm{x}}(\pi^{\star},\gamma)$ , if the ambiguity set is sufficiently large to contain the true distribution. This performance guarantee may not be valid when a different control policy is used, as shown in (2.5).

2.3 Wasserstein Ambiguity Set

To complete the formulation of the DR-control problem, we consider a specific class of ambiguity sets using the Wasserstein metric. Let $\mathcal{D}$ be a statistical ball centered at the empirical distribution $\nu_{N}$ defined by (2.3) with radius $\theta>0$ :

$$\mathcal{D}:=\{\mu\in\mathcal{P}(\mathcal{W})\mid W_{p}(\mu,\nu_{N})\leq\theta\}.$$

Here, the distance between the two probability distributions is measured by the Wasserstein metric of order $p\in[1,\infty)$ ,

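$$W_{p}(\mu,\nu_{N}):=\min_{\kappa\in\mathcal{P}(\mathcal{W}^{2})}\bigg\{\bigg[\int_{\mathcal{W}^{2}}d(w,w')^{p}\,\kappa(\mathrm{d}w,\mathrm{d}w')\bigg]^{\frac{1}{p}}\mid\Pi^{1}\kappa=\mu,\ \Pi^{2}\kappa=\nu_{N}\bigg\},$$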

where $d$ is a metric on $\mathcal{W}$ , and $\Pi^{i}\kappa$ denotes the $i$ th marginal of $\kappa$ for $i=1,2$ . The Wasserstein distance between two probability distributions represents the minimum cost of transporting or redistributing mass from one to another via non-uniform perturbation, and the optimization variable $\kappa$ can be interpreted as a transport plan.
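To make the transport-cost interpretation concrete, here is a minimal Python sketch for a special case: two empirical distributions on the real line with the same number of equally weighted samples and $d(w,w^{\prime})=|w-w^{\prime}|$. In that case, the monotone coupling that matches sorted samples is an optimal transport plan, so no optimization solver is needed. The function name `wasserstein_1d` and the equal-sample-size restriction are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def wasserstein_1d(a, b, p=1):
    """Order-p Wasserstein distance between two equally weighted 1-D samples.

    Assumes len(a) == len(b) and d(w, w') = |w - w'|; the optimal plan then
    pairs the k-th smallest point of `a` with the k-th smallest point of `b`.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equally many samples in a and b")
    return float(np.mean(np.abs(a - b) ** p) ** (1.0 / p))
```

For instance, shifting every sample by a constant $s$ gives a distance of exactly $|s|$ for any order $p$, which matches the intuition that all mass is moved by the same amount.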

The minimization problem to identify an optimal transport plan $\kappa$ in (2.8) is called the Monge-Kantorovich problem. The minimum of this problem can be found by solving the following dual problem:

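$$W_{p}(\mu,\nu_{N})^{p}=\sup_{\varphi,\psi\in\Phi}\bigg[\int_{\mathcal{W}}\varphi(w)\,\mu(\mathrm{d}w)+\int_{\mathcal{W}}\psi(w')\,\nu_{N}(\mathrm{d}w')\bigg],$$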

where $\Phi:=\{(\varphi,\psi)\in L^{1}(\mathrm{d}\mu)\times L^{1}(\mathrm{d}\nu_{N})\mid\varphi(w)+\psi(w^{\prime})\leq d(w,w^{\prime})^{p}\;\forall w,w^{\prime}\in\mathcal{W}\}$. This equivalence is known as the Kantorovich duality principle. Then, the Wasserstein ball (2.7) can be expressed as follows:

Lemma 1.

The Wasserstein ambiguity set defined by (2.7) is equivalent to

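$$\mathcal{D}=\bigg\{\mu\in\mathcal{P}(\mathcal{W})\mid\int_{\mathcal{W}}\varphi(w)\,\mu(\mathrm{d}w)+\frac{1}{N}\sum_{i=1}^{N}\inf_{w\in\mathcal{W}}[d(w,\hat{w}^{(i)})^{p}-\varphi(w)]\leq\theta^{p}\;\;\forall\varphi\in L^{1}(\mathrm{d}\mu)\bigg\}.$$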

A proof for this lemma is contained in Appendix A. Note that the minimization problem in the reformulated Wasserstein ball is finite dimensional, unlike the original Monge-Kantorovich problem. In the following section, we propose computationally tractable value and policy iteration algorithms by using the reformulation results in DRO based on Kantorovich duality.
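The following LaTeX sketch outlines one standard way to see why a reformulation of this kind is plausible. It is only an informal outline under the assumption that the function $\psi_{\varphi}$ introduced below is integrable; the argument in Appendix A may proceed differently.

```latex
% Step 1 (assumed construction): for a fixed \varphi, the largest \psi that
% satisfies \varphi(w) + \psi(w') \le d(w,w')^p for all w, w' is
\psi_{\varphi}(w') := \inf_{w\in\mathcal{W}} \big[ d(w,w')^{p} - \varphi(w) \big].

% Step 2: integrating \psi_{\varphi} against \nu_N = \frac{1}{N}\sum_i \delta_{\hat{w}^{(i)}}
% turns the second integral in the dual into a finite sum:
W_{p}(\mu,\nu_{N})^{p}
  = \sup_{\varphi} \bigg[ \int_{\mathcal{W}} \varphi(w)\,\mu(\mathrm{d}w)
  + \frac{1}{N}\sum_{i=1}^{N} \inf_{w\in\mathcal{W}}
    \big[ d(w,\hat{w}^{(i)})^{p} - \varphi(w) \big] \bigg].

% Step 3: W_p(\mu,\nu_N) \le \theta holds iff the supremum is at most \theta^p,
% i.e., iff the bracketed expression is at most \theta^p for every admissible \varphi.
```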

3 Dynamic Programming Solution and Analysis

Our first goal is to develop a computationally tractable dynamic programming (DP) solution for the DR-control problem (2.6). We begin by characterizing an optimality condition using Bellman's principle.

3.1 Bellman’s Principle of Optimality

For any $v\in\mathbb{B}_{\xi}(\mathcal{X})$ , let $T$ be the Bellman operator of the DR-control problem (2.6), defined by

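$$(Tv)(\bm{x}):=\inf_{\bm{u}\in\mathcal{U}(\bm{x})}\sup_{\bm{\mu}\in\mathcal{D}}\bigg[c(\bm{x},\bm{u})+\alpha\int_{\mathcal{W}}v(f(\bm{x},\bm{u},w))\,\bm{\mu}(\mathrm{d}w)\bigg]$$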

for every $\bm{x}\in\mathcal{X}$ . Assumption 1 enables us to conduct the contraction analysis with respect to the weighted sup-norm $\|\cdot\|_{\xi}$ defined by

$$\|v\|_{\xi}:=\sup_{\bm{x}\in\mathcal{X}}\frac{|v(\bm{x})|}{\xi(\bm{x})}.$$

The second and third conditions in Assumption 1 play a critical role in preserving the lower semicontinuity of the value function when applying the Bellman operator as well as in the existence and optimality of deterministic stationary policies. Let $\Pi^{DS}$ be the set of deterministic stationary policies, i.e., $\Pi^{DS}:=\{\pi:\mathcal{X}\to\mathcal{U}\mid\pi(x_{t})=u_{t}\in\mathcal{U}(x_{t}),\ \pi\ \text{measurable}\}$. Then, the following lemmas hold:

Lemma 2 (Contraction and Monotonicity).

Suppose that Assumption 1 holds. Then, $Tv\in\mathbb{B}_{lsc}(\mathcal{X})$ for any $v\in\mathbb{B}_{lsc}(\mathcal{X})$. Furthermore, the Bellman operator $T:\mathbb{B}_{lsc}(\mathcal{X})\to\mathbb{B}_{lsc}(\mathcal{X})$ is a $\tau$-contraction mapping with respect to $\|\cdot\|_{\xi}$, where $\tau:=\alpha\beta\in(0,1)$ (see Footnote 2), i.e.,

$$\|Tv-Tv'\|_{\xi}\leq\tau\|v-v'\|_{\xi}\quad\forall v,v'\in\mathbb{B}_{lsc}(\mathcal{X}).$$

Furthermore, $T$ is monotone, i.e.,

$$Tv\leq Tv'\quad\forall v,v'\in\mathbb{B}_{\xi}(\mathcal{X})\ \text{s.t.}\ v\leq v'.$$

Footnote 2: Here, the constant $\beta\in[1,1/\alpha)$ is defined in Assumption 1-1).

Lemma 3 (Measurable selection).

Suppose that Assumption 1 holds. There exist a measurable function $v^{\star}\in\mathbb{B}_{lsc}(\mathcal{X})$ and a deterministic stationary policy $\pi^{\star}\in\Pi^{DS}$ such that

1. $v^{\star}$ is the unique function in $\mathbb{B}_{lsc}(\mathcal{X})$ that satisfies the following Bellman equation:

$$v=Tv;$$

2. given any fixed $\bm{x}\in\mathcal{X}$,

$$v^{\star}(\bm{x})=\sup_{\bm{\mu}\in\mathcal{D}}\bigg[c(\bm{x},\pi^{\star}(\bm{x}))+\alpha\int_{\mathcal{W}}v^{\star}(f(\bm{x},\pi^{\star}(\bm{x}),w))\,\bm{\mu}(\mathrm{d}w)\bigg]$$

and $\lim_{t\to\infty}\alpha^{t}\mathbb{E}^{\pi,\gamma}[v^{\star}(x_{t})]=0$ for all $(\pi,\gamma)\in\Pi\times\Gamma$.

These lemmas follow immediately from [19, Lemma 4.4 and Theorem 4.1]. In fact, for any $v\in\mathbb{B}_{lsc}(\mathcal{X})$ , there exists $\hat{\bm{u}}\in\mathcal{U}(\bm{x})$ such that $(Tv)(\bm{x})=\sup_{\bm{\mu}\in\mathcal{D}}[c(\bm{x},\hat{\bm{u}})+\alpha\int_{\mathcal{W}}v(f(\bm{x},\hat{\bm{u}},w))\>\bm{\mu}(\mathrm{d}w)]$ for every $\bm{x}\in\mathcal{X}$ under Assumption 1 (see [19, Lemma 3.3]). Thus, the outer minimization problem in the definition of $T$ admits an optimal solution when $v\in\mathbb{B}_{lsc}(\mathcal{X})$ , and “ $\inf$ ” can be replaced by “ $\min$ .” If we let $\pi^{\star}(\bm{x}):=\hat{\bm{u}}$ for each $\bm{x}\in\mathcal{X}$ , then $\pi^{\star}$ is an optimal distributionally robust policy, which is deterministic and stationary. More specifically, the following principle of optimality holds:

Theorem 1 (Existence and optimality of deterministic stationary policy).

Suppose that Assumption 1 holds. Then, $(v^{\star},\pi^{\star})\in\mathbb{B}_{lsc}(\mathcal{X})\times\Pi^{DS}$ defined in Lemma 3 satisfies

[TABLE]

In words, $v^{\star}$ is the optimal value function of the DR-control problem (2.6), and $\pi^{\star}$ is an optimal policy, which is deterministic and stationary.

The existence and optimality results are shown in a more general minimax control setting in [19, Theorem 4.1].

3.2 Value Iteration

To compute the optimal value function $v^{\star}$ , we first consider a value iteration (VI) approach, $v_{k+1}:=Tv_{k}$ , where $v_{k}$ denotes the value function evaluated at the $k$ th iteration and $v_{0}$ is initialized as an arbitrary function in $\mathbb{B}_{lsc}(\mathcal{X})$ . By the contraction property of $T$ (Lemma 2), the Banach fixed-point theorem implies that $v_{k}$ converges to $v^{\star}$ pointwise as $k$ tends to $\infty$ under Assumption 1. However, this approach requires us to solve the infinite-dimensional minimax optimization problem in the Bellman operator for each $\bm{x}\in\mathcal{X}$ in each iteration. To alleviate this issue, we reformulate the problem into a computationally tractable form by using modern Wasserstein DRO [3, 5].

Proposition 1.

Suppose that the function $w\mapsto v(f(\bm{x},\bm{u},w))$ lies in $L^{1}(\mathrm{d}\nu_{N})$ for each $(\bm{x},\bm{u})\in\mathbb{K}$ . Then, the Bellman operator $T$ can be expressed as

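$$\begin{aligned}(Tv)(\bm{x})=\inf_{\bm{u},\lambda,\ell}\quad&\lambda\theta^{p}+c(\bm{x},\bm{u})+\frac{1}{N}\sum_{i=1}^{N}\ell_{i}\\ \mbox{s.t.}\quad&\alpha v(f(\bm{x},\bm{u},w))-\lambda d(w,\hat{w}^{(i)})^{p}\leq\ell_{i}\quad\forall w\in\mathcal{W}\\ &\bm{u}\in\mathcal{U}(\bm{x}),\quad\lambda\geq 0,\quad\ell\in\mathbb{R}^{N}\end{aligned}\tag{3.2}$$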

for each $\bm{x}\in\mathcal{X}$ , where the first inequality constraint holds for all $i=1,\ldots,N$ .

This reformulation can be obtained by using Kantorovich duality on the Wasserstein ambiguity set (Lemma 1). It is shown in [5, Theorem 1] that there is no duality gap.

Note that the reformulated optimization problem in Proposition 1 has finite-dimensional decision variables, namely $\bm{u}\in\mathcal{U}(\bm{x})\subseteq\mathcal{U}\subseteq\mathbb{R}^{m}$ , $\lambda\in\mathbb{R}$ and $\ell\in\mathbb{R}^{N}$ . However, the first inequality constraint must hold for all $w$ in the support $\mathcal{W}$ , which could be a dense set. Thus, in general, the reformulated problem is a semi-infinite program. This semi-infinite program can be solved by using several existing convergent algorithms, such as discretization and sampling-based methods (see [27, 28, 29, 30] and the references therein).

To interpret this reformulation, we consider the following equivalent integral form:

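$$(Tv)(\bm{x})=\inf_{\bm{u}\in\mathcal{U}(\bm{x}),\,\lambda\geq 0}\Big[\lambda\theta^{p}+c(\bm{x},\bm{u})+\int_{\mathcal{W}}\sup_{w^{\prime}\in\mathcal{W}}\big[\alpha v(f(\bm{x},\bm{u},w^{\prime}))-\lambda d(w,w^{\prime})^{p}\big]\,\nu_{N}(\mathrm{d}w)\Big].$$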

The integrand above can be interpreted as a regularized cost-to-go function. The regularized value is then integrated using the empirical distribution $\nu_{N}$ . The first term $\lambda\theta^{p}$ , which is nonnegative, is added to compensate for this regularization effect and the optimism induced by the empirical distribution so that the reformulated optimization problem is consistent with the original one.
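To make the role of $\lambda$ and of the regularized integrand concrete, the following minimal sketch evaluates the standard Wasserstein DRO dual, $\inf_{\lambda\geq 0}\{\lambda\theta^{p}+\frac{1}{N}\sum_{i}\sup_{w}[g(w)-\lambda|w-\hat{w}^{(i)}|^{p}]\}$, for a scalar disturbance. It is an illustration under stated assumptions, not the solver used in the paper: the support $\mathcal{W}$ is replaced by a finite grid, every sample is assumed to be a grid point, and the function name `worst_case_expectation` and the ternary search over $\lambda$ are hypothetical choices made here.

```python
def worst_case_expectation(g, grid, samples, theta, p=1, iters=200):
    """Approximate the sup of E_mu[g(w)] over the Wasserstein ball of radius
    theta centered at the empirical distribution of `samples`.

    Assumptions of this sketch: W is the finite list `grid`, the metric is
    |w - w'|, and every sample is a grid point.
    """
    g_vals = [g(w) for w in grid]
    if theta == 0:
        # Degenerate ball: only the empirical distribution is feasible.
        return sum(g(s) for s in samples) / len(samples)

    def dual(lam):
        # lam * theta^p + (1/N) * sum_i max_w [ g(w) - lam * |w - w_i|^p ]
        total = 0.0
        for s in samples:
            total += max(gv - lam * abs(w - s) ** p for gv, w in zip(g_vals, grid))
        return lam * theta ** p + total / len(samples)

    # The dual objective is convex in lam. Because each sample is a grid point,
    # any lam above (max g - min g) / theta^p is worse than lam = 0.
    lo, hi = 0.0, (max(g_vals) - min(g_vals)) / theta ** p
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if dual(m1) <= dual(m2):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return dual(0.5 * (lo + hi))
```

With $\theta=0$ the function returns the sample average of $g$, and for a very large radius it returns the maximum of $g$ over the grid, which matches the interpretation of $\lambda\theta^{p}$ as the price of moving probability mass away from the data.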

We define an $\epsilon$ -optimal policy of (2.6) as $\pi_{\epsilon}\in\Pi$ that satisfies

$$\|v^{\pi_{\epsilon}}-v^{\star}\|_{\xi}<\epsilon$$

for $\epsilon>0$ , where $v^{\pi}:\mathcal{X}\to\mathbb{R}$ is the (worst-case) value function of a policy $\pi\in\Pi$ , i.e.,

[TABLE]

The following VI algorithm can be used to find an $\epsilon$ -optimal policy:

1. Initialize $v_{0}$ as an arbitrary function in $\mathbb{B}_{lsc}(\mathcal{X})$ , and set $k:=0$ ;

2. For each $\bm{x}\in\mathcal{X}$ , compute

$$v_{k+1}(\bm{x}):=(Tv_{k})(\bm{x})$$

by solving the semi-infinite program (3.2) with $v:=v_{k}$ ;

3. If the stopping criterion is met, then go to Step 4); otherwise, set $k\leftarrow k+1$ and go to Step 2);

4. For each $\bm{x}\in\mathcal{X}$ , set

$$\hat{\pi}(\bm{x}):=\hat{\bm{u}},$$

where $\hat{\bm{u}}$ is an optimal $\bm{u}$ of the semi-infinite program (3.2) that computes $(Tv_{k})(\bm{x})$ , and stop.
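The four steps above can be sketched on a small finite problem as follows. This is only an illustration under simplifying assumptions: the state, action and disturbance spaces are finite grids (so the semi-infinite program reduces to a one-dimensional search over $\lambda$), the weight function is $\xi\equiv 1$, every sample is a point of the disturbance grid, and all names and the toy data are hypothetical.

```python
def worst_case_mean(g_vals, grid, samples, theta, p=1, iters=60):
    """inf over lam >= 0 of lam*theta^p + (1/N) sum_i max_w [g(w) - lam*|w - w_i|^p]."""
    n = len(samples)
    if theta == 0:
        return sum(g_vals[grid.index(s)] for s in samples) / n

    def dual(lam):
        inner = sum(max(gv - lam * abs(w - s) ** p for gv, w in zip(g_vals, grid))
                    for s in samples)
        return lam * theta ** p + inner / n

    lo, hi = 0.0, (max(g_vals) - min(g_vals)) / theta ** p
    for _ in range(iters):  # ternary search on the convex dual objective
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if dual(m1) <= dual(m2):
            hi = m2
        else:
            lo = m1
    return dual(0.5 * (lo + hi))


def bellman(v, states, actions, cost, step, grid, samples, theta, alpha, p=1):
    """Apply the robust Bellman operator; return (Tv, greedy policy for v)."""
    new_v, policy = {}, {}
    for x in states:
        best_u, best_val = None, float("inf")
        for u in actions(x):
            g_vals = [alpha * v[step(x, u, w)] for w in grid]
            val = cost(x, u) + worst_case_mean(g_vals, grid, samples, theta, p)
            if val < best_val:
                best_u, best_val = u, val
        new_v[x], policy[x] = best_val, best_u
    return new_v, policy


def value_iteration(states, actions, cost, step, grid, samples, theta,
                    alpha=0.9, delta=1e-6, max_iter=1000):
    v = {x: 0.0 for x in states}                      # Step 1
    for _ in range(max_iter):
        new_v, policy = bellman(v, states, actions, cost, step,
                                grid, samples, theta, alpha)  # Step 2
        gap = max(abs(new_v[x] - v[x]) for x in states)
        v = new_v
        if gap < delta:                               # Step 3 (sup-norm test)
            break
    return v, policy                                  # Step 4: greedy policy


# Toy problem; every number below is an illustrative assumption.
states = list(range(5))
grid = [-1, 0, 1]        # discretized disturbance support
samples = [0, 0, 1]      # training data, all on the grid
actions = lambda x: [-1, 0, 1]
cost = lambda x, u: (x - 2) ** 2 + 0.5 * u ** 2
step = lambda x, u, w: min(max(x + u + w, 0), 4)

v_rob, pi_rob = value_iteration(states, actions, cost, step, grid, samples, theta=0.2)
v_nom, _ = value_iteration(states, actions, cost, step, grid, samples, theta=0.0)
```

Setting `theta=0.0` recovers the sample-average (SAA) recursion, so comparing `v_rob` with `v_nom` shows how much the ambiguity set raises the cost-to-go in this toy instance.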

Note that the existence of an optimal $\hat{\bm{u}}$ in Step 4) is guaranteed under Assumption 1 by [19, Lemma 3.3]. A typical stopping criterion in VI is $\|v_{k+1}-v_{k}\|_{\xi}<\delta$ for some threshold $\delta>0$ . However, we can even compute the number of iterations required to achieve the desired precision $\epsilon>0$ . Given any $\pi\in\Pi^{DS}$ and $v\in\mathbb{B}_{\xi}(\mathcal{X})$ , let

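$$(T^{\pi}v)(\bm{x}):=\sup_{\bm{\mu}\in\mathcal{D}}\Big[c(\bm{x},\pi(\bm{x}))+\alpha\int_{\mathcal{W}}v(f(\bm{x},\pi(\bm{x}),w))\,\bm{\mu}(\mathrm{d}w)\Big]$$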

for all $\bm{x}\in\mathcal{X}$ . The Bellman operator $T^{\pi}$ has the following properties:

Lemma 4.

Suppose that Assumption 1 holds. Then, given any $\pi\in\Pi^{DS}$ , we have $T^{\pi}v\in\mathbb{B}_{\xi}(\mathcal{X})$ for any $v\in\mathbb{B}_{\xi}(\mathcal{X})$ . Furthermore, the operator $T^{\pi}:\mathbb{B}_{\xi}(\mathcal{X})\to\mathbb{B}_{\xi}(\mathcal{X})$ is a $\tau$ -contraction mapping with respect to $\|\cdot\|_{\xi}$ , i.e.,

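$$\|T^{\pi}v-T^{\pi}v^{\prime}\|_{\xi}\leq\tau\|v-v^{\prime}\|_{\xi}\quad\forall v,v^{\prime}\in\mathbb{B}_{\xi}(\mathcal{X}),$$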

where $\tau:=\alpha\beta\in(0,1)$ . Furthermore, $T^{\pi}$ is monotone, i.e.,

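$$T^{\pi}v\leq T^{\pi}v^{\prime}\quad\forall v,v^{\prime}\in\mathbb{B}_{\xi}(\mathcal{X})\ \mbox{such that}\ v\leq v^{\prime}.$$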

Proof.

By Assumption 1, it is clear that $T^{\pi}v\in\mathbb{B}_{\xi}(\mathcal{X})$ if $v\in\mathbb{B}_{\xi}(\mathcal{X})$ . Fix arbitrary $v,v^{\prime}\in\mathbb{B}_{\xi}(\mathcal{X})$ , and an arbitrary $\bm{x}\in\mathcal{X}$ . For any $\epsilon>0$ , there exists $\hat{\bm{\mu}}\in\mathcal{D}$ such that

[TABLE]

Thus, we have

[TABLE]

where the last inequality holds due to Assumption 1-1). By switching the roles of $v$ and $v^{\prime}$ , we also have $(T^{\pi}v^{\prime})(\bm{x})-(T^{\pi}v)(\bm{x})-\epsilon\leq\alpha\beta\|v-v^{\prime}\|_{\xi}\xi(\bm{x})$ . Since the two inequalities hold for any $\bm{x}\in\mathcal{X}$ and $\epsilon>0$ , and $\tau=\alpha\beta$ , we conclude that $\|T^{\pi}v-T^{\pi}v^{\prime}\|_{\xi}\leq\tau\|v-v^{\prime}\|_{\xi}$ . It is straightforward to check that $T^{\pi}$ is monotone. ∎

This lemma implies that the value function $v^{\pi}$ is the unique fixed point of $T^{\pi}$ in $\mathbb{B}_{\xi}(\mathcal{X})$ . By using the contraction property of $T^{\pi}$ and $T$ , we can estimate the number of iterations needed to obtain an $\epsilon$ -optimal policy as follows:

Proposition 2.

Suppose that Assumption 1 holds. We assume that given $\epsilon>0$ , the total number of iterations, $k$ , in the VI algorithm satisfies

[TABLE]

where $b\geq 0$ and $\tau\in(0,1)$ are the constants defined in Assumption 1 and Lemma 4, respectively. Then, $\hat{\pi}$ obtained by the VI algorithm is an $\epsilon$ -optimal policy, i.e.,

$$\|v^{\hat{\pi}}-v^{\star}\|_{\xi}<\epsilon.$$

Proof.

By Lemma 4 and Theorem 1, we have $v^{\hat{\pi}},v_{k},v^{\star}\in\mathbb{B}_{\xi}(\mathcal{X})$ . We observe that

[TABLE]

where the last inequality holds because of Lemma 4, $T^{\hat{\pi}}v_{k}=Tv_{k}$ and $v^{\star}=Tv^{\star}$ . By Lemma 2, we have

[TABLE]

On the other hand, by [19, Theorem 4.2 (a)],

[TABLE]

where the second inequality holds due to the proposed choice of $k$ . Combining (3.4) and (3.5), we conclude that $\|v^{\hat{\pi}}-v^{\star}\|_{\xi}<\epsilon$ . ∎
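The mechanism behind such an iteration estimate can be illustrated with the generic geometric bound for a $\tau$-contraction: $\|v_{k}-v^{\star}\|_{\xi}\leq\tau^{k}\|v_{0}-v^{\star}\|_{\xi}$, combined with the greedy-policy bound $\frac{2\tau}{1-\tau}\|v_{k}-v^{\star}\|_{\xi}$. The sketch below computes the smallest $k$ that makes this product less than $\epsilon$. It is a hedged illustration only: `d0` is an assumed upper bound on $\|v_{0}-v^{\star}\|_{\xi}$ supplied by the user, and the resulting count is not the explicit constant stated in Proposition 2.

```python
import math


def iterations_for_tolerance(tau, eps, d0):
    """Smallest k >= 0 with (2*tau/(1-tau)) * tau**k * d0 < eps.

    tau : contraction modulus in (0, 1)
    eps : target accuracy for the greedy policy
    d0  : assumed upper bound on ||v_0 - v*|| (hypothetical input)
    """
    if not 0 < tau < 1:
        raise ValueError("tau must lie in (0, 1)")
    if eps <= 0:
        raise ValueError("eps must be positive")
    factor = 2 * tau / (1 - tau) * d0
    if factor < eps:
        return 0
    k = math.ceil(math.log(eps / factor) / math.log(tau))
    # Guard against rounding at the boundary so the strict inequality holds.
    while factor * tau ** k >= eps:
        k += 1
    return k
```

Because the bound is geometric, the count grows only logarithmically in $1/\epsilon$, but it grows quickly as $\tau=\alpha\beta$ approaches one.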

A practical implementation of the VI algorithm requires a finite-state approximation such as a discretization of the state space. A review of such approximation methods can be found in a recent monograph [31].

3.3 Policy Iteration

Policy iteration (PI) is an alternative way to construct an $\epsilon$ -optimal policy. The PI algorithm can be described as follows:

1. Initialize $\pi_{0}$ as an arbitrary policy in $\Pi^{DS}$ , and set $k:=0$ ;

2. (Policy evaluation) Find the fixed point $v^{\pi_{k}}$ of $T^{\pi_{k}}$ ;

3. (Policy improvement) For each $\bm{x}\in\mathcal{X}$ , set

$$\pi_{k+1}(\bm{x}):=\tilde{\bm{u}},$$

where $\tilde{\bm{u}}$ is an optimal $\bm{u}$ of the semi-infinite program (3.2) that computes $(Tv^{\pi_{k}})(\bm{x})$ ;

4. If the stopping criterion is met, then stop and set $\tilde{\pi}:=\pi_{k+1}$ . Otherwise, set $k\leftarrow k+1$ and go to Step 2).

Here, the stopping criterion can be chosen as $\|v^{\pi_{k}}-v^{\pi_{k-1}}\|_{\xi}<\delta$ for a positive constant $\delta$ . To perform the policy evaluation step (Step 2) in a computationally tractable manner, we reformulate the infinite-dimensional maximization problem in the definition of $T^{\pi}$ as finite dimensional by using Wasserstein DRO [3, 5].

Proposition 3.

Suppose that Assumption 1 holds and that $v\in\mathbb{B}_{\xi}(\mathcal{X})$ . Then, the operator $T^{\pi}:\mathbb{B}_{\xi}(\mathcal{X})\to\mathbb{B}_{\xi}(\mathcal{X})$ satisfies

[TABLE]

where ${B}:=\big{\{}(\underline{w}^{(1)},\ldots,\underline{w}^{(N)},\overline{w}^{(1)},\ldots,\overline{w}^{(N)})\in\mathcal{W}^{2N},q\in\Delta\mid\frac{1}{N}\sum_{i=1}^{N}[q_{1}d(\underline{w}^{(i)},\hat{w}^{(i)})^{p}+q_{2}d(\overline{w}^{(i)},\hat{w}^{(i)})^{p}]\leq\theta^{p}\big{\}}$ .

This proposition follows immediately from [5, Corollary 2]. The optimization variables $\underline{w}^{(1)},\ldots,\underline{w}^{(N)}$ , $\overline{w}^{(1)},\ldots,\overline{w}^{(N)}$ can be interpreted as the probability atoms that characterize one of the worst-case distributions. By the contraction property of $T^{\pi_{k}}$ (Lemma 4), we can find the fixed point $v^{\pi_{k}}$ of $T^{\pi_{k}}$ by value iteration. In other words, we perform $v_{\tau+1}\leftarrow T^{\pi_{k}}v_{\tau}$ , $\tau=0,1,\ldots$ , until convergence. When computing $T^{\pi_{k}}v_{\tau}$ , we solve the finite-dimensional optimization problem in Proposition 3 with $v:=v_{\tau}$ to completely remove the infinite-dimensionality issue inherent in the definition of $T^{\pi_{k}}$ . In the policy improvement step, we use the semi-infinite program formulation of $T$ in Proposition 1 instead of directly solving the infinite-dimensional minimax optimization problem in the definition of $T$ . It is well known that $\lim_{k\to\infty}\|v^{\pi_{k}}-v^{\star}\|_{\xi}=0$ under Assumption 1 by the monotonicity and contraction properties of $T$ and $T^{\pi_{k}}$ (Lemmas 2 and 4) [32, Proposition 2.5.4].

However, it is usually difficult to find the exact fixed point $v^{\pi_{k}}$ of $T^{\pi_{k}}$ in the policy evaluation step. Thus, we propose a modified PI algorithm, which is also called optimistic policy iteration [33, 32]:

1. Initialize $\tilde{v}_{0}$ as an arbitrary function in $\mathbb{B}_{lsc}(\mathcal{X})$ and $\{M_{k}\}$ as a sequence of positive integers, and set $k:=1$ ;

2. (Policy improvement) For each $\bm{x}\in\mathcal{X}$ , set

$$\pi_{k}(\bm{x}):=\tilde{\bm{u}},$$

where $\tilde{\bm{u}}$ is an optimal $\bm{u}$ of the semi-infinite program (3.2) that computes $(T\tilde{v}_{k-1})(\bm{x})$ ;

3. (Policy evaluation) Compute

$$\tilde{v}_{k}:=(T^{\pi_{k}})^{M_{k}}\tilde{v}_{k-1}$$

by solving the finite-dimensional optimization problems in Proposition 3;

4. If the stopping criterion is met, then stop and set $\tilde{\pi}:=\pi_{k}$ . Otherwise, set $k\leftarrow k+1$ and go to Step 2).
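The structure of the modified PI algorithm above (greedy improvement followed by a fixed number of applications of $T^{\pi_{k}}$) can be sketched as follows. To keep the example short, the Wasserstein ball is replaced by a finite list `dists` of candidate distributions; this is a simplification introduced here, not the paper's ambiguity set, and the function names, the constant sequence $M_{k}\equiv M$ and the toy data are all hypothetical.

```python
def q_value(v, x, u, cost, step, dists, alpha):
    """c(x,u) + alpha * worst-case expected next value over the finite set `dists`.

    Each element of `dists` is a dict {disturbance value: probability}.
    """
    return cost(x, u) + alpha * max(
        sum(prob * v[step(x, u, w)] for w, prob in mu.items()) for mu in dists)


def modified_policy_iteration(states, actions, cost, step, dists, alpha, M, n_iter):
    """Optimistic PI with a constant number M of evaluation sweeps per iteration."""
    v = {x: 0.0 for x in states}
    pi = {}
    for _ in range(n_iter):
        # Policy improvement: greedy action with respect to the current v.
        pi = {x: min(actions(x),
                     key=lambda u: q_value(v, x, u, cost, step, dists, alpha))
              for x in states}
        # Partial policy evaluation: apply T^pi M times instead of
        # computing the exact fixed point v^pi.
        for _ in range(M):
            v = {x: q_value(v, x, pi[x], cost, step, dists, alpha) for x in states}
    return v, pi


# Toy data; every number below is an illustrative assumption.
states = list(range(5))
actions = lambda x: [-1, 0, 1]
cost = lambda x, u: (x - 2) ** 2 + 0.5 * u ** 2
step = lambda x, u, w: min(max(x + u + w, 0), 4)
dists = [{0: 2 / 3, 1: 1 / 3}, {0: 0.5, 1: 0.5}, {-1: 0.2, 0: 0.5, 1: 0.3}]

v_mpi, pi_mpi = modified_policy_iteration(states, actions, cost, step, dists,
                                          0.9, M=5, n_iter=400)
v_vi, pi_vi = modified_policy_iteration(states, actions, cost, step, dists,
                                        0.9, M=1, n_iter=400)
```

With `M=1` each outer iteration reduces to one application of $T$, so the second call is plain value iteration; the two runs should agree up to numerical tolerance on this toy instance.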

Note that the modified PI algorithm approximately evaluates the performance of a policy $\pi_{k}$ as $\tilde{v}_{k}$ instead of finding the exact fixed point of $T^{\pi_{k}}$ . Concrete choices of the order sequence $\{M_{k}\}$ are discussed in [34]. However, for any choice of $\{M_{k}\}$ , the modified PI algorithm converges under Assumption 1 [32]:

[TABLE]

As in the case of VI, we can estimate the number of iterations required for obtaining an $\epsilon$ -optimal policy.

Proposition 4.

Suppose that Assumption 1 holds. Let $r\in\mathbb{R}$ be a positive constant such that

[TABLE]

We assume that given $\epsilon>0$ , the total number of iterations, $k$ , in the modified PI algorithm satisfies

[TABLE]

where $\tau\in(0,1)$ is the constant defined in Lemma 4. Then, $\tilde{\pi}:=\pi_{k}$ obtained by the modified PI algorithm is an $\epsilon$ -optimal policy, i.e.,

$$\|v^{\tilde{\pi}}-v^{\star}\|_{\xi}<\epsilon.$$

Proof.

According to Lemma 4 and Theorem 1, we have $v^{\tilde{\pi}},\tilde{v}_{k},v^{\star}\in\mathbb{B}_{\xi}(\mathcal{X})$ . By [32, Lemma 2.5.4], we obtain that

[TABLE]

which implies that

[TABLE]

On the other hand, $\tilde{\pi}=\pi_{k}$ is a greedy policy when the value function is chosen as $\tilde{v}_{k-1}$ . As in the proof of Proposition 2, we have $\|v^{\tilde{\pi}}-v^{\star}\|_{\xi}\leq\frac{2\tau}{1-\tau}\|\tilde{v}_{k-1}-v^{\star}\|_{\xi}$ . Thus, by (3.6),

[TABLE]

where the second inequality holds due to the proposed choice of $k$ . ∎

3.4 The Worst-Case Distribution Policy

Given a policy $\pi\in\Pi^{DS}$ (for Player I), the worst-case distribution policy (for Player II) can be found by solving

[TABLE]

which is an optimal control problem. By the dynamic programming principle, the worst-case value function $v^{\pi}$ , defined by (3.3), is the unique solution to the following Bellman equation:

[TABLE]

under Assumption 1. The worst-case value function $v^{\pi}$ can be computed, for example, via value iteration. Given $v^{\pi}$ , how can we characterize the worst-case distribution policy? The following proposition indicates that, if the optimization problem involved in $(T^{\pi}v^{\pi})(\bm{x})$ admits an optimal solution for all $\bm{x}\in\mathcal{X}$ , then there exists an optimal policy for Player II, which is deterministic and stationary, and it generates a finitely-supported worst-case distribution.

Proposition 5 (Worst-case distribution policy).

Suppose that Assumption 1 holds, and that given $\pi\in\Pi^{DS}$

[TABLE]

admits an optimal solution for any $\bm{x}\in\mathcal{X}$ . Then, the deterministic stationary policy $\gamma^{\pi}:\mathcal{X}\to\mathcal{D}$ defined by

[TABLE]

is an optimal policy (for Player II) that generates a worst-case distribution for each state $\bm{x}\in\mathcal{X}$ , where $w_{\bm{x}}^{\pi}:=(\underline{w}_{\bm{x}}^{\pi,(1)},\ldots,\underline{w}_{\bm{x}}^{\pi,(N)},\overline{w}_{\bm{x}}^{\pi,(1)},\ldots,\overline{w}_{\bm{x}}^{\pi,(N)})$ is an optimal solution of the maximization problem in Proposition 3 with $v:=v^{\pi}$ .

The existence of an optimal policy, which is deterministic and stationary, follows from the dynamic programming principle when the assumptions in the proposition hold. Thus, it is sufficient for Player II to use the same state-dependent worst-case distribution map $\gamma^{\pi}$ at every stage. The structure of $\gamma^{\pi}(\bm{x})$ is obtained by applying [5, Corollary 1] to the maximization problem in the proposition. Note that the worst-case distribution of this form is consistent with the discussion below Proposition 3. By using [5, Corollary 2], we obtain the following sharper characterization of the worst-case distribution with $N+1$ atoms: if the assumptions in Proposition 5 hold, one of the worst-case distribution policies has the form

[TABLE]

where $i_{0}\in\{1,\ldots,N\}$ , $p_{0}\in[0,1]$ , $\underline{w}_{\bm{x}}^{\pi,(i_{0})},\overline{w}_{\bm{x}}^{\pi,(i_{0})}\in\operatorname*{arg\,min}_{w\in\mathcal{W}}\{\lambda^{\star}d(w,\hat{w}^{(i_{0})})^{p}-\alpha v(f(\bm{x},\pi(\bm{x}),w))\}$ , and ${w}_{\bm{x}}^{\pi,(i)}\in\operatorname*{arg\,min}_{w\in\mathcal{W}}\{\lambda^{\star}d(w,\hat{w}^{(i)})^{p}-\alpha v(f(\bm{x},\pi(\bm{x}),w))\}$ for all $i\neq i_{0}$ . Here, $\lambda^{\star}$ is a dual minimizer, which must exist when the worst-case distribution exists [5, Corollary 1].

It is worth mentioning that Kantorovich duality and DP play a critical role in obtaining all the results in this section. Based on the reformulation results and analytical properties of DR-control problems, we demonstrate their utility in the following sections.

4 Out-of-Sample Performance Guarantee

A potential defect of the SAA-control formulation (2.2) is that its optimal policy may not perform well if a testing dataset of $w_{t}$ is different from the training dataset $\{\hat{w}^{(1)},\ldots,\hat{w}^{(N)}\}$ . This issue occurs even when the testing and training datasets are sampled from the same distribution. Such a degradation of the optimal decisions in out-of-sample tests is often called the optimizer’s curse in the decision analysis literature [35]. We show that an optimal distributionally robust policy can alleviate this issue and provide a guaranteed out-of-sample performance if the radius $\theta$ of the Wasserstein ambiguity set is carefully determined.

Let ${\pi}_{\hat{w}}^{\star}\in\Pi$ denote an optimal distributionally robust policy obtained by using the training dataset $\hat{w}:=\{\hat{w}^{(1)},\ldots,\hat{w}^{(N)}\}$ of $N$ samples. The out-of-sample performance of $\pi_{\hat{w}}^{\star}$ is measured as

[TABLE]

which represents the expected total cost under a new sample that is generated (according to $\mu$ ) independent of the training dataset. Unfortunately, the out-of-sample performance cannot be precisely computed because the true distribution $\mu$ is unknown. Thus, instead, we aim at establishing a probabilistic out-of-sample performance guarantee of the form:

[TABLE]

where $v^{\star}_{\hat{w}}$ denotes the optimal value function of the DR-control problem with the training dataset $\hat{w}:=\{\hat{w}^{(1)},\ldots,\hat{w}^{(N)}\}$ , and $\beta\in(0,1)$ . (Here, $\hat{w}$ , $\pi_{\hat{w}}^{\star}$ and $v_{\hat{w}}^{\star}$ are viewed as random objects.) The inequality represents a lower bound $(1-\beta)$ on the probability that the expected cost incurred by $\pi_{\hat{w}}^{\star}$ is no greater than the optimal value function. Note that the probability and the expected cost are evaluated with respect to the true distribution $\mu$ . Thus, this inequality provides a probabilistic bound on the performance of $\pi_{\hat{w}}^{\star}$ evaluated with unseen test samples drawn from $\mu$ . Here, $v_{\hat{w}}^{\star}$ , which depends on $\theta$ , plays the role of a certificate for the out-of-sample performance.

Our goal is to identify conditions on the radius $\theta$ under which an optimal distributionally robust policy provides the probabilistic performance guarantee. We begin by imposing the following assumption on the true distribution $\mu$ :

Assumption 2 (Light tail).

There exists a positive constant $q>p$ such that

[TABLE]

This assumption implies that the tail of $\mu$ decays exponentially. Under this condition, the following measure concentration inequality holds:

Theorem 2 (Measure concentration, Theorem 2 in [36]).

Suppose that Assumption 2 holds. Let

[TABLE]

Then,

[TABLE]

where

[TABLE]

and

[TABLE]

Here, $c_{1},c_{2}$ are positive constants depending only on $l$ , $q$ and $\rho$ .

This theorem provides an upper bound on the probability that the true distribution $\mu$ lies outside of the Wasserstein ambiguity set $\mathcal{D}$ . The measure concentration inequality provides a systematic means to determine the radius for $\mathcal{D}$ to contain the true distribution $\mu$ with probability no less than $(1-\beta)$ . As shown in the following theorem, the contraction property of Bellman operators enables us to extend the single-stage out-of-sample performance guarantee to its multi-stage counterpart with no additional requirement on $\theta$ .

Theorem 3 (Out-of-sample performance guarantee).

Suppose that Assumptions 1 and 2 hold. Let $\pi_{\hat{w}}^{\star}$ and $v_{\hat{w}}^{\star}$ denote an optimal policy and the optimal value function of the DR-control problem (2.6) with the training dataset $\hat{w}:=\{\hat{w}^{(1)},\ldots,\hat{w}^{(N)}\}$ and the following Wasserstein ball radius (this choice includes the radius proposed in [3] in the single-stage setting as a special case, when $p=1$ and $l\neq 2$ ):

[TABLE]

where $\bar{\theta}$ satisfies $\frac{\bar{\theta}}{\log(2+1/\bar{\theta})}=[\frac{1}{Nc_{2}}\log(\frac{c_{1}}{\beta})]^{{1/2}}$ , and $c_{1},c_{2}$ are the positive constants in Theorem 2. Then, the probabilistic out-of-sample performance guarantee (4.2) holds. (The constants $c_{1}$ and $c_{2}$ in Theorem 2 can be calculated using the proof of Theorem 2 in [36]. However, this calculation is often conservative and thus results in a larger radius $\theta(N,\beta)$ than necessary. Bootstrapping and cross-validation methods can be used to reduce the conservativeness in the a priori bound $\theta(N,\beta)$ , as advocated and demonstrated in [3].)

Proof.

Using Theorem 2, we can confirm that our choice of $\theta$ provides the following probabilistic guarantee:

[TABLE]

Define an operator $T^{\star}:\mathbb{B}_{\xi}(\mathcal{X})\to\mathbb{B}_{\xi}(\mathcal{X})$ as $(T^{\star}v)(\bm{x}):=\mathbb{E}_{\mu}[c(\bm{x},\pi_{\hat{w}}^{\star}(\bm{x}))+\alpha v(f(\bm{x},\pi_{\hat{w}}^{\star}(\bm{x}),w))]$ for all $\bm{x}\in\mathcal{X}$ . It follows from (4.3) that the following single-stage guarantee holds:

[TABLE]

given any fixed $\bm{x}\in\mathcal{X}$ . It is straightforward to check under Assumption 1 that $T^{\star}$ is a monotone contraction mapping.

We now show that if $\mu\in\mathcal{D}$ , then $(T^{\star})^{k}{v_{\hat{w}}^{\star}}\leq{v_{\hat{w}}^{\star}}$ for any $k=1,2,\ldots$ using mathematical induction. For $k=1$ , we have $T^{\star}{v_{\hat{w}}^{\star}}\leq T{v_{\hat{w}}^{\star}}={v_{\hat{w}}^{\star}}$ by the minimax definition of $T$ . Suppose now that the induction hypothesis holds for some $k$ . By the monotonicity of $T^{\star}$ and the definition of $T$ , we have

[TABLE]

and thus the induction hypothesis is valid for $k+1$ .

We now notice that

[TABLE]

since $T^{\star}$ is a contraction mapping under Assumption 1. Therefore, if $\mu\in\mathcal{D}$ , then

[TABLE]

By (4.3), the probabilistic performance guarantee holds as desired. ∎

Remark 1.

Note that the contraction property of $T$ and $T^{\star}$ plays a critical role in connecting the single-stage performance guarantee (4.4) to the multi-stage guarantee (4.2) in a way that is independent of the number of stages. This is a quite powerful result, because if we have a radius $\theta$ that provides a desirable confidence level $(1-\beta)$ in the single-stage guarantee, we can use the same radius to achieve the same level of confidence in the multi-stage guarantee with no additional requirement.
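The quantity $\bar{\theta}$ in Theorem 3 is defined only implicitly, through $\frac{\bar{\theta}}{\log(2+1/\bar{\theta})}=[\frac{1}{Nc_{2}}\log(\frac{c_{1}}{\beta})]^{1/2}$. Since the left-hand side is strictly increasing in $\bar{\theta}>0$, a bisection search recovers it. The sketch below assumes that $c_{1}$ and $c_{2}$ are supplied by the user (their values are not given in the text, so any numbers passed in are placeholders), and the function name `solve_theta_bar` is hypothetical.

```python
import math


def solve_theta_bar(n_samples, beta, c1, c2, iters=200):
    """Solve theta / log(2 + 1/theta) = sqrt(log(c1/beta) / (n_samples * c2)).

    c1, c2 are the concentration constants, treated here as given inputs.
    Requires c1 > beta so that log(c1/beta) is positive.
    """
    if not 0 < beta < 1:
        raise ValueError("beta must lie in (0, 1)")
    if c1 <= beta or c2 <= 0 or n_samples <= 0:
        raise ValueError("need c1 > beta, c2 > 0 and n_samples > 0")
    target = math.sqrt(math.log(c1 / beta) / (n_samples * c2))

    def h(t):
        # Strictly increasing on (0, inf): numerator grows, denominator shrinks.
        return t / math.log(2.0 + 1.0 / t)

    lo, hi = 0.0, 1.0
    while h(hi) < target:   # expand the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):  # bisection; h is never evaluated at 0
        mid = 0.5 * (lo + hi)
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi
```

For fixed $\beta$, $c_{1}$ and $c_{2}$, the returned value shrinks as the number of samples grows, in line with the role of $N$ in the concentration bound.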

5 Wasserstein Penalty Problem

We now consider a slightly different version of the DR-control problem, which can be considered as a relaxation of (2.6) with a fixed penalty parameter $\lambda>0$ :

[TABLE]

where the strategy space $\Gamma^{\prime}:=\{\gamma:=(\gamma_{0},\gamma_{1},\ldots)\>|$ $\gamma_{t}(\mathcal{P}(\mathcal{W})|h_{t}^{e})=1\;\forall h_{t}^{e}\in H_{t}^{e}\}$ of Player II no longer depends on a Wasserstein ambiguity set. Instead of using an explicit ambiguity set $\mathcal{D}$ , Player II is penalized by $\lambda W_{p}(\mu_{t},\nu_{N})^{p}$ , which can be interpreted as the cost of perturbing the empirical distribution $\nu_{N}$ .

5.1 Dynamic Programming

Under Assumption 1, the Bellman operator ${T}^{\prime}_{\lambda}:\mathbb{B}_{\xi}(\mathcal{X})\to\mathbb{B}_{\xi}(\mathcal{X})$ of the Wasserstein penalty problem is defined by

[TABLE]

for all $\bm{x}\in\mathcal{X}$ . By using the strong duality result [5, Theorem 1], we have the following equivalent form of $T_{\lambda}^{\prime}$ :

Proposition 6.

Suppose that the function $w\mapsto v(f(\bm{x},\bm{u},w))$ lies in $L^{1}(\mathrm{d}\nu_{N})$ for each $(\bm{x},\bm{u})\in\mathbb{K}$ . Then, the Bellman operator ${T}^{\prime}_{\lambda}$ can be expressed as

[TABLE]

for all $\bm{x}\in\mathcal{X}$ . Furthermore, we have

[TABLE]

By the results of [19] in the general minimax control setting, the optimal value function $v^{\prime}$ is the unique fixed point (in $\mathbb{B}_{lsc}(\mathcal{X})$ ) of $T_{\lambda}^{\prime}$ under Assumption 1 because $T_{\lambda}^{\prime}$ is a contraction. We can use value iteration to evaluate $v^{\prime}$ due to the Banach fixed point theorem. Analogous to Theorem 1, there exists a deterministic stationary policy $\pi^{\prime}$ , which is optimal, where $\pi^{\prime}(\bm{x})\in\operatorname*{arg\,min}_{\bm{u}\in\mathcal{U}(\bm{x})}[c(\bm{x},\bm{u})+\frac{1}{N}\sum_{i=1}^{N}\sup_{w^{\prime}\in\mathcal{W}}[\alpha v^{\prime}(f(\bm{x},\bm{u},w^{\prime}))-\lambda d(\hat{w}^{(i)},w^{\prime})^{p}]]$ for all $\bm{x}\in\mathcal{X}$ , under Assumption 1.

5.2 Linear-Quadratic Problem

We now develop a solution approach, using a Riccati-type equation, to linear-quadratic (LQ) problems with the Wasserstein penalty when

[TABLE]

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^{l}$ . Consider a linear system of the form

[TABLE]

where $A\in\mathbb{R}^{n\times n}$ , $B\in\mathbb{R}^{n\times m}$ , and $\Xi\in\mathbb{R}^{n\times l}$ . We also choose the following quadratic stage-wise cost function:

[TABLE]

where $Q=Q^{\top}\in\mathbb{R}^{n\times n}$ is positive semidefinite, and $R=R^{\top}\in\mathbb{R}^{m\times m}$ is positive definite. For the sake of simplicity, we assume that $\mathbb{E}_{w\sim\nu_{N}}[w]=\frac{1}{N}\sum_{i=1}^{N}\hat{w}^{(i)}=0$ . The case of non-zero mean is considered in Appendix B. Let $\Sigma:=\mathbb{E}_{w\sim\nu_{N}}[ww^{\top}]=\frac{1}{N}\sum_{i=1}^{N}\hat{w}^{(i)}(\hat{w}^{(i)})^{\top}$ . In the LQ setting, we also set $\mathcal{X}:=\mathbb{R}^{n}$ , $\mathcal{U}(\bm{x})\equiv\mathcal{U}:=\mathbb{R}^{m}$ , and $\mathcal{W}:=\mathbb{R}^{l}$ . Note that, unlike the standard LQG, the LQ problems with Wasserstein penalty do not assume that the probability distribution of random disturbances is Gaussian. In fact, the main motivation of this distributionally robust LQ formulation is to relax the assumption of Gaussian disturbance distributions in LQG, and to obtain a useful control policy when the true distribution deviates from a Gaussian distribution.

By using DP, we obtain the following explicit solution of the LQ problem:

Theorem 4.

Suppose that there exists a symmetric positive semidefinite matrix $P\in\mathbb{R}^{n\times n}$ that solves the following equation:

[TABLE]

with

[TABLE]

for a sufficiently large $\lambda$ . Then, ${v}^{\prime}(\bm{x}):=\bm{x}^{\top}P\bm{x}+z$ solves the Bellman equation, where $z:=\frac{\lambda}{1-\alpha}\mbox{tr}[\{\lambda(\lambda I-\alpha\Xi^{\top}P\Xi)^{-1}-I\}\Sigma]$ . If, in addition, $v^{\prime}$ is the optimal value function (sufficient conditions for $v^{\prime}$ to be the optimal value function are provided in [37]; under the stabilizability and observability conditions, the algebraic Riccati equation has a unique positive semidefinite solution as well), then the unique optimal policy ${\pi^{\prime}}$ is given by

[TABLE]

where

[TABLE]

Furthermore, if we let

[TABLE]

the deterministic stationary policy $\gamma^{\prime}\in\Gamma^{\prime}$ , defined as

[TABLE]

is an optimal policy for Player II that generates a worst-case distribution for each $\bm{x}\in\mathbb{R}^{n}$ .

The proof is contained in Appendix B. We first note that an optimal distributionally robust policy is linear in the system state. Furthermore, the control gain matrix $K$ is independent of the covariance matrix $\Sigma$ as in standard LQG. The support elements $w_{\bm{x}}^{\prime(i)}$ of the worst-case distribution are affine in the system state. More specifically, $w_{\bm{x}}^{\prime(i)}$ is obtained by scaling the $i$ th data sample $\hat{w}^{(i)}\in\mathbb{R}^{l}$ by the factor $(\lambda I-\alpha\Xi^{\top}P\Xi)^{-1}\lambda$ and shifting it by the vector $(\lambda I-\alpha\Xi^{\top}P\Xi)^{-1}\alpha\Xi^{\top}P(A+BK)\bm{x}$ , which is linear in the system state. Distributional robustness is controlled by the penalty parameter $\lambda$ : as $\lambda$ increases, the permissible deviation of $\mu_{t}$ from $\nu_{N}$ decreases. This plays the same role as decreasing the Wasserstein ball radius $\theta$ in the original DR-control setting. Thus, by letting $\lambda$ tend to $+\infty$ , the optimal distributionally robust policy for the LQ problem converges pointwise to the standard LQ optimal control policy.
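The scaling-and-shifting description of the worst-case support points can be checked numerically in the scalar case $n=m=l=1$, where the formula reads $w^{\prime(i)}_{x}=\big(\lambda\hat{w}^{(i)}+\alpha\Xi P(A+BK)x\big)/(\lambda-\alpha\Xi^{2}P)$. The sketch below takes $P$ and $K$ as given inputs rather than solving the Riccati-type equation, so the numerical values used with it are placeholders, and the function name `worst_case_atoms` is hypothetical.

```python
def worst_case_atoms(x, samples, P, K, A, B, Xi, lam, alpha):
    """Scalar worst-case support points for the Wasserstein penalty LQ problem.

    Each atom maximizes  alpha*P*((A + B*K)*x + Xi*w)**2 - lam*(w - w_hat)**2
    over w, which is strictly concave when lam > alpha * Xi**2 * P.
    P and K are assumed to be given (they are not computed here).
    """
    denom = lam - alpha * Xi * P * Xi
    if denom <= 0:
        raise ValueError("lam must exceed alpha * Xi^2 * P")
    # Shift term: linear in the state x.
    shift = alpha * Xi * P * (A + B * K) * x / denom
    # Scale each data sample by lam / denom, then add the shift.
    return [lam * w_hat / denom + shift for w_hat in samples]
```

As $\lambda$ grows, the scale factor tends to one and the shift tends to zero, so the atoms approach the data samples, which is the scalar counterpart of the limiting behavior described above.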

Proposition 7.

Suppose that $(A,B)$ is stabilizable and $(A,C)$ is observable, where $Q=C^{\top}C$ . Let $\bar{P}$ be the unique symmetric positive definite solution of the following discrete algebraic Riccati equation:

[TABLE]

and let

[TABLE]

Then, for each $\bm{x}\in\mathcal{X}$

[TABLE]

as $\lambda\to\infty$ , where $\pi^{\prime}$ and $w_{\bm{x}}^{\prime}$ are defined in Theorem 4.

Proof.

Let $P_{\lambda}$ denote a symmetric positive semidefinite solution of (5.3) given any fixed $\lambda\geq\bar{\lambda}$ . As $\lambda$ tends to $+\infty$ , the right-hand side of (5.3) tends to $Q+\alpha A^{\top}P_{\lambda}A-\alpha^{2}A^{\top}P_{\lambda}B(R+\alpha B^{\top}P_{\lambda}B)^{-1}B^{\top}P_{\lambda}A$ , which corresponds to the right-hand side of (5.4) with $\bar{P}=P_{\lambda}$ . Therefore, $P_{\lambda}$ solves the algebraic Riccati equation (5.4) as $\lambda\to\infty$ . On the other hand, (5.4) admits a unique positive definite solution when $(A,C)$ is observable and $(A,B)$ is stabilizable (e.g., [38, Section 2.4]). Thus, $P_{\lambda}$ converges to $\bar{P}$ as $\lambda\to\infty$ . Likewise, we can show that the feedback gain matrix $K$ and the worst-case distribution’s support element $w_{\bm{x}}^{\prime(i)}$ (defined in Theorem 4) tend to $\bar{K}$ and $\hat{w}^{(i)}$ , respectively, as $\lambda\to\infty$ . Therefore, the result follows. ∎

6 Numerical Experiments

6.1 Investment-Consumption Problem

We first demonstrate the performance and utility of DR-control through an investment-consumption problem (e.g., [39, 40]). Let $x_{t}$ be the wealth of an investor at stage $t$ . The investor wishes to decide the amount $u_{1,t}$ to be invested in a risky asset (with an i.i.d. random rate of return, $w_{t}$ ) and the amount $u_{2,t}$ to be consumed at stage $t$ . The remaining amount $(x_{t}-u_{1,t}-u_{2,t})$ is automatically re-invested into a riskless asset with a deterministic rate of return, $\eta$ . Then, the investor’s wealth evolves as

[TABLE]

We assume that the control actions $u_{1,t}$ and $u_{2,t}$ satisfy the following constraints:

[TABLE]

i.e., $\mathcal{U}(\bm{x}):=\{\bm{u}:=(\bm{u}_{1},\bm{u}_{2})\in\mathbb{R}^{2}\mid\bm{u}_{1}+\bm{u}_{2}\leq\bm{x},\bm{u}\geq 0\}$ .

The cost function is given by the following negative expected utility from consumption:

[TABLE]

where the utility function $U:\mathbb{R}\to\mathbb{R}$ is selected as $U(c)=c-\zeta c^{2}$ . The following parameters are used in the numerical simulations: $\zeta=0.25$ , $\alpha=0.9$ , $\eta=1.02$ , and $p=1$ . The data samples $\{\hat{w}^{(1)},\ldots,\hat{w}^{(N)}\}$ of $w_{t}$ are generated according to the normal distribution $\mathcal{N}(1.08,0.1^{2})$ . We numerically approximate the optimal value function $v^{\star}_{\hat{w}}$ and the corresponding optimal policy $\pi^{\star}_{\hat{w}}$ on a computational grid by using the convex optimization approach in [41]. This method approximates the Bellman operator by the optimal value of a convex program with a uniform convergence property. Furthermore, it does not require any explicit interpolation in evaluating the value function and control policies at some state other than the grid points, by using an auxiliary optimization variable to assign the contribution of each grid point to the next state.
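The model described in this subsection can be simulated with a few lines of code. The sketch below uses the parameter values quoted above and assumes, following the verbal description, that wealth evolves as $x_{t+1}=\eta(x_{t}-u_{1,t}-u_{2,t})+w_{t}u_{1,t}$ and that the stage cost is $-U(u_{2,t})$. The fixed-fraction policy and the function names are hypothetical and serve only to exercise the dynamics; they are not the optimal distributionally robust policy.

```python
import random

ZETA, ALPHA, ETA = 0.25, 0.9, 1.02  # parameter values quoted in the text


def utility(c):
    return c - ZETA * c * c


def stage_cost(x, u1, u2):
    # Negative utility from consumption u2.
    return -utility(u2)


def next_wealth(x, u1, u2, w):
    """Wealth update: riskless part earns ETA, risky part earns w."""
    if u1 < 0 or u2 < 0 or u1 + u2 > x + 1e-12:
        raise ValueError("controls must satisfy u >= 0 and u1 + u2 <= x")
    return ETA * (x - u1 - u2) + w * u1


def discounted_cost(policy, x0, returns):
    """Discounted cost of `policy` along one finite path of returns."""
    x, total = x0, 0.0
    for t, w in enumerate(returns):
        u1, u2 = policy(x)
        total += ALPHA ** t * stage_cost(x, u1, u2)
        x = next_wealth(x, u1, u2, w)
    return total


# Hypothetical policy: invest half of the wealth, consume a tenth of it.
fixed_fraction = lambda x: (0.5 * x, 0.1 * x)

# Monte Carlo estimate over a truncated horizon with returns ~ N(1.08, 0.1^2).
rng = random.Random(0)
paths = [[rng.gauss(1.08, 0.1) for _ in range(50)] for _ in range(200)]
avg_cost = sum(discounted_cost(fixed_fraction, 1.0, p) for p in paths) / len(paths)
```

Replacing `fixed_fraction` with a policy computed on a grid would give a rough out-of-sample cost estimate of the kind discussed in the experiments.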

The numerical experiments were conducted on a Mac with a 4.2 GHz Intel Core i7 and 64 GB of RAM. The computation time required for simulations with different grid sizes and $N=10$ is reported in Table 1. For the rest of the simulations, we used 71 states (with grid spacing 0.02).

6.1.1 Out-of-sample performance guarantee

To demonstrate the out-of-sample performance guarantee of an optimal distributionally robust policy, we compute the following reliability of $\pi_{\hat{w}}^{\star}$ :

[TABLE]

which represents the probability that the expected cost incurred by $\pi_{\hat{w}}^{\star}$ under the true distribution $\mu$ is no greater than $v_{\hat{w}}^{\star}(\bm{x})$ . As shown in Fig. 1 (a), the reliability increases with the Wasserstein ball radius $\theta$ and the number $N$ of samples. This result is consistent with Theorem 3. Our numerical experiments also confirm that the same radius $\theta$ can be used to achieve the same level of reliability in both single-stage and multi-stage settings as indicated in the theorem.

Fig. 1 (b) illustrates the out-of-sample cost (4.1) of $\pi_{\hat{w}}^{\star}$ with respect to $\theta$ and $N$ . Interestingly, the out-of-sample cost does not monotonically decrease with $\theta$ (this observation is consistent with the single-stage case in Section 7.2 of [3]). For a too-small radius, the resulting DR-policy is not sufficiently robust to obtain the best out-of-sample performance (i.e., the least out-of-sample cost). On the other hand, if a too-large Wasserstein ambiguity set is selected, the resulting DR-policy is overly conservative and thus sacrifices the closed-loop performance. Thus, there exists an optimal radius (e.g., $0.02$ in the case of $N=20$ ) that provides the best out-of-sample performance.

6.1.2 Comparison to SAA

To compare DR-control (2.6) with SAA-control (2.2), we first compute the out-of-sample performance of $\pi_{\hat{w}}^{\star}$ and that of the corresponding optimal SAA policy $\pi_{\hat{w}}^{\tiny\mbox{SAA}}$ obtained by using the same training dataset $\hat{w}$ . The radius is selected as the one that provides the best out-of-sample performance. As shown in Fig. 2, the proposed DR-policy achieves 8% lower out-of-sample cost than the SAA-policy when $N=10$ . As expected, the gap between the two decreases with the number of samples. Note that, unlike the corresponding SAA-policy, the proposed DR-policy maintains its performance on a test dataset generated independently of the training dataset, even when it is designed with a small number of samples ( $N=10$ ).

6.2 Power System Frequency Control Problem

Consider an electric power transmission system with $N$ buses (and $\bar{n}$ generator buses). This system may be subject to ambiguous uncertainty generated from variable renewable energy sources such as wind and solar. For the frequency regulation of this system, we use the proposed Wasserstein penalty method to control the mechanical power input of each generator. Let $\bm{\theta}_{i}$ and $P_{m,i}$ be the voltage angle (in radians) and the mechanical power input (in per unit), respectively, at generator bus $i$ . The swing equation of this system is then given by

[TABLE]

where $M_{i}$ and $D_{i}$ denote the inertia coefficient (in pu $\cdot$ sec$^{2}$/rad) and the damping coefficient (in pu $\cdot$ sec/rad) of the generator at bus $i$ . Here, $P_{e,i}$ is the electrical active power injection (in per unit) at bus $i$ and is given by $P_{e,i}:=\sum_{j=1}^{N}|V_{i}||V_{j}|(G_{ij}\cos({\bm{\theta}}_{i}-{\bm{\theta}}_{j})+B_{ij}\sin({\bm{\theta}}_{i}-{\bm{\theta}}_{j}))$ , where $G_{ij}$ and $B_{ij}$ are the conductance and susceptance of the transmission line connecting buses $i$ and $j$ , respectively, and $V_{i}$ is the voltage at bus $i$ . Assuming that all the voltage magnitudes are $1$ per unit, the angle differences $|\bm{\theta}_{i}-\bm{\theta}_{j}|$ are small, and all the transmission lines are (almost) lossless, the AC power flow equation can be approximated by the following linearized DC power flow equation:

[TABLE]

where $P_{e}:=(P_{e,1},\ldots,P_{e,\bar{n}})$ , $\bm{\theta}:=(\bm{\theta}_{1},\ldots,\bm{\theta}_{\bar{n}})$ , and $L\in\mathbb{R}^{\bar{n}\times\bar{n}}$ is the Kron-reduced Laplacian matrix of this power network. The Kron reduction is used to express the system in the reduced dimension $\bar{n}$ by focusing on the interactions of the generator buses [42]. More precisely, we can obtain the Kron-reduced admittance matrix $Y^{\mbox{\tiny Kron}}$ , by eliminating nongenerator bus $k$ , as $Y_{ij}^{\mbox{\tiny Kron}}:=Y_{ij}-Y_{ik}Y_{kj}/Y_{kk}$ for all $i,j=1,\ldots,N$ such that $i,j\neq k$ . The Kron-reduced Laplacian can then be obtained by setting $L_{ii}:=\sum_{k=1,\ldots,\bar{n}:k\neq i}B_{ik}^{\mbox{\tiny Kron}}$ and $L_{ij}:=-B_{ij}^{\mbox{\tiny Kron}}$ for $i\neq j$ , where $B^{\mbox{\tiny Kron}}$ denotes the susceptance of the Kron-reduced admittance matrix [43].
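The one-bus elimination formula and the Laplacian construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names `kron_reduce` and `kron_laplacian` and the 0-based bus indexing are assumptions, and the sketch presumes that every pivot $Y_{kk}$ encountered during elimination is nonzero.

```python
import numpy as np

def kron_reduce(Y, eliminate):
    """Eliminate the listed (0-based) buses one at a time using
    Y_ij <- Y_ij - Y_ik * Y_kj / Y_kk."""
    Y = np.array(Y, dtype=complex)
    labels = list(range(Y.shape[0]))
    for k in sorted(eliminate, reverse=True):
        pos = labels.index(k)
        keep = [p for p in range(len(labels)) if p != pos]
        Y = Y[np.ix_(keep, keep)] - np.outer(Y[keep, pos], Y[pos, keep]) / Y[pos, pos]
        labels.pop(pos)
    return Y

def kron_laplacian(Y_kron):
    """Build L from the susceptance part B of the reduced admittance matrix:
    L_ij = -B_ij for i != j and L_ii = sum over k != i of B_ik."""
    B_kron = np.imag(Y_kron)
    L = -B_kron.copy()
    np.fill_diagonal(L, 0.0)
    np.fill_diagonal(L, -L.sum(axis=1))
    return L
```

For instance, two generator buses that are each connected to one load bus by a lossless line of unit susceptance reduce to a single equivalent line of susceptance $0.5$ between the generators, and each row of the resulting Laplacian sums to zero.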

Let $x(t):=(\bm{\theta}(t)^{\top},\dot{\bm{\theta}}(t)^{\top})^{\top}\in\mathbb{R}^{2\bar{n}}$ and $u(t):=P_{m}(t)\in\mathbb{R}^{\bar{n}}$ . By combining (6.1) and (6.2), we obtain the following state-space model of the power system (e.g., [44]):

[TABLE]

where $M:=\mbox{diag}(M_{1},\ldots,M_{\bar{n}})$ and $D:=\mbox{diag}(D_{1},\ldots,D_{\bar{n}})$ . We discretize this system using zero-order hold on the input and a sampling time of $0.1$ seconds to obtain the matrices $A$ and $B$ of the following discrete-time system model (5.1):

[TABLE]

where $w_{i,t}$ is the random disturbance (in per unit) at bus $i$ at stage $t$ . It can model uncertain power injections generated by solar or wind energy sources.
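The discretization step can be illustrated with a short NumPy sketch. It assumes the standard linearized swing model $M\ddot{\bm{\theta}}+D\dot{\bm{\theta}}=P_{m}-L\bm{\theta}$ with state $(\bm{\theta},\dot{\bm{\theta}})$ and input $P_{m}$ ; this form, the helper names, and the Taylor-series matrix exponential are assumptions for illustration, not the code used for the reported simulations. Zero-order hold is computed by exponentiating the augmented matrix built from the continuous-time pair.

```python
import numpy as np

def expm_taylor(M, terms=25):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    norm = np.linalg.norm(M, 1)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    X = M / 2.0**s
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ X / k
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

def swing_state_space(inertia, damping, L):
    """Continuous-time (Ac, Bc) for the assumed model M*th'' + D*th' = P_m - L*th."""
    n = len(inertia)
    M_inv = np.diag(1.0 / np.asarray(inertia, dtype=float))
    D = np.diag(np.asarray(damping, dtype=float))
    Ac = np.block([[np.zeros((n, n)), np.eye(n)], [-M_inv @ L, -M_inv @ D]])
    Bc = np.vstack([np.zeros((n, n)), M_inv])
    return Ac, Bc

def zoh_discretize(Ac, Bc, dt):
    """Zero-order-hold discretization: exp([[Ac, Bc], [0, 0]] * dt) = [[A, B], [0, I]]."""
    n, m = Bc.shape
    aug = np.zeros((n + m, n + m))
    aug[:n, :n] = Ac
    aug[:n, n:] = Bc
    E = expm_taylor(aug * dt)
    return E[:n, :n], E[:n, n:]
```

With `dt = 0.1`, `zoh_discretize(*swing_state_space(inertia, damping, L), 0.1)` returns a pair of matrices playing the role of $A$ and $B$ above. A library routine such as a dedicated matrix-exponential implementation would normally be preferable to the simple series used here.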

The state-dependent portion of the quadratic cost function (5.2) is chosen as

[TABLE]

where $\mathbb{1}$ denotes the $\bar{n}$ -dimensional vector of all ones, the first term measures the deviation of rotor angles from their average $\bar{\bm{\theta}}:=\mathbb{1}^{\top}\bm{\theta}/\bar{n}$ , and the second term corresponds to the kinetic energy stored in the electro-mechanical generators [45]. The matrix $R$ is chosen to be the $\bar{n}$ by $\bar{n}$ identity matrix.

The IEEE 39-bus New England test case (with 10 generator buses, 29 load buses, and 40 transmission lines) is used to demonstrate the performance of the proposed LQ control $\pi_{\hat{w}}^{\prime}$ with Wasserstein penalty. The initial values of voltage angles $\bm{\theta}(0)$ are determined by solving the (steady-state) power flow problem using MATPOWER [46]. The initial frequency is set to be zero for all buses except bus 1 at which $\dot{\bm{\theta}}_{1}(0):=0.1$ per unit. We use $\alpha=0.9$ in all simulations.

6.2.1 Worst-case distribution policy

We first compare the standard LQG control policy $\pi_{\hat{w}}^{\mathrm{\tiny LQG}}$ and the proposed DR-control policy $\pi_{\hat{w}}^{\prime}$ with the Wasserstein penalty under the worst-case distribution policy $\gamma_{\hat{w}}^{\prime}$ obtained by using the proof of Theorem 4. We set $N=10$ and $\lambda=0.03$ . The i.i.d. samples $\{\hat{w}^{(i)}\}_{i=1}^{N}$ are generated according to the normal distribution $\mathcal{N}(0,0.1^{2}I)$ . As depicted in Fig. 3, $\pi_{\hat{w}}^{\prime}$ is less sensitive than $\pi_{\hat{w}}^{\mathrm{\tiny LQG}}$ to the worst-case distribution policy; the frequency deviation at other buses displays a similar behavior. (In each box plot, the central bar indicates the median; the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively; and the ‘+’ symbol represents the outliers.) In the $[0,24]$ (seconds) interval, the frequency controlled by $\pi_{\hat{w}}^{\mathrm{\tiny LQG}}$ fluctuates around non-zero values while $\pi_{\hat{w}}^{\prime}$ maintains the frequency fluctuation centered approximately around zero. This is because the proposed DR-method takes into account the possibility of nonzero-mean disturbances, while the standard LQG method assumes zero-mean disturbances. Furthermore, the proposed DR-method suppresses the frequency fluctuation much faster than the standard LQG method: Under $\pi_{\hat{w}}^{\prime}$ , the mean frequency deviation (averaged across the buses) is less than 1% for any time after 16.7 seconds. On the other hand, if the standard LQG control is used, it takes 41.8 seconds to bring the mean frequency deviation (averaged across the buses) below 1%. The detailed results for each bus are reported in Table 2.

6.2.2 Out-of-sample performance guarantee

We now examine the out-of-sample performance of $\pi_{\hat{w}}^{\prime}$ and how it depends on the penalty parameter $\lambda$ and the number $N$ of samples. The i.i.d. samples $\{\hat{w}^{(i)}\}_{i=1}^{N}$ are generated according to the normal distribution $\mathcal{N}(0,I)$ . Given $\lambda$ and $N$ , we define the reliability of $\pi_{\hat{w}}^{\prime}$ as

[TABLE]

As shown in Fig. 4, the reliability decreases with $\lambda$ . This is because when using larger $\lambda$ , the control policy $\pi_{\hat{w}}^{\prime}$ becomes less robust against the deviation of the empirical distribution from the true distribution. Increasing $\lambda$ has the effect of decreasing the radius $\theta$ in DR-control. In addition, the reliability tends to increase as the number $N$ of samples used to design $\pi_{\hat{w}}^{\prime}$ increases. This result is consistent with the dependency of the DR-control reliability on the number of samples. By using this result, we can determine the penalty parameter to attain a desired out-of-sample performance guarantee (or reliability), given the number of samples.

7 Conclusions

In this paper, we considered distributionally robust stochastic control problems with Wasserstein ambiguity sets by directly using the data samples of uncertain variables. We showed that the proposed framework has several salient features, including $(i)$ computational tractability with error bounds, $(ii)$ an out-of-sample performance guarantee, and $(iii)$ an explicit solution in the LQ setting. It is worth emphasizing that the Kantorovich duality principle plays a critical role in our DP solution and analysis. Furthermore, with regard to the out-of-sample performance guarantee, our analysis provides the unique insight that the contraction property of the Bellman operators extends a single-stage guarantee—obtained using a measure concentration inequality—to the corresponding multi-stage guarantee without any degradation in the confidence level.

Appendix A Proof of Lemma 1

Proof.

Recall that using the Kantorovich duality principle, the Wasserstein distance between ${\mu}$ and $\nu$ can be written as

[TABLE]

where $\Phi:=\{(\varphi,\psi)\in L^{1}(\mathrm{d}\mu)\times L^{1}(\mathrm{d}\nu_{N})\mid\varphi(w)+\psi(w^{\prime})\leq d(w,w^{\prime})^{p}\;\forall w,w^{\prime}\in\mathcal{W}\}$ . Let

[TABLE]

We claim that $\hat{\mathcal{D}}=\mathcal{D}$ . Choose an arbitrary ${\mu}$ from $\hat{\mathcal{D}}$ . Note that for any $(\varphi,\psi)\in\Phi_{d}$ ,

[TABLE]

Thus, we have

[TABLE]

where the last inequality holds because ${\mu}\in\hat{\mathcal{D}}$ . Therefore, ${\mu}\in\mathcal{D}$ , which implies that $\hat{\mathcal{D}}\subseteq\mathcal{D}$ .

We now select an arbitrary ${\mu}$ from $\mathcal{D}$ . Fix $\varphi\in L^{1}(\mathrm{d}{\mu})$ and define a function $\hat{\psi}:\mathcal{W}\to\mathbb{R}$ by

[TABLE]

Then, $\hat{\psi}\in L^{1}(\mathrm{d}{\mu})$ and $(\varphi,\hat{\psi})\in\Phi$ . Thus,

[TABLE]

which holds for any $\varphi\in L^{1}(\mathrm{d}{\mu})$ . By the definition of $\hat{\psi}$ , this implies that ${\mu}\in\hat{\mathcal{D}}$ . Therefore, $\mathcal{D}\subseteq\hat{\mathcal{D}}$ . ∎

Appendix B Linear-Quadratic Problems

Proof of Theorem 4.

Let $v^{\prime}:\mathbb{R}^{n}\to\mathbb{R}$ be defined as $v^{\prime}(\bm{x}):=\bm{x}^{\top}P\bm{x}+z$ . To compute ${T}_{\lambda}^{\prime}v^{\prime}$ , we first calculate the inner maximization part in Proposition 6 as follows:

[TABLE]

There exists a constant $\bar{\lambda}>0$ (depending on $P$ ) such that for any $\lambda\geq\bar{\lambda}$ , the objective function of the maximization problem above is strictly concave in $w^{\prime}$ (i.e., $\lambda I-\alpha\Xi^{\top}P\Xi$ is positive definite), and thus the unique maximizer is given by

[TABLE]

With this maximizer, we can rewrite the term $\phi(\bm{u},w)$ as

[TABLE]

Since $\mathbb{E}_{w\sim\nu_{N}}[w]=0$ and $\mathbb{E}_{w\sim\nu_{N}}[ww^{\top}]=\Sigma$ , we have

[TABLE]

Recall that

[TABLE]

Note that $R+\alpha B^{\top}PB+\alpha^{2}B^{\top}P\Xi(\lambda I-\alpha\Xi^{\top}P\Xi)^{-1}\Xi^{\top}PB$ is positive definite for $\lambda\geq\bar{\lambda}$ because $R$ is positive definite and $\lambda I-\alpha\Xi^{\top}P\Xi$ is positive definite for $\lambda\geq\bar{\lambda}$ . Thus, the objective function in (B.2) is strictly convex in $\bm{u}$ and has the unique minimizer $\bm{u}^{\star}=K\bm{x}$ . Therefore, we obtain that

[TABLE]

We conclude that $v^{\prime}$ solves the Bellman equation since $P$ and $z$ satisfy $P=Q+\alpha A^{\top}PA+\alpha^{2}A^{\top}SA$ and $(1-\alpha)z=\lambda\mbox{tr}[\{\lambda(\lambda I-\alpha\Xi^{\top}P\Xi)^{-1}-I\}\Sigma]$ . Furthermore, when $v^{\prime}$ is the optimal value function, the value of an optimal policy ${\pi}^{\prime}$ at $\bm{x}\in\mathbb{R}^{n}$ is uniquely given by $\bm{u}^{\star}$ , i.e., ${\pi}^{\prime}(\bm{x})=K\bm{x}$ .

We now characterize the worst-case distribution policy. Plugging $w=\hat{w}^{(i)}$ and $\bm{u}=K\bm{x}$ into (B.1), we obtain that

[TABLE]

Let $\gamma^{\prime}(\bm{x}):=\frac{1}{N}\sum_{i=1}^{N}\delta_{{w}^{\prime(i)}_{\bm{x}}}$ for all $\bm{x}\in\mathcal{X}$ . Then,

[TABLE]

Therefore, we have

[TABLE]

where the last equality holds by the definition of $w^{\prime(i)}_{\bm{x}}$ ’s. On the other hand, it follows from Proposition 6 that

[TABLE]

Thus, we conclude that $\gamma^{\prime}(\bm{x})$ is one of the worst-case distributions. ∎
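The two-step structure of the proof (inner maximization over $w^{\prime}$ , then minimization over $\bm{u}$ ) suggests a simple fixed-point iteration for $P$ and $K$ . The sketch below is a hedged illustration derived from those steps: the shorthand `P_tilde` for $P+\alpha P\Xi(\lambda I-\alpha\Xi^{\top}P\Xi)^{-1}\Xi^{\top}P$ , the function names, and the plain fixed-point scheme started from $P=Q$ are assumptions introduced here, and convergence is not guaranteed in general. The constant $z$ is computed from the relation $(1-\alpha)z=\lambda\mbox{tr}[\{\lambda(\lambda I-\alpha\Xi^{\top}P\Xi)^{-1}-I\}\Sigma]$ .

```python
import numpy as np

def dr_lq_solve(A, B, Xi, Q, R, alpha, lam, max_iter=10000, tol=1e-12):
    """Iterate on P until the Riccati-type fixed point is reached; return (P, K)."""
    k = Xi.shape[1]
    P = np.asarray(Q, dtype=float).copy()
    for _ in range(max_iter):
        M = lam * np.eye(k) - alpha * Xi.T @ P @ Xi
        if np.linalg.eigvalsh(M).min() <= 0:
            raise ValueError("lam too small: lam*I - alpha*Xi'P Xi is not positive definite")
        # Effect of the inner maximization over w' (shorthand introduced here).
        P_tilde = P + alpha * P @ Xi @ np.linalg.solve(M, Xi.T @ P)
        # Minimization over u of the strictly convex quadratic objective.
        H = R + alpha * B.T @ P_tilde @ B
        K = -alpha * np.linalg.solve(H, B.T @ P_tilde @ A)
        P_next = Q + alpha * A.T @ P_tilde @ (A + B @ K)
        P_next = 0.5 * (P_next + P_next.T)  # keep P symmetric numerically
        if np.max(np.abs(P_next - P)) < tol:
            return P_next, K
        P = P_next
    raise RuntimeError("fixed-point iteration did not converge")

def dr_lq_constant(P, Xi, Sigma, alpha, lam):
    """Constant z from (1 - alpha) z = lam * tr[(lam * M^{-1} - I) Sigma]."""
    k = Xi.shape[1]
    M = lam * np.eye(k) - alpha * Xi.T @ P @ Xi
    return lam * np.trace((lam * np.linalg.inv(M) - np.eye(k)) @ Sigma) / (1.0 - alpha)
```

As a sanity check on this sketch, letting `lam` grow very large makes the adversary's deviation from the empirical distribution prohibitively expensive, and the computed gain then approaches the standard discounted LQ gain.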

We now consider the case in which the data samples $\hat{w}^{(i)}$ ’s have non-zero mean, i.e.,

[TABLE]

The linear system (5.1) can be rewritten as

[TABLE]

where $w_{t}^{\prime}:=w_{t}-\bar{w}$ . We now normalize the data samples $\hat{w}^{\prime(i)}:=\hat{w}^{(i)}-\bar{w}$ for all $i\in\mathcal{I}$ so that

[TABLE]

Let $\bar{x}:=(I-A)^{-1}\Xi\bar{w}$ assuming it is well-defined. Then,

[TABLE]

By letting $x_{t}^{\prime}:=((x_{t}-\bar{x})^{\top},1)^{\top}\in\mathbb{R}^{n+1}$ , we can rewrite the system as

[TABLE]

Define a positive semidefinite matrix $Q^{\prime}\in\mathbb{R}^{(n+1)\times(n+1)}$ by

[TABLE]

We then have

[TABLE]

Thus, the nonzero mean case is converted to the zero mean case with the normalized data $\hat{w}^{\prime(i)}$ ’s, the expanded state $x_{t}^{\prime}$ and the new positive semidefinite matrix $Q^{\prime}$ in the quadratic cost function. Therefore, we can use Theorem 4 to compute the DR-control gain matrix $K^{\prime}$ . The corresponding optimal policy is obtained as $\pi^{\prime}(\bm{x}):=K^{\prime}((\bm{x}-\bar{x})^{\top},1)^{\top}$ for all $\bm{x}\in\mathbb{R}^{n}$ .
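The change of variables can be checked numerically. The following sketch (with hypothetical helper names, and assuming $I-A$ is invertible) centers the samples, computes $\bar{x}=(I-A)^{-1}\Xi\bar{w}$ , and forms the expanded state $((\bm{x}-\bar{x})^{\top},1)^{\top}$ .

```python
import numpy as np

def shift_to_zero_mean(A, Xi, samples):
    """Return the sample mean w_bar, the centered samples, and the offset x_bar."""
    samples = np.asarray(samples, dtype=float)
    w_bar = samples.mean(axis=0)
    centered = samples - w_bar
    # x_bar solves x_bar = A x_bar + Xi w_bar, assuming I - A is invertible.
    x_bar = np.linalg.solve(np.eye(A.shape[0]) - A, Xi @ w_bar)
    return w_bar, centered, x_bar

def expand_state(x, x_bar):
    """Expanded state ((x - x_bar)^T, 1)^T used in the zero-mean reformulation."""
    return np.append(np.asarray(x, dtype=float) - x_bar, 1.0)
```

With these quantities, the shifted state satisfies $x_{t+1}-\bar{x}=A(x_{t}-\bar{x})+Bu_{t}+\Xi w_{t}^{\prime}$ , which is the identity that turns the nonzero-mean problem into the zero-mean one.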

Bibliography

[1] A. Nilim and L. El Ghaoui, “Robust control of Markov decision processes with uncertain transition matrices,” Oper. Res., vol. 53, no. 5, pp. 780–798, 2005.
[2] S. Samuelson and I. Yang, “Data-driven distributionally robust control of energy storage to manage wind power fluctuations,” in Proceedings of the 1st IEEE Conference on Control Technology and Applications, 2017.
[3] P. Mohajerin Esfahani and D. Kuhn, “Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations,” Math. Program., vol. 171, no. 1–2, pp. 115–166, 2018.
[4] C. Zhao and Y. Guan, “Data-driven risk-averse stochastic optimization with Wasserstein metric,” Oper. Res. Lett., vol. 46, no. 2, 2018.
[5] R. Gao and A. J. Kleywegt, “Distributionally robust stochastic optimization with Wasserstein distance,” arXiv:1604.02199, 2016.
[6] J. Blanchet, K. Murthy, and F. Zhang, “Optimal transport based distributionally robust optimization: Structural properties and iterative schemes,” arXiv:1810.02403, 2018.
[7] A. Sinha, H. Namkoong, and J. Duchi, “Certifying some distributional robustness with principled adversarial training,” in International Conference on Learning Representations, 2018.
[8] R. Chen and I. C. Paschalidis, “A robust learning approach for regression models based on distributionally robust optimization,” Journal of Machine Learning Research, pp. 1–48, 2018.
