Distributional Method for Risk Averse Reinforcement Learning

Ziteng Cheng; Sebastian Jaimungal; Nick Martin

arXiv:2302.14109 · cs.LG · March 1, 2023

TL;DR

This paper presents a distributional reinforcement learning approach for risk-averse policies in Markov decision processes, leveraging neural networks to efficiently handle randomized policies and avoid the curse of dimensionality.

Contribution

It introduces a novel distributional method for risk-averse reinforcement learning that effectively incorporates randomized policies and exploits problem structure to mitigate dimensionality issues.

Findings

01 The proposed method successfully avoids the curse of dimensionality.

02 Neural network approximation effectively models the value distribution.

03 The approach performs well across various randomly chosen model parameters.

Equations

\rho_{t,T}(\mathfrak{Z}) := \begin{cases}\rho_{t}\big(Z_{t}+\gamma\rho_{t+1,T}(\mathfrak{Z})\big), & t < T,\\ \rho_{T}(Z_{T}), & t = T,\end{cases}

\rho_{0,\infty}(\mathfrak{Z}) := \lim_{T\to\infty}\rho_{0,T}(\mathfrak{Z}),

\rho^{\pi}_{t}(Z):=\sup_{\mu\in\mathcal{M}}\bigg{\{}\int_{0}^{1}\inf_{q\in\mathbb{R}}\bigg{\{}q+\\ \xi^{-1}\int_{\mathbb{R}}(z-q)_{+}\,P^{Z|\mathscr{U}^{\pi}_{t}}(\operatorname{d\!}z)\bigg{\}}\,\mu(\operatorname{d\!}\xi)\bigg{\}},

\inf_{q\in\mathbb{R}}\bigg\{q+\xi^{-1}\int_{\mathbb{R}}(z-q)_{+}\,P^{Z|\mathscr{U}^{\pi}_{t}}(\mathrm{d}z)\bigg\}

\inf_{\pi}\rho^{\pi}_{0,\infty}\Big(\big(C(X^{\pi}_{t},A^{\pi}_{t},X^{\pi}_{t+1})\big)_{t\in\mathbb{N}}\Big),

Sv(i):=\inf_{\lambda\in\mathcal{P}(\mathbb{A})}\sup_{\mu\in\mathcal{M}}\bigg{\{}\int_{0}^{1}\inf_{q\in\mathbb{R}}\bigg{\{}q+\\ \xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}\sum_{j\in\mathbb{X}}T^{k}_{ij}\Big{(}C(i,k,j)+\gamma v(j)-q\Big{)}_{+}\bigg{\}}\mu(\operatorname{d\!}\xi)\bigg{\}}.

Q(i,\lambda):=\sup_{\mu\in\mathcal{M}}\int_{[0,1]}\inf_{q\in\mathbb{R}}\bigg{\{}q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}\\ \sum_{j\in\mathbb{X}}T^{k}_{ij}\,\Big{(}C(i,k,j)+\gamma v^{*}(j)-q\Big{)}_{+}\bigg{\}}\,\mu(\operatorname{d\!}\xi),

Q(i,\lambda)=\sup_{\mu\in\mathcal{M}}\int_{[0,1]}\inf_{q\in\mathbb{R}}\bigg\{q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}\sum_{j\in\mathbb{X}}T^{k}_{ij}\,\Big(C(i,k,j)+\gamma\inf_{\lambda'\in\mathcal{P}(\mathbb{A})}Q(j,\lambda')-q\Big)_{+}\bigg\}\,\mu(\mathrm{d}\xi).

\displaystyle g(i,\lambda,q):=\sum_{k\in\mathbb{A}}\lambda_{k}\sum_{j\in\mathbb{X}}T^{k}_{ij}\,\big{(}C(i,k,j)+\gamma v^{*}(j)-q\big{)}_{+}.

v^{*}(i)=\inf_{\lambda\in\mathcal{P}(\mathbb{A})}\sup_{\mu\in\mathcal{M}}\\ \bigg{\{}\int_{(0,1]}\inf_{q\in[0,\frac{c_{\text{max}}}{1-\gamma}]}\bigg{\{}q+\xi^{-1}g(i,\lambda,q)\bigg{\}}\mu(\operatorname{d\!}\xi)\bigg{\}}.

\begin{cases}\hat{\theta}_{n+1}\in\operatorname*{arg\,min}_{\theta\in\Theta}\sup_{(i,k,q)\in\mathbb{X}\times\mathbb{A}\times[0,\frac{c_{\max}}{1-\gamma}]}\bigg(f_{\theta}(i,k,q)-\sum_{j\in\mathbb{X}}\hat{T}^{k}_{ij}\big(C(i,k,j)+\gamma\hat{v}_{n}(j)-q\big)_{+}\bigg)^{2},\\ \hat{v}_{n+1}(i)=\inf_{\lambda\in\mathcal{P}(\mathbb{A})}\sup_{\mu\in\mathcal{M}}\int_{(0,1]}\inf_{q\in[0,\frac{c_{\max}}{1-\gamma}]}\bigg\{q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}f_{\hat{\theta}_{n+1}}(i,k,q)\bigg\}\,\mu(\mathrm{d}\xi),\end{cases}

\hat{T}^{k}_{ij} := \frac{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k,j)}(x_{t},a_{t},x_{t+1})}{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(x_{t},a_{t})}.

\sum_{t=1}^{T-1}\big{(}y_{x_{t}a_{t}}-(c_{t}+\gamma\hat{v}(x_{t+1})-q)_{+}\big{)}^{2}\\ =\sum_{(i,k)\in\mathbb{X}\times\mathbb{A}}\sum_{t=1}^{T-1}\mathbbm{1}_{(i,k)}(x_{t},a_{t})\sum_{j\in\mathbb{X}}\mathbbm{1}_{j}(x_{t+1})\\ \big{(}y_{ik}-(C(i,k,j)+\gamma\hat{v}(j)-q)_{+}\big{)}^{2},

\displaystyle y_{ik}=\sum_{j\in\mathbb{X}}\hat{T}^{k}_{ij}\big{(}C(i,k,j)+\gamma\hat{v}(j)-q\big{)}_{+},\quad(i,k)\in\mathbb{X}\times\mathbb{A}.

\begin{cases}\hat{\theta}_{n+1}\in\operatorname*{arg\,min}_{\theta\in\Theta}\sup_{q\in[0,\frac{c_{\max}}{1-\gamma}]}\sum_{t=1}^{T-1}\Big(f_{\theta}(x_{t},a_{t},q)-\big(c_{t}+\gamma\hat{v}_{n}(x_{t+1})-q\big)_{+}\Big)^{2},\\ \hat{v}_{n+1}(i)=\inf_{\lambda\in\mathcal{P}(\mathbb{A})}\sup_{\mu\in\mathcal{M}}\int_{(0,1]}\inf_{q\in[0,\frac{c_{\max}}{1-\gamma}]}\bigg\{q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}f_{\hat{\theta}_{n+1}}(i,k,q)\bigg\}\,\mu(\mathrm{d}\xi).\end{cases}

\displaystyle\mathbb{P}\bigg{(}\sum_{r=t+1}^{t+\ell}\mathbbm{1}_{(i,k)}(X^{\pi}_{r},A^{\pi}_{r})\geq 1\bigg{|}\mathcal{F}^{\pi}_{t}\bigg{)}>\varepsilon_{e},

\bigg|f_{\hat{\theta}_{\text{new}}}(i,k,q)-\sum_{j\in\mathbb{X}}\hat{T}^{k}_{ij}\big(C(i,k,j)+\gamma\hat{v}(j)-q\big)_{+}\bigg|\leq\varepsilon_{\theta},

\sup_{i\in\mathbb{X}}\bigg{|}\hat{v}_{\text{new}}(i)-\inf_{\lambda\in\mathcal{P}(\mathbb{A})}\sup_{\mu\in\mathcal{M}}\int_{0}^{1}\\ \inf_{q\in[0,\frac{c_{\text{max}}}{1-\gamma}]}\bigg{\{}q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}f_{\hat{\theta}_{\text{new}}}(i,k,q)\bigg{\}}\mu(\operatorname{d\!}\xi)\bigg{|}\leq\varepsilon_{v}.

\displaystyle 1-3|\mathbb{X}|^{2}|\mathbb{A}|\bigg{(}e^{-\frac{\varepsilon_{e}^{2}}{4}\lfloor\frac{t_{\max}-1}{\ell}\rfloor}+e^{-\frac{\varepsilon^{2}\varepsilon_{e}^{2}}{8\ell}\lfloor\frac{t_{\max}-1}{\ell}\rfloor}\bigg{)}

\|\hat{v}_{n}-v^{*}\|_{\infty}\leq\gamma^{n}\|\hat{v}_{0}-v^{*}\|_{\infty}+\frac{b^{-1}c_{\max}\varepsilon}{(1-\gamma)^{2}}+\frac{b^{-1}\varepsilon_{\theta}+\varepsilon_{v}}{1-\gamma}.

\hat{\theta}_{n+1}\in\operatorname*{arg\,min}_{\theta\in\Theta}\\ \sum_{m=1}^{m_{\text{grid}}}\sum_{t=1}^{T-1}\bigg{(}f_{\theta}(x_{t},a_{t},q_{m})-\Big{(}c_{t}+\gamma\hat{v}_{n}(x_{t+1})-q_{m}\Big{)}_{+}\bigg{)}^{2}\\ +\beta\psi(\theta),

\displaystyle\psi(\theta):=\sum_{(i,k)\in\mathbb{X}\times\mathbb{A}}\sum_{m=1}^{m_{\text{grid}}-1}\big{(}f_{\theta}(i,k,q_{m+1})-f_{\theta}(i,k,q_{m})\big{)}_{+},

\hat{v}_{n+1}(i)=\inf_{\lambda\in\mathcal{P}(\mathbb{A})}\max_{\mu\in\mathcal{M}}\int_{(0,1]}\inf_{q\in[0,\frac{c_{\max}}{1-\gamma}]}\bigg\{q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}f_{\hat{\theta}_{n+1}}(i,k,q)\bigg\}\,\mu(\mathrm{d}\xi),

\mathcal{M}=\big{\{}0.2\delta_{0.2}+0.8\delta_{1},\delta_{0.5},0.1\delta_{0.05}+0.5\delta_{0.4}+0.6\delta_{0.6},\\ 0.5\delta_{0.3}+0.5\delta_{0.8}\big{\}},

\displaystyle\mathbb{P}\bigg{(}\bigg{|}\frac{\sum_{t=1}^{t_{\text{max}}-1}\mathbbm{1}_{(i,k,j)}(x_{t},a_{t},x_{t+1})}{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(x_{t},a_{t})}-T^{k}_{ij}\bigg{|}>\varepsilon\bigg{)}\leq\exp\bigg{(}-\frac{(N-\varepsilon_{e}\lfloor\frac{t_{\max}-1}{\ell}\rfloor)^{2}}{\lfloor\frac{t_{\max}-1}{\ell}\rfloor}\bigg{)}+2\exp\left(-\frac{\varepsilon^{2}N^{2}}{2t_{\max}}\right).

\bigg\{\bigg|\frac{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k,j)}(X_{t},A_{t},X_{t+1})}{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(X_{t},A_{t})}-T^{k}_{ij}\bigg|\geq\varepsilon\bigg\}

\quad\subseteq\bigg\{\sum_{r=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(X^{\pi}_{r},A^{\pi}_{r})<N\bigg\}\cup\bigg(\bigg\{\sum_{r=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(X^{\pi}_{r},A^{\pi}_{r})\geq N\bigg\}\cap\bigg\{\bigg|\frac{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k,j)}(X_{t},A_{t},X_{t+1})}{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(X_{t},A_{t})}-T^{k}_{ij}\bigg|\geq\varepsilon\bigg\}\bigg)

\quad\subseteq\bigg\{\sum_{r=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(X^{\pi}_{r},A^{\pi}_{r})<N\bigg\}\cup\bigg\{\bigg|\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k,j)}(X_{t},A_{t},X_{t+1})-T^{k}_{ij}\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(X_{t},A_{t})\bigg|\geq\varepsilon N\bigg\}.

\mathbb{P}\bigg(\bigg|\frac{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k,j)}(x_{t},a_{t},x_{t+1})}{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(x_{t},a_{t})}-T^{k}_{ij}\bigg|>\varepsilon\bigg)

\quad\leq\mathbb{P}\bigg(\sum_{r=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(X^{\pi}_{r},A^{\pi}_{r})<N\bigg)+\mathbb{P}\bigg(\bigg|\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k,j)}(X_{t},A_{t},X_{t+1})-T^{k}_{ij}\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(X_{t},A_{t})\bigg|\geq\varepsilon N\bigg).

L_{\iota} := \sum_{t=1}^{\ell\iota}\mathbbm{1}_{(i,k)}(X_{t},A_{t})-\varepsilon_{e}\iota.

\displaystyle\mathbb{E}\big{(}L_{\iota+1}\big{|}\mathscr{F}^{\pi}_{\ell\iota}\big{)}=L_{\iota}+\mathbb{E}\bigg{(}\sum_{t=\ell\iota+1}^{\ell(\iota+1)}\mathbbm{1}_{(i,k)}(X_{t},A_{t})-\varepsilon_{e}\bigg{|}\mathscr{F}^{\pi}_{\ell\iota}\bigg{)}\geq L_{\iota},

Taxonomy

Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Statistical Methods and Inference

Full text

Distributional Method for Risk Averse Reinforcement Learning

SJ would like to acknowledge support from the Natural Sciences and Engineering Research Council of Canada (grants RGPIN-2018-05705 and RGPAS-2018-522715).

Ziteng Cheng (1, 4), Sebastian Jaimungal (1, 4) and Nick Martin (3, 4)

1 Department of Statistical Sciences, University of Toronto, Canada, {sebastian.jaimungal, ziteng.cheng}@utoronto.ca

3 [email protected]

4 Equal contribution

Abstract

We introduce a distributional method for learning the optimal policy in risk averse Markov decision processes with finite state and action spaces, latent costs, and stationary dynamics. We assume sequential observations of states, actions, and costs, and assess the performance of a policy using dynamic risk measures constructed from nested Kusuoka-type conditional risk mappings. For such performance criteria, randomized policies may outperform deterministic policies; therefore, the candidate policies lie in the d-dimensional simplex, where d is the cardinality of the action space. Existing risk averse reinforcement learning methods seldom consider randomized policies, and naive extensions to the current setting suffer from the curse of dimensionality. By exploiting certain structures embedded in the corresponding dynamic programming principle, we propose a distributional learning method for seeking the optimal policy. The conditional distribution of the value function is cast into a specific type of function, chosen with the ease of risk averse optimization in mind. We use a deep neural network to approximate this function, illustrate that the proposed method avoids the curse of dimensionality in the exploration phase, and explore the method's performance over a wide range of randomly chosen model parameters.

Index Terms: risk averse, Markov decision process, reinforcement learning, deep learning

I Introduction

Markov decision processes (MDPs) are discrete-time stochastic control problems used for sequential decision-making in situations where costs are partially random and partially under the control of a decision maker. In risk-averse MDPs, the decision maker is concerned with the risk or variability of the outcomes beyond the expected costs. One way to incorporate risk aversion into MDPs is to use nested compositions of risk transition mappings. This approach ensures time consistency and ultimately enables the use of a dynamic programming principle (DPP) to solve the corresponding sequential optimization problem. The approach was proposed in [21], where deterministic costs are considered and DPPs are derived for both the finite and the infinite time horizon (the latter requiring bounded costs). Subsequent studies such as [23] and [8] explore infinite time horizon risk-averse DPPs with unbounded costs in different settings. [2] considers unbounded latent costs and establishes the corresponding finite and infinite horizon DPPs. More recently, [7] developed a framework based on Kusuoka-type conditional risk mappings that also takes randomized actions into account in a risk averse manner. Depending on the type of conditional risk mappings used, randomized actions may be preferable to deterministic actions, as illustrated by a motivating example in [7]. Other methods of incorporating risk aversion into MDPs are also available, including those discussed in [3], [8], [5], and the references therein.

The focus of this paper is on the approach of nested compositions of risk transition mappings. The main objective is to develop a reinforcement learning method that solves the infinite horizon risk-averse MDP problem presented in [7]. Specifically, the aim is to solve this problem with finite state and action spaces, deterministic latent costs, and stationary dynamics, without assuming knowledge of the controlled transition matrix or cost function. We begin by briefly reviewing some algorithms that solve infinite horizon risk-averse MDP problems.

For instance, [26] derives a policy gradient formula by combining the static gradient formula for coherent risk measure with the corresponding DPP. This approach is further developed in a sample-based method in [25], with its convergence analyzed in [12]. [29] proposes a family of sample-based algorithms to approximately solve problems with continuous state and action spaces. [24] presents and analyzes a risk-averse Q-learning algorithm, while [13] extends the previous Q-learning algorithm based on estimating a general minimax function with stochastic approximation, with detailed error analysis conducted in [14]. [17] studies a risk-averse temporal difference method that evaluates the value function using linear function approximations. Finally, the recent work in [9] develops an approach to address risk transition mappings induced by convex risk measures.

However, the methods mentioned above do not directly apply to the problem presented in [7], where the risk aversion also involves the randomness in the randomized actions. This is mainly because of the lack of linearity: the value function of a randomized action may not be a linear combination of the value functions of individual actions with respect to the randomizing action kernel. Naively extending the existing methods may result in a situation where we need to learn the value functions for numerous pairs of states and action kernels. Since the admissible action kernels form a $d$-dimensional simplex, where $d$ is the size of the action space, the exploration task that follows may suffer from the curse of dimensionality and demand a significant amount of data. On the other hand, the finite nature of the underlying state and action spaces suggests that we can avoid such an excessively expensive exploration task.

We propose a distributional method to address the challenges posed by the risk aversion towards randomized actions in the problem presented in [7]. The proposed method learns an auxiliary function that contains sufficient information about the value function’s distribution, avoiding the curse of dimensionality and facilitating the computation of the value function defined via a risk transition mapping. We show in Theorem III.2 that the proposed method’s exploration effort grows polynomially with the state and action space cardinalities. Although we initially considered deterministic latent costs, our method naturally handles random costs whose distribution depends on the current state, the realized action, and the next state. This type of random cost is seldom considered in existing literature on risk-averse reinforcement learning. We provide numerical examples that demonstrate the efficacy of the proposed method at the end of this report.

Using distributional methods to solve MDP problems that are not risk neutral has a long history (cf. [15], [27], [19], and the references therein). More recently, a series of works including [4], [10], [28], [20], and [30] have demonstrated that distributional methods can also achieve better results in the risk neutral setting. In this broader context, our method also contributes to the understanding of the capabilities of distributional methods in solving MDP problems.

II Preliminaries

In this section, we present the setup of this paper.

II-A Markov decision process

Let $(\Omega,\mathscr{F},\mathbb{P})$ be a probability space. We consider a time-homogeneous Markov decision process (MDP) with a finite state space $\mathbb{X}$ and finite action space $\mathbb{A}$. For each $k\in\mathbb{A}$, let $T^{k}\in\mathbb{R}^{|\mathbb{X}|\times|\mathbb{X}|}$ be a controlled transition matrix, where $T^{k}_{ij}$ is the probability of transitioning to state $j\in\mathbb{X}$ at the next epoch, given the current state $i\in\mathbb{X}$ and action $k\in\mathbb{A}$. Let $\pi:\mathbb{X}\to\mathcal{P}(\mathbb{A})$ be a stationary Markovian policy, where $\mathcal{P}(\mathbb{A})$ is the set of probability measures on $\mathbb{A}$. Since $\mathbb{A}$ is finite, $\mathcal{P}(\mathbb{A})$ is a $|\mathbb{A}|$-dimensional simplex, and for $\lambda\in\mathcal{P}(\mathbb{A})$, $\lambda_{k}$ is the probability of action $k$ occurring. The state-action process subject to policy $\pi$ is denoted by $\{(X^{\pi}_{t},A^{\pi}_{t})\}_{t\geq 0}$. The MDP is associated with a bounded latent cost function $C:\mathbb{X}\times\mathbb{A}\times\mathbb{X}\to[0,c_{\text{max}}]$, where $c_{\text{max}}>0$ is the upper bound of the cost. Finally, we let $\gamma\in(0,1)$ be the discount factor.
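The setup above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: the function name `simulate`, the array layouts `T[k, i, j]` and `C[i, k, j]`, and the randomly drawn model are all assumptions made for illustration. It samples the observed tuples $(x_{t},a_{t},x_{t+1},c_{t})$ under a randomized stationary policy.

```python
import numpy as np

def simulate(T, C, policy, x0, t_max, rng):
    """Sample (x_t, a_t, x_{t+1}, c_t) for t = 1, ..., t_max - 1.

    T[k, i, j] : probability of moving from state i to state j under action k
    C[i, k, j] : latent cost of the transition (i, k, j)
    policy[i]  : probability vector over actions used in state i
    """
    n_actions, n_states, _ = T.shape
    data, x = [], x0
    for _ in range(t_max - 1):
        a = rng.choice(n_actions, p=policy[x])      # randomized action
        x_next = rng.choice(n_states, p=T[a, x])    # controlled transition
        data.append((x, a, x_next, C[x, a, x_next]))
        x = x_next
    return data

# Hypothetical model: 3 states, 2 actions, costs bounded by c_max = 1.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
C = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))
policy = np.full((n_states, n_actions), 1.0 / n_actions)
data = simulate(T, C, policy, x0=0, t_max=1000, rng=rng)
```

Only the tuples in `data` would be available to a learner in the setting of this paper; `T` and `C` are treated as unknown.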

II-B Risk averse dynamic programming

In this paper, we use the notation $L^{\infty}(\Omega,\mathscr{F},\mathbb{P})$ to denote the space of bounded real-valued Borel-measurable random variables. Equality and inequality between random variables are understood in a $\mathbb{P}$-almost sure sense. Let $\mathscr{G}\subseteq\mathscr{F}$ be a $\sigma$-algebra. We say $\zeta:L^{\infty}(\Omega,\mathscr{F},\mathbb{P})\to L^{\infty}(\Omega,\mathscr{G},\mathbb{P})$ is a conditional risk mapping if $\zeta$ satisfies the following conditions for any $Z,Z^{1},Z^{2}\in L^{\infty}(\Omega,\mathscr{F},\mathbb{P})$, $Y\in L^{\infty}(\Omega,\mathscr{G},\mathbb{P})$ and $\beta\geq 0$:

(i) [Monotonicity] if $Z^{1}\leq Z^{2}$, then $\zeta(Z^{1})\leq\zeta(Z^{2})$;

(ii) [Translation equivariance] $\zeta(Y+Z)=Y+\zeta(Z)$;

(iii) [Convexity] if $\beta\in[0,1]$, then $\zeta(\beta Z^{1}+(1-\beta)Z^{2})\leq\beta\zeta(Z^{1})+(1-\beta)\zeta(Z^{2})$;

(iv) [Positive homogeneity] $\zeta(\beta Z)=\beta\zeta(Z)$.

Some authors replace $\beta$ in condition (iii) (resp. (iv)) with $Y\in[0,1]$ (resp. $Y\geq 0$), which, for the most part, does not affect the development of the theory.

Consider $\mathscr{U}_{0}=\{\emptyset,\Omega\}\subseteq\mathscr{U}_{1}\subseteq\dots\subseteq\mathscr{F}$, $\mathfrak{Z}=(Z_{t})_{t\in\mathbb{N}}\subset L^{\infty}(\Omega,\mathscr{F},\mathbb{P})$ and $\rho_{t}:L^{\infty}(\Omega,\mathscr{F},\mathbb{P})\to L^{\infty}(\Omega,\mathscr{U}_{t},\mathbb{P})$. Suppose $|Z_{t}|\leq c_{\text{max}}$ for all $t\in\mathbb{N}$. [21] proposes to use a dynamic risk measure of the form

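$$\rho_{t,T}(\mathfrak{Z}) := \begin{cases}\rho_{t}\big(Z_{t}+\gamma\rho_{t+1,T}(\mathfrak{Z})\big), & t < T,\\ \rho_{T}(Z_{T}), & t = T,\end{cases}\qquad \rho_{0,\infty}(\mathfrak{Z}) := \lim_{T\to\infty}\rho_{0,T}(\mathfrak{Z}), \tag{II.1}$$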

for the MDP optimization problem. It can be shown that the construction above guarantees time consistency of $(\rho_{t,\infty})_{t\in\mathbb{N}}$. (Footnote 1: We do not adopt the setting of [21] verbatim, for the sake of a smooth transition.)

In what follows, we let $\mathscr{U}^{\pi}_{0}$ be the trivial $\sigma$-algebra, $\mathscr{U}^{\pi}_{t}$ be the $\sigma$-algebra generated by $(X^{\pi}_{1},A^{\pi}_{1},\dots,X^{\pi}_{t-1},A^{\pi}_{t-1},X^{\pi}_{t})$, and $\mathcal{M}$ be a set of discrete probability measures with support contained in $(0,1]$. We consider a specific type of conditional risk mapping

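$$\rho^{\pi}_{t}(Z):=\sup_{\mu\in\mathcal{M}}\bigg\{\int_{0}^{1}\inf_{q\in\mathbb{R}}\bigg\{q+\xi^{-1}\int_{\mathbb{R}}(z-q)_{+}\,P^{Z|\mathscr{U}^{\pi}_{t}}(\mathrm{d}z)\bigg\}\,\mu(\mathrm{d}\xi)\bigg\},$$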

where the right-hand side is inspired by the Kusuoka representation of law-invariant coherent risk measures (cf. [18], [22, Section 6]). We note that

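$$\inf_{q\in\mathbb{R}}\bigg\{q+\xi^{-1}\int_{\mathbb{R}}(z-q)_{+}\,P^{Z|\mathscr{U}^{\pi}_{t}}(\mathrm{d}z)\bigg\}$$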

is the conditional version of $\operatorname{\mathrm{AV}@\mathrm{R}}_{\xi}$ under the conditional distribution of $Z$ given $\mathscr{U}^{\pi}_{t}$. The main goal of this paper is to develop a sample-based algorithm that solves the following infinite horizon risk averse MDP optimization problem

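$$\inf_{\pi}\rho^{\pi}_{0,\infty}\Big(\big(C(X^{\pi}_{t},A^{\pi}_{t},X^{\pi}_{t+1})\big)_{t\in\mathbb{N}}\Big), \tag{II.3}$$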

where $\rho^{\pi}_{0,\infty}$ is defined analogously to (II.1).
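For a discrete distribution, the Kusuoka-type risk mapping used here reduces to a finite computation: an average value-at-risk $\inf_{q}\{q+\xi^{-1}\mathbb{E}(Z-q)_{+}\}$ for each level $\xi$, mixed by $\mu$, and maximized over $\mu\in\mathcal{M}$. The sketch below illustrates this under stated assumptions: the function names `avar` and `kusuoka_risk`, the encoding of each $\mu$ as a list of `(weight, xi)` pairs, and the example distribution are hypothetical and not taken from the paper.

```python
import numpy as np

def avar(values, probs, xi):
    """AV@R at level xi in (0, 1] of a discrete Z: inf_q { q + E[(Z - q)_+] / xi }."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # The objective is convex and piecewise linear in q with kinks at the
    # atoms of Z, so it suffices to evaluate it at the atoms.
    return min(q + probs @ np.maximum(values - q, 0.0) / xi for q in values)

def kusuoka_risk(values, probs, measures):
    """sup over mu in `measures` of sum_xi mu({xi}) * AV@R_xi(Z).

    Each mu is a list of (weight, xi) pairs with weights summing to one.
    """
    return max(sum(w * avar(values, probs, xi) for w, xi in mu) for mu in measures)

# Hypothetical example: Z uniform on {0, 1, 2, 3} and a two-element set M.
z, p = [0.0, 1.0, 2.0, 3.0], [0.25] * 4
measures = [[(1.0, 1.0)], [(0.5, 0.5), (0.5, 1.0)]]
risk = kusuoka_risk(z, p, measures)
```

In this example the level $\xi=1$ recovers the mean 1.5, the level $\xi=0.5$ gives the average of the worst half of the outcomes, 2.5, and the supremum over the two measures is attained by the second one.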

The problem (II.3) can be solved using a dynamic programming principle. Specifically, we let $S$ be the Bellman operator acting on $v:\mathbb{X}\to\mathbb{R}$ defined as

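$$Sv(i):=\inf_{\lambda\in\mathcal{P}(\mathbb{A})}\sup_{\mu\in\mathcal{M}}\bigg\{\int_{0}^{1}\inf_{q\in\mathbb{R}}\bigg\{q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}\sum_{j\in\mathbb{X}}T^{k}_{ij}\Big(C(i,k,j)+\gamma v(j)-q\Big)_{+}\bigg\}\,\mu(\mathrm{d}\xi)\bigg\}. \tag{II.4}$$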

We can restrict $v$ to take values in $[0,\frac{c_{\text{max}}}{1-\gamma}]$ due to the boundedness of the cost. This allows us to replace $q\in\mathbb{R}$ in (II.4) with $q\in[0,\frac{c_{\text{max}}}{1-\gamma}]$. It can be shown that $S$ is a $\gamma$-contraction, and the fixed point of $S$, denoted by $v^{*}$, is the optimal value function. If $\pi^{*}:\mathbb{X}\to\mathcal{P}(\mathbb{A})$ attains the infimum in $Sv^{*}(i)$ for all $i\in\mathbb{X}$, then $\pi^{*}$ is the optimal stationary policy, and in fact, it is also optimal among all history-dependent policies. We refer to [7] for more discussion in a general setting.
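When the transition matrices and the cost function are known, the fixed point of $S$ can be approximated by plain value iteration. The following is a rough sketch under several assumptions that are not part of the paper: two actions, the infimum over $\mathcal{P}(\mathbb{A})$ replaced by a minimum over a grid `lambdas`, a hypothetical set `measures` standing in for $\mathcal{M}$, and a randomly drawn model. Restricting the infimum to a grid keeps the operator a $\gamma$-contraction but only approximates $S$.

```python
import numpy as np

def avar(values, probs, xi):
    # inf_q { q + E[(Z - q)_+] / xi }, attained at an atom of the discrete Z
    return min(q + probs @ np.maximum(values - q, 0.0) / xi for q in values)

def bellman(v, T, C, gamma, measures, lambdas):
    """Apply S to v, with the infimum over P(A) restricted to the grid `lambdas`."""
    n_states = T.shape[1]
    out = np.empty(n_states)
    for i in range(n_states):
        atoms = (C[i] + gamma * v[None, :]).ravel()          # C(i,k,j) + gamma v(j)
        out[i] = min(
            max(
                sum(w * avar(atoms, (lam[:, None] * T[:, i, :]).ravel(), xi)
                    for w, xi in mu)                         # mixture over levels xi
                for mu in measures                           # sup over M
            )
            for lam in lambdas                               # inf over the grid
        )
    return out

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 3, 2, 0.9
T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # T[k, i, j]
C = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))    # C[i, k, j]
measures = [[(1.0, 0.5)], [(0.5, 0.3), (0.5, 0.8)]]                # hypothetical M
lambdas = [np.array([p, 1.0 - p]) for p in np.linspace(0.0, 1.0, 11)]

v = np.zeros(n_states)
for _ in range(200):
    v = bellman(v, T, C, gamma, measures, lambdas)
```

The grid over $\lambda$ is exactly what becomes expensive as the number of actions grows, which motivates the approach of the next section.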

III Distributional method for risk-averse learning

In this section, we introduce a novel concept called $g$-values. We then propose a learning method based on $g$-values and establish a convergence result under suitable conditions, as stated in Theorem III.2. Finally, we provide a detailed description of the algorithm for implementing the method.

III-A $g$-value

In view of (II.4), we define the $Q$-value as

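$$Q(i,\lambda):=\sup_{\mu\in\mathcal{M}}\int_{[0,1]}\inf_{q\in\mathbb{R}}\bigg\{q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}\sum_{j\in\mathbb{X}}T^{k}_{ij}\,\Big(C(i,k,j)+\gamma v^{*}(j)-q\Big)_{+}\bigg\}\,\mu(\mathrm{d}\xi),$$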

and derive the following equation for $Q$-learning

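$$Q(i,\lambda)=\sup_{\mu\in\mathcal{M}}\int_{[0,1]}\inf_{q\in\mathbb{R}}\bigg\{q+\xi^{-1}\sum_{k\in\mathbb{A}}\lambda_{k}\sum_{j\in\mathbb{X}}T^{k}_{ij}\,\Big(C(i,k,j)+\gamma\inf_{\lambda'\in\mathcal{P}(\mathbb{A})}Q(j,\lambda')-q\Big)_{+}\bigg\}\,\mu(\mathrm{d}\xi). \tag{III.2}$$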

However, learning the $Q$ function on a fine grid of $\mathbb{X}\times\mathcal{P}(\mathbb{A})$ turns out to be excessively expensive. Therefore, instead of continuing with (III.2), we propose to learn the $g$-value defined in (III.3) below. (Footnote 2: By using $g$ to express trapezoid-shaped functions and invoking the dominated convergence theorem (cf. [6, Theorem 2.8.1]) and the monotone class theorem (cf. [6, Theorem 1.9.3 (ii)]), it can be shown that the function $q\mapsto\mathbb{E}((Z-q)_{+})$, $q\in\mathbb{R}$, characterizes the distribution of $Z$.) (Footnote 3: One may derive an equation for the $g$-value analogous to (III.2), but such an equation need not lead to a contraction in general.)

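$$g(i,\lambda,q):=\sum_{k\in\mathbb{A}}\lambda_{k}\sum_{j\in\mathbb{X}}T^{k}_{ij}\,\big(C(i,k,j)+\gamma v^{*}(j)-q\big)_{+}. \tag{III.3}$$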

The $g$-value has the advantage of being linear in $\lambda$, which helps mitigate the cost of exploration. Moreover, it is worth noting that $q\mapsto g(i,\lambda,q)$ is non-increasing and $1$-Lipschitz for any $(i,\lambda)\in\mathbb{X}\times\mathcal{P}(\mathbb{A})$, which will be useful in later analysis. Furthermore, we argue that the $g$-value is aligned with our goal of solving (II.3), since by the aforementioned DPP and (III.3), we have

[TABLE]

This formula shows that the $g$ -value is a crucial ingredient in our approach for solving (II.3).
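The properties of $q\mapsto g(i,\lambda,q)$ noted above can be checked numerically on a toy example. The sketch below assumes that, for a fixed state, the $g$-value is an expectation of the form $\mathbb{E}((Z-q)_{+})$ in which the action is drawn from $\lambda$; this is consistent with the linearity in $\lambda$ but is an assumption, as are the toy distributions and the names `stop_loss`, `support`, and `probs_by_action`.

```python
import numpy as np

def stop_loss(support, probs, q):
    """E[(Z - q)_+] for a discrete Z, evaluated at each point of q."""
    q = np.atleast_1d(np.asarray(q, dtype=float))
    excess = np.maximum(support[None, :] - q[:, None], 0.0)
    return excess @ probs

# Hypothetical toy data: two actions, each inducing a distribution on the same support.
support = np.array([0.0, 1.0, 2.0, 3.0])
probs_by_action = np.array([
    [0.1, 0.2, 0.3, 0.4],   # distribution under action 0
    [0.4, 0.3, 0.2, 0.1],   # distribution under action 1
])
lam = np.array([0.25, 0.75])           # a randomized action in the simplex
mix = lam @ probs_by_action            # distribution under the randomized action
q_grid = np.linspace(-1.0, 4.0, 501)

# Linearity in lambda: the value for the mixture is the mixture of the values.
g_mix = stop_loss(support, mix, q_grid)
g_lin = lam @ np.stack([stop_loss(support, p, q_grid) for p in probs_by_action])

# Non-increasing and 1-Lipschitz in q: all finite-difference slopes lie in [-1, 0].
slopes = np.diff(g_mix) / np.diff(q_grid)

# Away from atoms, the slope equals -P(Z > q), so the function determines the distribution.
q_mid, h = 1.5, 1e-6
slope_mid = (stop_loss(support, mix, q_mid + h)[0]
             - stop_loss(support, mix, q_mid - h)[0]) / (2 * h)
tail_mid = mix[support > q_mid].sum()
```

Here `g_mix` and `g_lin` agree up to rounding, and `slope_mid` matches `-tail_mid`, which illustrates on a finite support why $q\mapsto\mathbb{E}((Z-q)_{+})$ carries the full distributional information.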

III-B Theoretical foundation

Suppose that we have observed the running states, actions and costs subject to some exploration policy up to time $t_{\text{max}}$, resulting in a set of data $\{(x_{t},a_{t},x_{t+1},c_{t})\}_{t=1}^{t_{\text{max}}-1}$, where $c_{t}=C(x_{t},a_{t},x_{t+1})$. In order to approximate $g$, we employ a parameterized model $f_{\theta}:\mathbb{X}\times\mathbb{A}\times\mathbb{R}\to\mathbb{R}$, where $\theta\in\Theta$ is the parameter, and $f_{\theta}(i,k,q)$ is designated to approximate $g(i,\delta_{k},q)$, where $\delta_{k}$ is the Dirac measure on $k$. In view of (III.3), $g(i,\lambda,q)$ can be approximated by $\sum_{k\in\mathbb{A}}\lambda_{k}f_{\theta}(i,k,q)$. We use $\hat{\theta}$ to denote the estimate of the optimal parameter (if it exists). In view of the bounded cost and positive discount factor, we approximate $v^{*}$ with $\hat{v}:\mathbb{X}\to[0,\frac{c_{\text{max}}}{1-\gamma}]$. Heuristically, we want to update $\hat{\theta}$ and $\hat{v}$ recursively in the following way

[TABLE]

where we define

[TABLE]

It is well-known that $\hat{T}^{k}_{ij}$ is the MLE of the transition probability (cf. [1]). Note that $\sum_{j\in\mathbb{X}}\hat{T}^{k}_{ij}\big{(}c_{t}+\gamma\hat{v}(x_{t+1})-q\big{)}_{+}$ is a convex and $1$ -Lipschitz function of $q$ that falls within the range $[0,\frac{c_{\text{max}}}{1-\gamma}]$ . Therefore, although the objective involves the supremum over an uncountable set, updating $\hat{\theta}$ is not infeasible. However, such an update requires knowledge of $C$ . To circumvent this requirement, we observe that for $q$ fixed,

[TABLE]

as a function of $(y_{ik})_{(i,k)\in\mathbb{X}\times\mathbb{A}}$ , attains the infimum if

[TABLE]

We can then use the following updating scheme as an alternative

[TABLE]
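The step from the model-based target to a least-squares fit rests on the fact that a squared loss is minimized by the sample mean. The sketch below checks this on synthetic data: for fixed $q$ and each visited pair $(i,k)$, the average of $(c_{t}+\gamma\hat{v}(x_{t+1})-q)_{+}$ coincides with the target built from empirical transition frequencies and knowledge of $C$. The random model, the variable names, and the visit-count form of `T_hat` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, gamma, t_max = 3, 2, 0.3, 2000

# Hypothetical model: random kernel T[k, i, j] and deterministic cost C[i, k, j].
T = rng.dirichlet(np.ones(n_x), size=(n_a, n_x))
C = rng.uniform(0.0, 1.0, size=(n_x, n_a, n_x))
v_hat = rng.uniform(0.0, 1.0 / (1.0 - gamma), size=n_x)

# Simulate a trajectory under uniformly random actions.
x = np.empty(t_max, dtype=int)
a = rng.integers(n_a, size=t_max - 1)
x[0] = 0
for t in range(t_max - 1):
    x[t + 1] = rng.choice(n_x, p=T[a[t], x[t]])
c = C[x[:-1], a, x[1:]]                      # observed costs c_t

q = 0.4
target = np.maximum(c + gamma * v_hat[x[1:]] - q, 0.0)   # uses observed costs only

y_ls = np.full((n_x, n_a), np.nan)      # least-squares minimizer per (i, k)
y_model = np.full((n_x, n_a), np.nan)   # sum_j T_hat[j] * (C(i,k,j) + gamma*v_hat(j) - q)_+
for i in range(n_x):
    for k in range(n_a):
        mask = (x[:-1] == i) & (a == k)
        if not mask.any():
            continue                     # pair never visited; leave as NaN
        y_ls[i, k] = target[mask].mean()
        counts = np.bincount(x[1:][mask], minlength=n_x)
        T_hat = counts / counts.sum()    # empirical transition frequencies
        y_model[i, k] = T_hat @ np.maximum(C[i, k] + gamma * v_hat - q, 0.0)
```

The two arrays agree up to rounding on every visited pair, so the fit can be computed from $(x_{t},a_{t},x_{t+1},c_{t})$ alone. With a parameterized $f_{\theta}$ the fit is only approximate, which is what Assumption III.1 (iv) below quantifies.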

In order to obtain a convergence result, we make the following technical assumption.

Assumption III.1.

Let $c_{\text{max}}>0$ , $\ell\in\mathbb{N}$ , $b,\varepsilon_{e}\in(0,1)$ , $\varepsilon_{\theta},\varepsilon_{v}>0$ be absolute constants. We assume that

(i)

the range of the cost function $C$ is contained in $[0,c_{\text{max}}]$;

(ii)

$\sup_{\mu\in\mathcal{M}}\mu([0,b])=0$ ;

(iii)

$\{(X^{\pi}_{t},A^{\pi}_{t})\}_{t=1}^{T}$ is subject to an exploration policy $\pi$ such that

[TABLE]

for any $(t,i,k)\in\mathbb{N}\times\mathbb{X}\times\mathbb{A}$ , where $\mathcal{F}^{\pi}_{t}:=\sigma(X^{\pi}_{1},A^{\pi}_{1},\dots,X^{\pi}_{t},A^{\pi}_{t})$ ;

(iv)

regardless of the data and $\hat{v}:\mathbb{X}\to[0,\frac{c_{\text{max}}}{1-\gamma}]$ , we always find $\hat{\theta}_{\text{new}}\in\Theta$ and $\hat{v}_{\text{new}}:\mathbb{X}\to[0,\frac{c_{\text{max}}}{1-\gamma}]$ such that, for all $(i,k,q)\in\mathbb{X}\times\mathbb{A}\times[0,\frac{c_{\max}}{1-\gamma}]$ ,

[TABLE]

and

[TABLE]

Conditions (i) and (ii) follow automatically from the setting above; they are included in the assumption for ease of reference. Condition (iii) is a version of the parallel sampling model (PSM). PSM was originally introduced in [16] and is commonly used in the reinforcement learning literature as an exploration policy that achieves perfect exploration (cf. [11]). Condition (iv) concerns the accuracy of the update. In particular, (III.8) corresponds to the computation of $\hat{\theta}_{n+1}$ in (III.7). Based on the separability of the objective illustrated in (III.5), the convexity of $\big{(}y_{ik}-(C(i,k,j)+\gamma\hat{v}(j)-q)_{+}\big{)}^{2}$ in $y_{ik}$, and the observed good behavior of $q\mapsto\sum_{j\in\mathbb{X}}\hat{T}^{k}_{ij}\big{(}c_{t}+\gamma\hat{v}(x_{t+1})-q\big{)}_{+}$, we consider (III.8) reasonable.

Below is our main result. The proof is deferred to the appendix.

Theorem III.2.

Suppose Assumption III.1 holds. Let $t_{\max}>\ell$. Given data $\{(x_{t},a_{t},x_{t+1},c_{t})\}_{t=1}^{t_{\max}-1}$ and an arbitrary $\hat{v}_{0}:\mathbb{X}\to[0,\frac{c_{\text{max}}}{1-\gamma}]$, we compute $\{(\hat{\theta}_{n},\hat{v}_{n})\}_{n\in\mathbb{N}}$ according to (III.7), approximately as in Assumption III.1 (iv). Then, for any $\varepsilon\in(0,1]$, there is a probability of at least

[TABLE]

that

[TABLE]

for all $n\in\mathbb{N}$ .

Sometimes it is advisable to assume that $\varepsilon_{e}$ is proportional to $(|\mathbb{X}||\mathbb{A}|)^{-1}$ . In order to maintain the same level of accuracy (in terms of the probability bound), we need to set $t_{\max}\propto|\mathbb{X}|^{2}|\mathbb{A}|^{2}\log(|\mathbb{X}|^{2}|\mathbb{A}|)$ . In this case, the effort required for exploration only needs to grow polynomially as $|\mathbb{X}||\mathbb{A}|$ increases.
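As a quick arithmetic illustration of this scaling, the sketch below evaluates the stated proportionality for two problem sizes. The function name `exploration_budget` and the constant `const` are hypothetical; only the ratio between two sizes is meaningful.

```python
import math

def exploration_budget(n_states, n_actions, const=1.0):
    """t_max proportional to |X|^2 |A|^2 log(|X|^2 |A|); `const` is a hypothetical constant."""
    size = n_states * n_actions
    return const * size**2 * math.log(n_states**2 * n_actions)

# Doubling both |X| and |A| from 4 to 8 multiplies |X||A| by 4.
ratio = exploration_budget(8, 8) / exploration_budget(4, 4)
```

Here `ratio` equals 24: a factor of 16 from the polynomial term and a factor of 1.5 from the logarithm, since $\log 512/\log 64=9/6$.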

III-C Implementation

In our algorithm, we use a deep neural network for $f_{\theta}$ . We let $m_{\text{grid}}\in\mathbb{N}$ and $(q_{1},\dots,q_{m_{\text{grid}}})$ be a pre-selected grid on $[0,\frac{c_{\max}}{1-\gamma}]$ . We are given data $\{(x_{t},a_{t},x_{t+1},c_{t})\}_{t=1}^{t_{\text{max}}-1}$ , and an a priori guess $\hat{v}$ of the value function.

Instead of strictly following (III.7), we consider the updating procedure below

[TABLE]

where $\psi$ is a penalty that encourages monotonicity of $q\mapsto f_{\theta}(i,k,q)$, defined as

[TABLE]

and $\beta\geq 0$ is the regularization parameter. We use stochastic gradient descent for $\operatorname*{arg\,min}_{\theta\in\Theta}$ . After (III.10) is done, we perform

[TABLE]

where we recall that $\mathcal{M}$ is a finite set of discrete probabilities on $[0,1]$, and thus the integral $\int_{(0,1]}$ is in fact a finite sum. We use gradient descent with random initialization for $\inf_{q\in[0,\frac{c_{\text{max}}}{1-\gamma}]}$, and random search for $\inf_{\lambda\in\mathcal{P}(\mathbb{A})}$. After (III.11) is done, we may return to (III.10) for the next round of updates. In order to obtain an approximate optimal policy $\hat{\pi}$, we record the approximate minimizer of $\inf_{\lambda\in\mathcal{P}(\mathbb{A})}$ for each $i$.

We summarize the implementation in Algorithm 1. We point out that Algorithm 1 can be integrated asynchronously into a larger implementation that involves running data.
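A minimal sketch of the value update at a single state may help fix ideas. It is written under assumptions not taken from the text: each $\mu\in\mathcal{M}$ is a list of `(kappa, weight)` pairs, the inner risk term has the Rockafellar-Uryasev form $\inf_{q}\{q+\kappa^{-1}\mathbb{E}((Z-q)_{+})\}$, and the infimum over $q$ is taken on the grid rather than by gradient descent. The function `value_update` and its arguments are hypothetical names.

```python
import numpy as np

def value_update(g_vals, q_grid, risk_measures, n_candidates=2000, seed=0):
    """One state's update: inf over lambda, sup over mu, grid-inf over q.

    g_vals[k, m]   approximates f_theta(i, k, q_grid[m]) for a fixed state i
    risk_measures  list of mu, each a list of (kappa, weight) pairs
    """
    rng = np.random.default_rng(seed)
    n_actions = g_vals.shape[0]
    # Random search over the simplex, seeded with the deterministic actions (vertices).
    candidates = np.vstack([np.eye(n_actions),
                            rng.dirichlet(np.ones(n_actions), size=n_candidates)])
    h = candidates @ g_vals                  # h[l, m] = sum_k lambda_k * g_vals[k, m]
    risk = np.full(len(candidates), -np.inf)
    for mu in risk_measures:
        total = np.zeros(len(candidates))
        for kappa, weight in mu:
            # Assumed form: inf_q { q + E[(Z - q)_+] / kappa }, with q restricted to the grid.
            total += weight * np.min(q_grid[None, :] + h / kappa, axis=1)
        risk = np.maximum(risk, total)       # sup over mu
    best = int(np.argmin(risk))              # inf over lambda
    return risk[best], candidates[best]

# Toy check: two actions with deterministic costs 1 and 2, and the risk-neutral mu = delta_1.
q_grid = np.linspace(0.0, 3.0, 301)
g_vals = np.maximum(np.array([[1.0], [2.0]]) - q_grid[None, :], 0.0)
v_new, lam_star = value_update(g_vals, q_grid, risk_measures=[[(1.0, 1.0)]])
```

In the toy check the update selects the cheaper deterministic action and returns a value of 1 up to rounding. Recording `lam_star` for each state is what yields the approximate policy $\hat{\pi}$.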

IV Numeric experiments

In this section, we present numerical experiments to validate the performance of Algorithm 1. We consider a state-action space with $|\mathbb{X}|=|\mathbb{A}|=4$ , and a discount factor of $\gamma=0.3$ . We use the following $\mathcal{M}$ for the conditional risk mapping (II.2):

[TABLE]

where $\delta$ denotes the Dirac measure. The transition matrices used in the experiment are randomly generated. Although our algorithm was introduced for deterministic latent costs, it also handles random costs without requiring significant modifications. We test the algorithm with various random costs, such as $\text{Beta}(\alpha,\beta)$ with $\alpha,\beta:\mathbb{X}\times\mathbb{A}\times\mathbb{X}\to(0,\infty)$ depending on the current state, the realized action, and the next state. We assume knowledge of $[0,c_{\max}]$, and set $(q_{1},\dots,q_{m_{\text{grid}}})$ as a uniform partition of $[0,c_{\max}]$ with $m_{\text{grid}}=100$. We set $t_{\text{max}}=10000$ and sample according to a randomly picked stationary policy. We then compute $\hat{v}$ and $\hat{\pi}$ using Algorithm 1. To ensure accuracy when updating $\hat{v}$ and $\hat{\pi}$, we perform a thorough random search. However, we conjecture that there is a certain structure that can be exploited in learning $\hat{v}$ and $\hat{\pi}$, so that the computation cost does not grow exponentially as $|\mathbb{A}|$ increases. In Figure 1, we plot the relative errors of $\hat{v}$ for each $i\in\mathbb{X}$ in 10 different experiments. The benchmark in each experiment is computed using brute force search.
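A data-generation step of this kind could be set up as in the sketch below. The sizes, discount factor, sample length, and grid size follow the text; the sampling distributions for the transition matrices, the Beta parameters, and the stationary policy are assumptions, since the text only says they are randomly chosen.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x = n_a = 4
gamma, t_max, m_grid = 0.3, 10_000, 100

T = rng.dirichlet(np.ones(n_x), size=(n_a, n_x))      # T[k, i, :] is a distribution over next states
alpha = rng.uniform(0.5, 5.0, size=(n_x, n_a, n_x))   # hypothetical Beta parameters
beta = rng.uniform(0.5, 5.0, size=(n_x, n_a, n_x))
policy = rng.dirichlet(np.ones(n_a), size=n_x)        # random stationary policy, one simplex point per state

c_max = 1.0                                           # Beta-distributed costs lie in [0, 1]
q_grid = np.linspace(0.0, c_max, m_grid)              # uniform grid on [0, c_max]

# Sample the trajectory {(x_t, a_t, x_{t+1}, c_t)} for t = 1, ..., t_max - 1.
data = []
x = int(rng.integers(n_x))
for _ in range(t_max - 1):
    a = int(rng.choice(n_a, p=policy[x]))
    x_next = int(rng.choice(n_x, p=T[a, x]))
    cost = float(rng.beta(alpha[x, a, x_next], beta[x, a, x_next]))
    data.append((x, a, x_next, cost))
    x = x_next
```

The resulting `data` list has the same layout as the observations assumed in Section III-B and can be passed to any implementation of the fitting and value-update steps.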

Appendix A Proof of Theorem III.2

We fix $t_{\text{max}}>1$ and $\pi:\mathbb{X}\to\mathcal{P}(\mathbb{A})$ for the remainder of this section. Firstly, we will introduce the contraction property of $S$ .

Lemma A.1.

For any $v,v^{\prime}:\mathbb{X}\to\mathbb{R}$ , $\|Sv-Sv^{\prime}\|_{\infty}\leq\gamma\|v-v^{\prime}\|_{\infty}$ .

Proof.

This is an immediate consequence of [7, Lemma 3.3]. ∎

The proof of Theorem III.2 is also dependent on the following two technical lemmas.

Lemma A.2.

For any $(i,k,j)\in\mathbb{X}\times\mathbb{A}\times\mathbb{X}$, $\varepsilon\in(0,1)$ and integer $N<\varepsilon_{e}\lfloor\frac{t_{\max}-1}{\ell}\rfloor$, we have

[TABLE]

Proof.

To start with, note that

[TABLE]

Therefore,

[TABLE]

In order to investigate the first term on the right-hand side of (A), we introduce an auxiliary process. For $\iota=1,\dots,\lfloor\frac{t_{\max}-1}{\ell}\rfloor$, we let

[TABLE]

Note that $(L_{\iota})_{\iota=1}^{\lfloor\frac{t_{\max}-1}{\ell}\rfloor}$ is a sub-martingale under the filtration $(\mathcal{F}^{\pi}_{\ell\iota})_{\iota=1}^{\lfloor\frac{t_{\max}-1}{\ell}\rfloor}$. Indeed, by the Markov property of $\{(X^{\pi}_{t},A^{\pi}_{t})\}_{t\in\mathbb{N}}$, we have

[TABLE]

where we have used Assumption III.1 (iii) in the last equality. Then, by Azuma’s inequality, for $N<\varepsilon_{e}\lfloor\frac{t_{\max}-1}{\ell}\rfloor$ ,

[TABLE]

Regarding the second term in (A), we define $M^{ikj}_{1}:=0$ and

[TABLE]

Note that $(M^{ikj}_{t})_{t\in\mathbb{N}}$ is a $(\mathcal{F}^{\pi}_{t})_{t\in\mathbb{N}}$ -martingale:

[TABLE]

where we have used the Markov property of $\{(X^{\pi}_{t},A^{\pi}_{t})\}_{t\in\mathbb{N}}$ in the second line. It follows from Azuma’s inequality that

[TABLE]

Finally, by combining (A), (A) and (A.3), we complete the proof. ∎

Lemma A.3.

Let $v^{*}$ be the fixed point of $S$ defined in (II.4). Let $\hat{v}$ and $\hat{v}_{\text{new}}$ be introduced as in Assumption III.1 (iv). Then,

[TABLE]

Proof.

To start with, by (III.9) and the fact that $v^{*}=Sv^{*}$ ,

[TABLE]

Then, by Assumption III.1 (ii), (III.8) and Lemma A.1,

[TABLE]

The proof is complete. ∎

We are now in position to prove Theorem III.2.

Proof of Theorem III.2.

We first simplify Lemma A.2 by letting $N=\lceil\frac{1}{2}\varepsilon_{e}\lfloor\frac{t_{\max}-1}{\ell}\rfloor\rceil$

[TABLE]

where we have used the fact that $\lfloor\frac{t_{\max}-1}{\ell}\rfloor\geq\frac{t_{\max}}{\ell}-1\geq 0$ in the second line. Consequently,

[TABLE]

Finally, on the event that $\bigg{|}\frac{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k,j)}(x_{t},a_{t},x_{t+1})}{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(x_{t},a_{t})}-T^{k}_{ij}\bigg{|}\leq\varepsilon$ for all $(i,k,j)\in\mathbb{X}\times\mathbb{A}\times\mathbb{X}$, invoking (A.3) iteratively, we obtain

[TABLE]

which completes the proof. ∎

Bibliography

[1] T. W. Anderson and L. A. Goodman, “Statistical inference about Markov chains,” The Annals of Mathematical Statistics, vol. 28, no. 1, pp. 89–110, 1957.
[2] N. Bäuerle and A. Glauner, “Markov decision processes with iterated coherent risk measures,” European Journal of Operational Research, vol. 296, no. 3, pp. 953–966, 2022.
[3] N. Bäuerle and U. Rieder, “More risk-sensitive Markov decision processes,” Mathematics of Operations Research, vol. 39, no. 1, pp. 105–120, 2013.
[4] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” Proceedings of Machine Learning Research, vol. 70, pp. 449–458, 2017.
[5] T. R. Bielecki, T. Chen, and I. Cialenco, “Risk-sensitive Markov decision problems under model uncertainty: finite time horizon case,” arXiv:2104.06915, 2021.
[6] V. I. Bogachev, Measure Theory Volume I. Springer-Verlag Berlin Heidelberg, 2007.
[7] Z. Cheng and S. Jaimungal, “Markov decision processes with Kusuoka-type conditional risk mappings,” Preprint, 2022.
[8] S. Chu and Y. Zhang, “Markov decision processes with iterated coherent risk measures,” International Journal of Control, vol. 88, no. 11, pp. 2286–2293, 2014.
