Convergence of regularized agent-state-based Q-learning in POMDPs

Amit Sinha; Matthieu Geist; Aditya Mahajan

arXiv:2508.21314·cs.LG·September 4, 2025

TL;DR

This paper analyzes the convergence of a class of Q-learning algorithms in POMDPs that use agent states and regularization, providing theoretical guarantees and empirical validation.

Contribution

It introduces RASQL, a framework for understanding convergence of regularized agent-state-based Q-learning in POMDPs, including variants for periodic policies.

Findings

1. RASQL converges to a fixed point of a regularized MDP.

2. Convergence holds under mild technical conditions.

3. Empirical results match theoretical predictions.

Equations

$$\Omega^{\star}(q)=\max_{p\in\mathds{R}^{n}}\bigl\{\langle p,q\rangle-\Omega(p)\bigr\}.$$

$$\nabla\Omega^{\star}(q)=\operatorname*{arg\,max}_{p\in\Delta}\bigl\{\langle p,q\rangle-\Omega(p)\bigr\}.$$

$$p^{\star}(a)=\frac{\exp(\beta q(a))}{\sum_{a^{\prime}\in\mathsf{A}}\exp(\beta q(a^{\prime}))}.$$

$$p^{\star}(a)=\frac{p_{\textsc{ref}}(a)\exp(\beta q(a))}{\sum_{a^{\prime}\in\mathsf{A}}p_{\textsc{ref}}(a^{\prime})\exp(\beta q(a^{\prime}))}.$$

$$\mathds{P}(s_{t+1}\mid s_{1:t},a_{1:t})=\mathds{P}(s_{t+1}\mid s_{t},a_{t})=:P(s_{t+1}\mid s_{t},a_{t}),$$

$$J^{\Omega}_{\pi}\coloneqq\mathds{E}^{\pi}\biggl[\medop\sum_{t=1}^{\infty}\gamma^{t-1}\bigl[r(s_{t},a_{t})-\Omega(\pi(\cdot\mid s_{t}))\bigr]\biggm|s_{1}\sim\rho\biggr],$$

$$\mathcal{B}^{\Omega}Q(s,a)=r(s,a)+\gamma\medop\sum_{s^{\prime}\in\mathsf{S}}P(s^{\prime}\mid s,a)\,\Omega^{\star}(Q(s^{\prime},\cdot)),$$

$$Q^{\Omega}(s,a)=r(s,a)+\gamma\medop\sum_{s^{\prime}\in\mathsf{S}}P(s^{\prime}\mid s,a)\,\Omega^{\star}(Q^{\Omega}(s^{\prime},\cdot)).$$

$$\pi^{\Omega,\star}(\cdot\mid s)=\operatorname*{arg\,max}_{\xi\in\Delta(\mathsf{A})}\Bigl\{\medop\sum_{a\in\mathsf{A}}\xi(a)Q^{\Omega}(s,a)-\Omega(\xi)\Bigr\}$$

$$\mathds{P}(s_{t+1},y_{t+1}\mid s_{1:t},y_{1:t},a_{1:t})=\mathds{P}(s_{t+1},y_{t+1}\mid s_{t},a_{t})=:P(s_{t+1},y_{t+1}\mid s_{t},a_{t}),$$

$$J_{\boldsymbol{\vec{\pi}}}\coloneqq\mathds{E}^{\boldsymbol{\vec{\pi}}}\biggl[\medop\sum_{t=1}^{\infty}\gamma^{t-1}r(s_{t},a_{t})\biggm|s_{1}\sim\rho\biggr],$$

$$z_{t+1}=\phi(z_{t},y_{t+1},a_{t}),\quad t\geq 0$$

$$Q_{t+1}(z,a)=Q_{t}(z,a)+\alpha_{t}(z,a)\bigl[r_{t}+\gamma\,\Omega^{\star}(Q_{t}(z_{t+1},\cdot))-Q_{t}(z,a)\bigr],$$

$$Q_{t+1}(z_{t},a_{t})=Q_{t}(z_{t},a_{t})+\alpha_{t}(z_{t},a_{t})\bigl[r_{t}+\gamma\max_{a^{\prime}\in\mathsf{A}}Q_{t}(z_{t+1},a^{\prime})-Q_{t}(z_{t},a_{t})\bigr].$$

$$r_{\mu}(z,a)$$

$$P_{\mu}(z^{\prime}\mid z,a)$$

$$Q_{\mu}(z,a)=r_{\mu}(z,a)+\gamma\medop\sum_{z^{\prime}\in\mathsf{Z}}P_{\mu}(z^{\prime}\mid z,a)\,\Omega^{\star}(Q_{\mu}(z^{\prime},\cdot)).$$

$$Q^{\ell}_{t+1}(z,a)=Q^{\ell}_{t}(z,a)+\alpha^{\ell}_{t}(z,a)\bigl[r_{t}+\gamma\,\Omega^{\star}(Q^{\llbracket\ell+1\rrbracket}_{t}(z^{\prime},\cdot))-Q^{\ell}_{t}(z,a)\bigr].$$

$$r^{\ell}_{\mu}(z,a)$$

$$P^{\ell}_{\mu}(z^{\prime}\mid z,a)$$

$$\mathcal{B}^{\ell}_{\mu}Q(z,a)=r^{\ell}_{\mu}(z,a)+\gamma\medop\sum_{z^{\prime}\in\mathsf{Z}}P^{\ell}_{\mu}(z^{\prime}\mid z,a)\,\Omega^{\star}(Q(z^{\prime},\cdot)).$$

$$\mathbb{B}^{\ell}_{\mu}=\mathcal{B}^{\ell}_{\mu}\mathcal{B}^{\llbracket\ell+1\rrbracket}_{\mu}\cdots\mathcal{B}^{\llbracket\ell+L-1\rrbracket}_{\mu}.$$

$$Q^{\ell}_{\mu}(z,a)=r^{\ell}_{\mu}(z,a)+\gamma\medop\sum_{z^{\prime}\in\mathsf{Z}}P^{\ell}_{\mu}(z^{\prime}\mid z,a)\,V^{\llbracket\ell+1\rrbracket}_{\mu}(z^{\prime}).$$

$$\rho(s)=[0.3,\ 0.0,\ 0.2,\ 0.5]$$

$$r(s,0)$$

$$P(s^{\prime}\mid s,0)$$

$$r(s,1)$$

$$P(s^{\prime}\mid s,1)$$

$$\mu(a\mid z)=\begin{bmatrix}0.2&0.8\\0.8&0.2\end{bmatrix}.$$

$$\mu^{0}(a\mid z)=\begin{bmatrix}0.2&0.8\\0.8&0.2\end{bmatrix},\quad\mu^{1}(a\mid z)=\begin{bmatrix}0.8&0.2\\0.2&0.8\end{bmatrix}.$$

Taxonomy

Topics: Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics · Neural Networks and Reservoir Computing

Full text

Convergence of regularized agent-state-based Q-learning in POMDPs

Amit Sinha, Matthieu Geist, Aditya Mahajan

A. Sinha and A. Mahajan are with the Department of Electrical and Computer Engineering, McGill University, Canada. Emails: [email protected], [email protected]. Their work was supported in part by a grant from Google’s Institutional Research Program in collaboration with Mila. M. Geist is with the Earth Species Project. Email: [email protected]

Abstract

In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state-based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.

I Introduction

Reinforcement learning (RL) is a useful paradigm for learning optimal control policies via simulation when the system model is not available or when the system is too large to solve the dynamic program explicitly. The simplest setting is the fully observed setting of Markov decision processes (MDPs), where the controller has access to the environment state. Most existing theoretical results in RL, including convergence guarantees for learning algorithms, rates of convergence, and regret bounds, are established for the MDP setting.

However, in many real-world applications, such as autonomous driving, robotics, healthcare, and finance, the controller does not have access to the environment state; rather, it has a partial observation of the environment state. These applications therefore need to be modeled as a partially observable Markov decision process (POMDP) rather than an MDP.

When the system model is known, the POMDP model can be converted into an MDP by considering the controller’s belief on the state of the environment (also called the belief state) as an information state [1, 2, 3]. However, such a reduction does not work in the RL setting because the belief state depends on the system model, which is unknown. Nonetheless, there have been several empirical works which show that standard RL algorithms for MDPs continue to work for POMDPs if one uses “frame stacking” (i.e., use the last few observations as a state) or recurrent neural networks [4, 5, 6, 7]. In recent years, considerable progress has been made in understanding the properties of such algorithms but a complete theoretical understanding is still lacking.

A common way to model such RL algorithms for POMDPs is to consider the state of the controller as an agent state [8]. Such agent-state-based controllers have also been considered in the planning setting as they can be simpler to implement than belief-state-based controllers. See [9] for an overview.

A challenge in understanding the convergence of agent-state-based RL algorithms for POMDPs is that an agent state is not an information state. It is therefore not possible to write a dynamic programming decomposition based on the agent state, and one cannot follow the typical proof techniques used to establish the convergence of RL algorithms for MDPs (where RL algorithms can be viewed as stochastic approximation variants of MDP algorithms, such as value iteration and policy iteration, for computing the optimal policy).

There is a good understanding of the convergence of agent-state-based Q-learning (ASQL) for POMDPs [10, 11, 12] (which is related to Q-learning for non-Markovian environments [13, 14]). There is also some work on understanding the convergence of actor-critic algorithms for POMDPs [15, 3]. However, most practical RL algorithms for POMDPs use some form of policy regularization, while most theoretical analysis is restricted to the unregularized setting.

Regularization adds an auxiliary loss to the per-step rewards. This loss typically depends on the policy but may also depend on the value function. Regularization is commonly used in RL algorithms for various reasons, such as entropy regularization to encourage exploration [16, 17, 18] and improve generalization [19], and KL-regularization to constrain the policy updates to be similar to a prior policy [20, 21]. A unified theory for different facets of regularization in MDPs is provided in [22, 23].

Because of its various benefits in RL for MDPs, regularization is also commonly used in RL for POMDPs [5, 21, 16, 24, 25, 26, 3, 27]. However, the recent theoretical analyses of RL for POMDPs discussed above do not consider regularization. The objective of this paper is to present initial results on understanding regularization in RL for POMDPs.

There is some recent work on understanding regularization in POMDPs, but it either considers the role of entropy regularization in POMDP solvers (when the model is known) [28, 29], or considers regularization of the belief distribution [30] or the observation distribution [31]. These results do not directly provide an understanding of the role of regularization in RL for POMDPs.

In this paper, we revisit Q-learning for POMDPs when the learning agent is using an agent state and using policy regularization. Our main contribution is to show that in this setting, Q-learning converges under mild technical conditions. We characterize the converged limit in terms of the model parameters and choice of behavioral policy used in Q-learning. Recently, it has been argued that periodic policies may perform better when considering agent-state-based POMDPs [12]. We show that our analysis extends to a periodic version of regularized Q-learning as well.

Notation: We use uppercase letters to denote random variables (e.g., $S,A$, etc.), lowercase letters to denote their realizations (e.g., $s,a$, etc.) and calligraphic letters to denote sets (e.g., $\mathsf{S},\mathsf{A}$, etc.). Subscripts (e.g., $S_{t},A_{t}$, etc.) denote variables at time $t$. Similarly, $S_{1:t}$ denotes the collection of random variables from time $1$ to $t$. $\Delta(\mathsf{S})$ denotes the space of probability measures on a set $\mathsf{S}$; $\mathds{P}(\cdot)$ and $\mathds{E}[\cdot]$ denote the probability of an event and the expectation of a random variable, respectively; and $\mathds{1}(\cdot)$ denotes the indicator function. $\lvert\mathsf{S}\rvert$ denotes the number of elements in $\mathsf{S}$ (when it is a finite set). $\mathds{R}$ denotes the real numbers. $[L]$ denotes the set of integers from $0$ to $L-1$, where $L\in\mathds{Z}^{+}$. $\llbracket\ell\rrbracket$ denotes $(\ell\text{ mod }L)$.

II Background

II-A Legendre-Fenchel transform (convex conjugate)

We start with a short review of convex conjugates and Legendre-Fenchel transforms [32], which are an important tool to understand regularization in MDPs [23].

Definition 1

For a strongly convex function $\Omega\colon\mathds{R}^{n}\to\mathds{R}$ , its convex conjugate $\Omega^{\star}\colon\mathds{R}^{n}\to\mathds{R}$ is defined as

$$\Omega^{\star}(q)=\max_{p\in\mathds{R}^{n}}\bigl\{\langle p,q\rangle-\Omega(p)\bigr\}.$$

The mapping $\Omega\mapsto\Omega^{\star}$ is the Legendre-Fenchel transform. □

The following is a useful property of the Legendre-Fenchel transform for regularized MDPs:

Lemma 1 (Based on [33, 34])

Let $\Delta$ be a simplex in $\mathds{R}^{n}$ and $\Omega\colon\Delta\to\mathds{R}$ be a twice differentiable and strongly convex function. Let $\Omega^{\star}\colon\mathds{R}^{n}\to\mathds{R}$ be the Legendre-Fenchel transform of $\Omega$. Then, $\nabla\Omega^{\star}$ is Lipschitz and satisfies

$$\nabla\Omega^{\star}(q)=\operatorname*{arg\,max}_{p\in\Delta}\bigl\{\langle p,q\rangle-\Omega(p)\bigr\}.$$

□

In Markov decision processes, one often regularizes the policy. Below we describe some of the commonly used policy regularizers. For the purpose of the discussion below, let $\mathsf{A}$ be a finite set (later we will take $\mathsf{A}$ to be the set of actions of an MDP, but for now we can consider it as a generic set).

1. Entropy regularization uses the regularizer $\Omega\colon\Delta(\mathsf{A})\to\mathds{R}$ given by $\Omega(p)=\frac{1}{\beta}\medop\sum_{a\in\mathsf{A}}p(a)\ln p(a)$, where $\beta\in\mathds{R}_{>0}$ is a parameter. Its convex conjugate $\Omega^{\star}\colon\mathds{R}^{|\mathsf{A}|}\to\mathds{R}$ is given by $\Omega^{\star}(q)=\frac{1}{\beta}\ln\bigl(\medop\sum_{a\in\mathsf{A}}\exp(\beta q(a))\bigr).$ Furthermore, from Lemma 1, we get that the argmax in the definition of convex conjugate is achieved by

$$p^{\star}(a)=\frac{\exp(\beta q(a))}{\sum_{a^{\prime}\in\mathsf{A}}\exp(\beta q(a^{\prime}))}.$$

2. KL regularization uses the regularizer $\Omega\colon\Delta(\mathsf{A})\to\mathds{R}$ given by $\Omega(p)=\frac{1}{\beta}\medop\sum_{a\in\mathsf{A}}p(a)\ln({p(a)}/{p_{\textsc{ref}}(a)}),$ where $\beta\in\mathds{R}_{>0}$ is a parameter and $p_{\textsc{ref}}\in\Delta(\mathsf{A})$ is a reference distribution. Its convex conjugate $\Omega^{\star}\colon\mathds{R}^{|\mathsf{A}|}\to\mathds{R}$ is given by $\Omega^{\star}(q)=\frac{1}{\beta}\ln\bigl(\medop\sum_{a\in\mathsf{A}}p_{\textsc{ref}}(a)\exp(\beta q(a))\bigr).$ Furthermore, from Lemma 1, we get that the argmax in the definition of convex conjugate is achieved by

$$p^{\star}(a)=\frac{p_{\textsc{ref}}(a)\exp(\beta q(a))}{\sum_{a^{\prime}\in\mathsf{A}}p_{\textsc{ref}}(a^{\prime})\exp(\beta q(a^{\prime}))}.$$
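The closed forms for the two regularizers can be checked numerically. The following is a minimal sketch, not code from the paper: the function names and the test values of $q$, $\beta$ and $p_{\textsc{ref}}$ are hypothetical. It evaluates the conjugate as a (weighted) log-sum-exp, evaluates the maximizer as a (weighted) soft-max, and confirms that plugging the maximizer into $\langle p,q\rangle-\Omega(p)$ recovers $\Omega^{\star}(q)$.

```python
import math

def conjugate(q, beta, p_ref=None):
    """Omega*(q) = (1/beta) * ln(sum_a p_ref(a) * exp(beta * q(a))).

    With p_ref=None every weight is 1, which gives the entropy case.
    """
    weights = p_ref if p_ref is not None else [1.0] * len(q)
    m = max(q)  # subtract the max before exponentiating for numerical stability
    total = sum(w * math.exp(beta * (x - m)) for w, x in zip(weights, q))
    return m + math.log(total) / beta

def maximizer(q, beta, p_ref=None):
    """p*(a) proportional to p_ref(a) * exp(beta * q(a)) (soft-max when p_ref is None)."""
    weights = p_ref if p_ref is not None else [1.0] * len(q)
    m = max(q)
    unnorm = [w * math.exp(beta * (x - m)) for w, x in zip(weights, q)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def objective(p, q, beta, p_ref=None):
    """<p, q> - Omega(p), the quantity maximized in the definition of Omega*."""
    weights = p_ref if p_ref is not None else [1.0] * len(q)
    omega = sum(pa * math.log(pa / w) for pa, w in zip(p, weights) if pa > 0) / beta
    return sum(pa * qa for pa, qa in zip(p, q)) - omega

# Hypothetical values chosen only for illustration.
q = [1.0, 0.0, -0.5]
beta = 2.0
p_ref = [0.5, 0.3, 0.2]

p_entropy = maximizer(q, beta)
p_kl = maximizer(q, beta, p_ref)
gap_entropy = abs(objective(p_entropy, q, beta) - conjugate(q, beta))
gap_kl = abs(objective(p_kl, q, beta, p_ref) - conjugate(q, beta, p_ref))
```

Both gaps should be zero up to floating-point error, and any other distribution (for example the uniform one) should give an objective value no larger than the conjugate.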

II-B Regularized MDPs

In this section, we provide a brief review of regularized Markov decision processes (MDPs), which are a generalization of standard MDPs with an additional “regularization cost” at each stage.

Consider a Markov decision process (MDP) with state $s_{t}\in\mathsf{S}$ , control action $a_{t}\in\mathsf{A}$ , where all sets are finite. The system operates in discrete time. The initial state $s_{1}\sim\rho$ and for any time $t\in\mathds{N}$ , we have

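$$\mathds{P}(s_{t+1}\mid s_{1:t},a_{1:t})=\mathds{P}(s_{t+1}\mid s_{t},a_{t})=:P(s_{t+1}\mid s_{t},a_{t}),$$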

where $P$ is a probability transition matrix. The system yields a reward $R_{t}=r(s_{t},a_{t})\in[0,R_{\max}]$ . The rewards are discounted by a factor $\gamma\in[0,1)$ .

Consider a policy $\pi:\mathsf{S}\to\Delta(\mathsf{A})$ . Let $\Omega\colon\Delta(\mathsf{A})\to\mathds{R}$ be a strongly convex function that is used as a policy regularizer. Then, the regularized performance of policy $\pi$ is given by

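$$J^{\Omega}_{\pi}\coloneqq\mathds{E}^{\pi}\biggl[\medop\sum_{t=1}^{\infty}\gamma^{t-1}\bigl[r(s_{t},a_{t})-\Omega(\pi(\cdot\mid s_{t}))\bigr]\biggm|s_{1}\sim\rho\biggr],$$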

where the notation $\mathds{E}^{\pi}$ means that the expectation is taken with the joint measure on the system variables induced by the policy $\pi$ .

The objective in a regularized MDP is to find a policy $\pi$ that maximizes the regularized performance $J^{\Omega}_{\pi}$ defined above. A key step in understanding the optimal solution of the regularized MDP is to define the regularized Bellman operator $\mathcal{B}^{\Omega}$ on the space of real-valued functions on $\mathsf{S}\times\mathsf{A}$ as follows. For any $Q:\mathsf{S}\times\mathsf{A}\to\mathds{R}$ ,

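$$\mathcal{B}^{\Omega}Q(s,a)=r(s,a)+\gamma\medop\sum_{s^{\prime}\in\mathsf{S}}P(s^{\prime}\mid s,a)\,\Omega^{\star}(Q(s^{\prime},\cdot)),$$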

where $\Omega^{\star}$ is the Legendre-Fenchel transform of $\Omega$ .

Proposition 1 (Based on [23])

The following results hold:

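1. The operator $\mathcal{B}^{\Omega}$ is a contraction and therefore has a unique fixed point, which we denote by $Q^{\Omega}$. By definition,

$$Q^{\Omega}(s,a)=r(s,a)+\gamma\medop\sum_{s^{\prime}\in\mathsf{S}}P(s^{\prime}\mid s,a)\,\Omega^{\star}(Q^{\Omega}(s^{\prime},\cdot)).$$

2. Define the policy $\pi^{\Omega,\star}\colon\mathsf{S}\to\Delta(\mathsf{A})$ as follows: for any $s\in\mathsf{S}$,

$$\pi^{\Omega,\star}(\cdot\mid s)=\nabla\Omega^{\star}(Q^{\Omega}(s,\cdot))=\operatorname*{arg\,max}_{\xi\in\Delta(\mathsf{A})}\Bigl\{\medop\sum_{a\in\mathsf{A}}\xi(a)Q^{\Omega}(s,a)-\Omega(\xi)\Bigr\},$$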

where the last equality follows from Lemma 1. Then, the policy $\pi^{\Omega,\star}$ is optimal for maximizing the regularized performance $J^{\Omega}_{\pi}$ over the set of all policies.

□
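Because $\mathcal{B}^{\Omega}$ is a contraction, its fixed point $Q^{\Omega}$ can be approximated by repeated application of the operator. The sketch below illustrates this for the entropy regularizer, for which $\Omega^{\star}$ is a scaled log-sum-exp. The two-state, two-action MDP (`r`, `P`) and all function names are hypothetical and are not taken from the paper.

```python
import math

def soft_value(q_row, beta):
    """Omega*(q) for the entropy regularizer: (1/beta) * ln(sum_a exp(beta * q(a)))."""
    m = max(q_row)
    return m + math.log(sum(math.exp(beta * (x - m)) for x in q_row)) / beta

def regularized_bellman(Q, r, P, gamma, beta):
    """One application of B^Omega: r(s,a) + gamma * sum_s' P(s'|s,a) * Omega*(Q(s',.))."""
    n_states = len(Q)
    V = [soft_value(Q[s], beta) for s in range(n_states)]
    return [
        [r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
         for a in range(len(Q[s]))]
        for s in range(n_states)
    ]

def sup_distance(Q1, Q2):
    """Sup-norm distance between two Q-tables of the same shape."""
    return max(abs(x - y) for row1, row2 in zip(Q1, Q2) for x, y in zip(row1, row2))

def solve_fixed_point(r, P, gamma, beta, tol=1e-10, max_iter=10_000):
    """Iterate B^Omega from the all-zero table until successive iterates are within tol."""
    Q = [[0.0] * len(row) for row in r]
    for _ in range(max_iter):
        Q_next = regularized_bellman(Q, r, P, gamma, beta)
        done = sup_distance(Q_next, Q) < tol
        Q = Q_next
        if done:
            break
    return Q

# Hypothetical MDP: r[s][a] and P[s][a][s'] (each P[s][a] sums to 1).
r = [[1.0, 0.0], [0.0, 0.5]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
gamma, beta = 0.9, 1.0

Q_fix = solve_fixed_point(r, P, gamma, beta)
residual = sup_distance(regularized_bellman(Q_fix, r, P, gamma, beta), Q_fix)
```

The residual measures how well `Q_fix` satisfies the fixed-point equation of Proposition 1. The optimal regularized policy is then the soft-max of each row of `Q_fix`.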

III System model and regularized Q-learning for POMDPs

III-A Model for POMDPs

Consider a partially observable Markov decision process (POMDP) with state $s_{t}\in\mathsf{S}$ , control action $a_{t}\in\mathsf{A}$ , and output $y_{t}\in\mathsf{Y}$ , where all sets are finite. The system operates in discrete time with the dynamics given as follows. The initial state $s_{1}\sim\rho$ and for any time $t\in\mathds{N}$ , we have

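$$\mathds{P}(s_{t+1},y_{t+1}\mid s_{1:t},y_{1:t},a_{1:t})=\mathds{P}(s_{t+1},y_{t+1}\mid s_{t},a_{t})=:P(s_{t+1},y_{t+1}\mid s_{t},a_{t}),$$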

where $P$ is a probability transition matrix. In addition, at each time the system yields a reward $r_{t}=r(s_{t},a_{t})\in[0,R_{\max}]$ . The rewards are discounted by a factor $\gamma\in[0,1)$ .

Let $\boldsymbol{\vec{\pi}}=(\vec{\pi}_{1},\vec{\pi}_{2},\dots)$ denote any (history dependent and possibly randomized) policy, i.e., under policy $\boldsymbol{\vec{\pi}}$ the action at time $t$ is chosen as $a_{t}\sim\vec{\pi}_{t}(y_{1:t},a_{1:t-1})$ . The performance of policy $\boldsymbol{\vec{{\pi}}}$ is given by

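$$J_{\boldsymbol{\vec{\pi}}}\coloneqq\mathds{E}^{\boldsymbol{\vec{\pi}}}\biggl[\medop\sum_{t=1}^{\infty}\gamma^{t-1}r(s_{t},a_{t})\biggm|s_{1}\sim\rho\biggr],$$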

where the notation $\mathds{E}^{\boldsymbol{\vec{\pi}}}$ means that the expectation is taken with the joint measure on the system variables induced by the policy $\boldsymbol{\vec{\pi}}$ .

The objective is to find a (history dependent and possibly randomized) policy $\boldsymbol{\vec{{\pi}}}$ to maximize $J_{\boldsymbol{\vec{{\pi}}}}$ . When the system model is known, the above POMDP model can be converted to a fully observed Markov decision process (MDP) by considering the controller’s posterior belief on the system state as an information state [1, 2]. However, when the system model is not known, it is not possible to run reinforcement learning (RL) algorithms on the belief-state MDP because the belief depends on the system model. For that reason, in RL for POMDPs it is often assumed that the controller is an agent-state-based controller.

Definition 2 (Agent state)

An agent state is a model-free recursively updateable function of the history of observations and actions. In particular, let $\mathsf{Z}$ denote the agent state space. Then, the agent state is a process $\{z_{t}\}_{t\geq 0}$ , $z_{t}\in\mathsf{Z}$ , which starts with some initial value $z_{0}$ , and is then recursively computed as

$$z_{t+1}=\phi(z_{t},y_{t+1},a_{t}),\quad t\geq 0,$$

where $\phi$ is a pre-specified agent-state update function. □

Some examples of agent-state-based controllers are: (i) a finite memory controller, which chooses the actions based on the previous $k$ observations; (ii) a finite state controller, which effectively filters the possible histories to values from a finite set $\mathsf{Z}$ . We refer the reader to [9] for a detailed review of agent-state-based policies in POMDPs.
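As an illustration of the recursion $z_{t+1}=\phi(z_{t},y_{t+1},a_{t})$, the sketch below implements the finite memory example: the agent state is the tuple of the last $k$ observations. The names `make_window_update` and `run_agent_state` are hypothetical helpers introduced only for this example. With $k=1$ the agent state is just the latest observation.

```python
def make_window_update(k):
    """Return an update function phi for a finite-memory agent state.

    The agent state z is a tuple holding at most the k most recent observations.
    The action argument is accepted to match the signature phi(z, y_next, a),
    but this particular phi does not use it.
    """
    if k < 1:
        raise ValueError("window length k must be at least 1")

    def phi(z, y_next, a):
        return (z + (y_next,))[-k:]

    return phi

def run_agent_state(phi, z0, observations, actions):
    """Apply z_{t+1} = phi(z_t, y_{t+1}, a_t) along a trajectory and return all states."""
    z = z0
    trace = [z]
    for y_next, a in zip(observations, actions):
        z = phi(z, y_next, a)
        trace.append(z)
    return trace

phi = make_window_update(k=2)
trace = run_agent_state(phi, z0=(), observations=[1, 0, 1], actions=[0, 1, 0])
```

Here `trace` starts from the empty tuple and grows until it holds two observations, after which the oldest observation is dropped at every step. The update uses only observed quantities, so it is model-free in the sense of Definition 2.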

We use $\boldsymbol{\pi}=(\pi_{1},\pi_{2},\dots)$ to denote an agent-state-based policy, i.e., a policy where the action at time $t$ is given by $a_{t}\sim\pi_{t}(z_{t})$. (We use $\boldsymbol{\vec{\pi}}$ to denote history dependent policies and $\boldsymbol{\pi}$ to denote agent-state-based policies.) An agent-state-based policy is said to be stationary if for all $t$ and $t^{\prime}$, we have $\pi_{t}(a\mid z)=\pi_{t^{\prime}}(a\mid z)$ for all $(z,a)\in\mathsf{Z}\times\mathsf{A}$.

If the agent state is an information state, then MDP-based RL algorithms can directly be applied to find optimal stationary solutions [3]. However, in general, an agent state is not an information state, as is the case in frame-stacking or when using recurrent neural networks. In such settings, the dynamics of the agent state process are non-Markovian and the standard dynamic programming based argument does not work. It is possible to find the optimal policy by viewing the POMDP with an agent-state-based controller as a decentralized control problem and using the designer’s approach [35] to compute an optimal agent-state-based policy, as is done in [9], but such an approach is intractable for all but small toy problems.

The Q-learning algorithms for POMDPs maintain a Q-table based on the agent states and actions and update the Q-values based on the samples generated by the environment. Since the agent state is non-Markovian, it is not clear if such an iterative scheme converges, and if so, to what value. In the next section, we present a formal model for agent-state-based Q-learning when the agent also uses policy regularization.

III-B Regularized agent-state-based Q-learning for POMDPs

In this section we describe regularized agent-state-based Q-learning (RASQL), which is an online off-policy learning approach in which the agent acts according to a fixed behavioral policy to generate a sample path $(z_{1},a_{1},r_{1},z_{2},\dots)$ of agent states, actions, and rewards observed by a learning agent. We assume that the sampled rewards $r_{t}=r(s_{t},a_{t})$ are available to the agent during the learning process.

The learning agent uses a policy regularizer $\Omega\colon\Delta(\mathsf{A})\to\mathds{R}$ and maintains a regularized Q-table, which is arbitrarily initialized and then recursively updated as follows:

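$$Q_{t+1}(z,a)=Q_{t}(z,a)+\alpha_{t}(z,a)\bigl[r_{t}+\gamma\,\Omega^{\star}(Q_{t}(z_{t+1},\cdot))-Q_{t}(z,a)\bigr],$$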

where the learning rate sequence $\{\alpha_{t}(z,a)\}_{t\geq 1}$ is chosen such that $\alpha_{t}(z,a)=0$ whenever $(z,a)\neq(z_{t},a_{t})$ . For instance, if the policy regularizer is the entropy regularizer, then the above iteration corresponds to an agent-state-based version of soft-Q-learning [36]. The “greedy” policy at each time is given by $\pi_{t}(\cdot\mid z)=\nabla\Omega^{\star}(Q_{t}(z,\cdot))$ . Thus, for entropy regularization, it would correspond to soft-max based on $Q_{t}$ .

If the $\Omega^{\star}(Q_{t}(z_{t+1},\cdot))$ term in the RASQL update above is replaced by $\max_{a^{\prime}\in\mathsf{A}}Q_{t}(z_{t+1},a^{\prime})$, the iteration reduces to agent-state-based Q-learning (ASQL):

$$Q_{t+1}(z_{t},a_{t})=Q_{t}(z_{t},a_{t})+\alpha_{t}(z_{t},a_{t})\bigl[r_{t}+\gamma\max_{a^{\prime}\in\mathsf{A}}Q_{t}(z_{t+1},a^{\prime})-Q_{t}(z_{t},a_{t})\bigr].$$

The convergence of ASQL and its variations has been recently studied in [14, 12, 11]. However, the analysis of ASQL does not include regularization. The main result of this paper is to characterize the convergence of RASQL.
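A single step of the two updates can be sketched as follows, assuming the entropy regularizer so that $\Omega^{\star}$ is a scaled log-sum-exp. The function names and the sample transition are hypothetical. Only the visited pair $(z_{t},a_{t})$ changes, which matches the requirement that $\alpha_{t}(z,a)=0$ for all other pairs.

```python
import math

def soft_value(q_row, beta):
    """Omega*(q) for the entropy regularizer: (1/beta) * ln(sum_a exp(beta * q(a)))."""
    m = max(q_row)
    return m + math.log(sum(math.exp(beta * (x - m)) for x in q_row)) / beta

def rasql_step(Q, z, a, reward, z_next, alpha, gamma, beta):
    """RASQL: move Q[z][a] toward reward + gamma * Omega*(Q[z_next]); updates in place."""
    target = reward + gamma * soft_value(Q[z_next], beta)
    Q[z][a] += alpha * (target - Q[z][a])

def asql_step(Q, z, a, reward, z_next, alpha, gamma):
    """ASQL: the same step with Omega* replaced by a hard max over actions."""
    target = reward + gamma * max(Q[z_next])
    Q[z][a] += alpha * (target - Q[z][a])

def greedy_policy(Q, z, beta):
    """pi_t(.|z) = grad Omega*(Q_t(z,.)), i.e. a soft-max for the entropy regularizer."""
    m = max(Q[z])
    unnorm = [math.exp(beta * (x - m)) for x in Q[z]]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical transition: agent state 0, action 1, reward 1.0, next agent state 1.
Q_soft = [[0.0, 0.0], [0.0, 0.0]]
Q_hard = [[0.0, 0.0], [0.0, 0.0]]
rasql_step(Q_soft, z=0, a=1, reward=1.0, z_next=1, alpha=0.5, gamma=0.9, beta=1.0)
asql_step(Q_hard, z=0, a=1, reward=1.0, z_next=1, alpha=0.5, gamma=0.9)
```

Starting from an all-zero table, the soft target is $1+0.9\ln 2$ while the hard target is $1$, so the regularized update moves the entry further. The log-sum-exp is never smaller than the max, which is why the regularized target is at least as large.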

IV Main result

We impose the following standard assumptions on the model.

Assumption 1

For all $(z,a)$ , the learning rates $\{\alpha_{t}(z,a)\}_{t\geq 1}$ are measurable with respect to the sigma-algebra generated by $(z_{1:t},a_{1:t})$ and satisfy $\alpha_{t}(z,a)=0$ if $(z,a)\neq(z_{t},a_{t})$ . Moreover, $\sum_{t\geq 1}\alpha_{t}(z,a)=\infty$ and $\sum_{t\geq 1}(\alpha_{t}(z,a))^{2}<\infty$ , almost surely.

Assumption 2

The behavior policy $\mu$ is such that the Markov chain $\{(S_{t},Y_{t},Z_{t},A_{t})\}_{t\geq 1}$ converges to a limiting distribution $\zeta_{\mu}$ , where $\sum_{(s,y)}\zeta_{\mu}(s,y,z,a)>0$ for all $(z,a)$ (i.e., all $(z,a)$ are visited infinitely often).

Assumption 1 is the standard assumption for convergence of stochastic approximation algorithms [37]. Assumption 2 ensures persistence of excitation and is a standard assumption in convergence analysis of Q-learning [38, 39, 10, 11, 12].
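As a worked example of a learning rate sequence that satisfies Assumption 1 (an illustrative choice, not one prescribed by the paper), let $N_{t}(z,a)$ be the number of visits to $(z,a)$ up to time $t$ and take the reciprocal of the visit count. The sequence depends only on $(z_{1:t},a_{1:t})$, vanishes off the visited pair, and along the visits to $(z,a)$ it equals $1,\tfrac12,\tfrac13,\dots$, so the two summability conditions hold whenever $(z,a)$ is visited infinitely often.

```latex
\alpha_{t}(z,a) = \frac{\mathds{1}\{(z_{t},a_{t}) = (z,a)\}}{N_{t}(z,a)},
\qquad
N_{t}(z,a) = \sum_{\tau=1}^{t} \mathds{1}\{(z_{\tau},a_{\tau}) = (z,a)\},
\\[4pt]
\sum_{t\geq 1}\alpha_{t}(z,a) = \sum_{n\geq 1}\frac{1}{n} = \infty,
\qquad
\sum_{t\geq 1}\bigl(\alpha_{t}(z,a)\bigr)^{2} = \sum_{n\geq 1}\frac{1}{n^{2}} = \frac{\pi^{2}}{6} < \infty .
```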

For ease of notation, we will continue to use $\zeta_{\mu}$ to denote the marginal and conditional distributions w.r.t. $\zeta_{\mu}$ . In particular, for marginals we use $\zeta_{\mu}(y,z,a)$ to denote $\sum_{s\in\mathsf{S}}\zeta_{\mu}(s,y,z,a)$ and so on; for conditionals, we use $\zeta_{\mu}(s|z,a)$ to denote $\zeta_{\mu}(s,z,a)/\zeta_{\mu}(z,a)$ and so on. Note that $\zeta_{\mu}(s,z,y,a)=\zeta_{\mu}(s,z)\mu(a|z)P(y|s,a)$ . Thus, we have that $\zeta_{\mu}(s|z,a)=\zeta_{\mu}(s|z)$ .

The key idea to characterize the convergence behavior is the following. Given the limiting distribution $\zeta_{\mu}$ , we can define an MDP with state space $\mathsf{Z}$ , action space $\mathsf{A}$ , and per-step reward $r_{\mu}\colon\mathsf{Z}\times\mathsf{A}\to\mathds{R}$ and dynamics $P_{\mu}\colon\mathsf{Z}\times\mathsf{A}\to\Delta(\mathsf{Z})$ given as follows:

[TABLE]

Now consider a regularized version of this MDP, where we regularize the policy using $\Omega$ . Let $Q_{\mu}$ denote the fixed point of the regularized Bellman operator corresponding to this regularized MDP, i.e., $Q_{\mu}$ is the unique fixed point of the following (see the discussion in Sec. II-B):

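$$Q_{\mu}(z,a)=r_{\mu}(z,a)+\gamma\medop\sum_{z^{\prime}\in\mathsf{Z}}P_{\mu}(z^{\prime}\mid z,a)\,\Omega^{\star}(Q_{\mu}(z^{\prime},\cdot)).$$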

Then, our main result is the following:

Theorem 1

Under Assumptions 1 and 2, the RASQL iteration of Sec. III-B converges to $Q_{\mu}$ almost surely. □

Proof

The proof is given in Appendix A. ■

Remark 1

Note that Proposition 1 implies that the “greedy” regularized policy with respect to the limit point of $\{Q_{t}\}_{t\geq 1}$ is given by $\pi^{*}(\cdot\mid z)=\nabla\Omega^{\star}(Q_{\mu}(z,\cdot))$ , which typically lies in the interior of $\Delta(\mathsf{A})$ for each $z$ . Thus, the greedy policy is stochastic. This is a big advantage of RASQL compared to ASQL because in ASQL, the greedy policy corresponding to the limit point of the Q-learning iteration is deterministic. As shown in [40] (also see [9, 12]), in general for POMDPs with agent-state-based controllers, stochastic stationary policies can outperform deterministic stationary policies. □
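To make the contrast in Remark 1 concrete, consider a hypothetical agent state $z$ with two actions and limit values $Q_{\mu}(z,\cdot)=(1,0)$ under entropy regularization. These numbers are chosen only for illustration. The regularized greedy policy is the soft-max, which assigns positive probability to both actions, and it approaches the deterministic arg-max only as $\beta$ grows.

```latex
\pi^{*}(\cdot \mid z)
= \Bigl(\frac{e^{\beta}}{1+e^{\beta}},\ \frac{1}{1+e^{\beta}}\Bigr),
\qquad
\beta = 1:\ \pi^{*}(\cdot \mid z) \approx (0.731,\ 0.269),
\qquad
\beta = 10:\ \pi^{*}(\cdot \mid z) \approx (0.99995,\ 0.00005).
```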

V Regularized periodic Q-learning

The idea of periodic Q-learning has been explored in [12], where it is shown that periodic policies can perform better than stationary policies when the agent state is not an information state. Regularized periodic Q-learning generalizes regularized Q-learning, since taking the period $L=1$ recovers the stationary setting.

We consider the convergence properties of the following regularized periodic agent-state-based Q-learning (RePASQL) update for $\ell\in[L]$:

$$Q^{\ell}_{t+1}(z,a)=Q^{\ell}_{t}(z,a)+\alpha^{\ell}_{t}(z,a)\bigl[r_{t}+\gamma\,\Omega^{\star}(Q^{\llbracket\ell+1\rrbracket}_{t}(z^{\prime},\cdot))-Q^{\ell}_{t}(z,a)\bigr].$$

Assumption 3

For all $(\ell,z,a)$ , the learning rates $\{\alpha^{\ell}_{t}(z,a)\}_{t\geq 1}$ are measurable with respect to the sigma-algebra generated by $(z_{1:t},a_{1:t})$ and satisfy $\alpha^{\ell}_{t}(z,a)=0$ if $(\ell,z,a)\neq(\llbracket t\rrbracket,z_{t},a_{t})$ . Moreover, $\sum_{t\geq 1}\alpha^{\ell}_{t}(z,a)=\infty$ and $\sum_{t\geq 1}(\alpha^{\ell}_{t}(z,a))^{2}<\infty$ , almost surely.

Assumption 4

The behavior/exploration policy $\mu=\{\mu^{\ell}\}_{\ell\in[L]}$ is such that the Markov chain $\{(S_{t},Y_{t},Z_{t},A_{t})\}_{t\geq 1}$ converges to a limiting periodic distribution $\zeta_{\mu}^{\ell}$ , where $\sum_{(s,y)}\zeta_{\mu}^{\ell}(s,y,z,a)>0$ for all $(\ell,z,a)$ (i.e., all $(\ell,z,a)$ are visited infinitely often).

By considering this limiting distribution w.r.t. the original model’s rewards and dynamics, we can construct an artificial MDP on the agent state for each $\ell\in[L]$ , which has the following rewards and dynamics:

[TABLE]

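We can now extend the techniques used for regularized MDPs in Sec. II-B to this setting by defining a regularized Bellman operator $\mathcal{B}_{\mu}^{\ell}$ on an arbitrary Q-function $Q\in\mathds{R}^{\lvert\mathsf{Z}\rvert\times\lvert\mathsf{A}\rvert}$ as follows:

$$\mathcal{B}^{\ell}_{\mu}Q(z,a)=r^{\ell}_{\mu}(z,a)+\gamma\medop\sum_{z^{\prime}\in\mathsf{Z}}P^{\ell}_{\mu}(z^{\prime}\mid z,a)\,\Omega^{\star}(Q(z^{\prime},\cdot)).$$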

Next define the composition of the sequence of $L$ Bellman operators corresponding to cycle $\ell$ as is done in [12].

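$$\mathbb{B}^{\ell}_{\mu}=\mathcal{B}^{\ell}_{\mu}\mathcal{B}^{\llbracket\ell+1\rrbracket}_{\mu}\cdots\mathcal{B}^{\llbracket\ell+L-1\rrbracket}_{\mu}.$$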

Then we can apply Prop. 1 to $\mathbb{B}_{\mu}^{\ell}$ . In addition, considering the periodicity of the operators, the same approach followed in [12] can be used to show that $\mathbb{B}_{\mu}^{\ell}$ is a contraction and therefore has a unique fixed point denoted by $Q^{\ell}_{\mu}$ which is given by

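$$Q^{\ell}_{\mu}(z,a)=r^{\ell}_{\mu}(z,a)+\gamma\medop\sum_{z^{\prime}\in\mathsf{Z}}P^{\ell}_{\mu}(z^{\prime}\mid z,a)\,V^{\llbracket\ell+1\rrbracket}_{\mu}(z^{\prime}).$$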

Theorem 2

Under Assumptions 3 and 4, the RePASQL iteration of Sec. V converges to $\{Q^{\ell}_{\mu}\}_{\ell\in[L]}$ almost surely. □

Proof

The proof is given in Appendix B. ■
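The coupled fixed points $\{Q^{\ell}_{\mu}\}_{\ell\in[L]}$ can be approximated by sweeping through the phases in reverse order, which applies the composed operator once per sweep. The sketch below is illustrative only. It assumes the entropy regularizer and takes the value at phase $\llbracket\ell+1\rrbracket$ to be $\Omega^{\star}$ applied to $Q^{\llbracket\ell+1\rrbracket}(z^{\prime},\cdot)$, consistent with the per-phase Bellman operator. The per-phase rewards `r` and dynamics `P` are hypothetical numbers, not the ones induced by the paper's example.

```python
import math

def soft_value(q_row, beta):
    """Omega*(q) for the entropy regularizer: (1/beta) * ln(sum_a exp(beta * q(a)))."""
    m = max(q_row)
    return m + math.log(sum(math.exp(beta * (x - m)) for x in q_row)) / beta

def phase_backup(Q_next, r_l, P_l, gamma, beta):
    """Apply the phase-l operator: r_l(z,a) + gamma * sum_z' P_l(z'|z,a) * Omega*(Q_next(z',.))."""
    n_z = len(Q_next)
    V = [soft_value(Q_next[z2], beta) for z2 in range(n_z)]
    return [
        [r_l[z][a] + gamma * sum(P_l[z][a][z2] * V[z2] for z2 in range(n_z))
         for a in range(len(r_l[z]))]
        for z in range(n_z)
    ]

def periodic_fixed_point(r, P, gamma, beta, sweeps=2000):
    """Return Q[l][z][a] approximately satisfying Q[l] = backup of Q[(l+1) mod L] for every l."""
    L = len(r)
    Q = [[[0.0] * len(row) for row in r[l]] for l in range(L)]
    for _ in range(sweeps):
        for l in reversed(range(L)):  # one full sweep contracts by gamma ** L
            Q[l] = phase_backup(Q[(l + 1) % L], r[l], P[l], gamma, beta)
    return Q

def max_residual(Q, r, P, gamma, beta):
    """Largest violation of the coupled fixed-point equations over all (l, z, a)."""
    L = len(Q)
    worst = 0.0
    for l in range(L):
        backed = phase_backup(Q[(l + 1) % L], r[l], P[l], gamma, beta)
        for row_q, row_b in zip(Q[l], backed):
            for x, y in zip(row_q, row_b):
                worst = max(worst, abs(x - y))
    return worst

# Hypothetical period-2 model with two agent states and two actions.
r = [[[1.0, 0.0], [0.0, 0.5]],
     [[0.2, 0.4], [0.6, 0.0]]]
P = [[[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]],
     [[[0.3, 0.7], [0.6, 0.4]], [[0.8, 0.2], [0.5, 0.5]]]]

Q_periodic = periodic_fixed_point(r, P, gamma=0.9, beta=1.0)
residual = max_residual(Q_periodic, r, P, gamma=0.9, beta=1.0)
```

With $L=1$ the same routine reduces to ordinary regularized value iteration, mirroring the observation that the stationary setting is the special case of period one.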

VI Numerical example

In this section, we present an example to highlight the salient features of our results. First, we describe the POMDP model.

VI-A POMDP model

Consider a POMDP with $\mathsf{S}=\{0,1,2,3\},\mathsf{A}=\{0,1\},\mathsf{Y}=\{0,1\}$ and $\gamma=0.9$ . The start state distribution is given by

$$\rho(s)=[0.3,\ 0.0,\ 0.2,\ 0.5].$$

Now consider the reward and transitions when $a=0$ :

[TABLE]

Note that $s,s^{\prime}$ (state, next state) corresponds to the rows, columns of $P$ , respectively. Next, when $a=1$

[TABLE]

Finally, the observation function maps $s\in\{0,3\}$ to $y=0$ and $s\in\{1,2\}$ to $y=1$.

VI-B Regularized agent-state-based Q-learning (RASQL) experiment

To provide a simple illustration, we fix the agent state to be the observation of the agent, i.e., $z_{t}=y_{t}$. The theoretical results, however, hold for the general agent-state update rule given in Definition 2. Consider the following fixed exploration policy:

$$\mu(a\mid z)=\begin{bmatrix}0.2&0.8\\0.8&0.2\end{bmatrix}.$$

Note that $z,a$ (observation, action) corresponds to the rows, columns of $\mu$ , respectively.

Using $\mu$, we run the RASQL update of Sec. III-B on the given POMDP for $10^{5}$ timesteps with regularization coefficient $\beta=1.0$, repeated over $25$ random seeds. Fig. 1 shows the median and quartiles over the $25$ seeds of the iterates $\{Q_{t}(z,a)\}_{t\geq 1}$ for each $(z,a)$ pair, together with the corresponding theoretical limits $Q_{\mu}(z,a)$ (computed using Theorem 1). The salient features of these results are as follows:

• RASQL converges to the theoretical limit predicted by Theorem 1.

• The limit $Q_{\mu}$ depends on the exploration policy $\mu$.

Thus, it can be seen from this example that we can precisely characterize the limits of convergence when using regularized Q-learning with an agent-state-based representation.
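The per-timestep median and quartiles across seeds can be computed with the Python standard library. The following is a minimal sketch: `summarize_runs` and the toy data are hypothetical, and the choice of the inclusive quantile method is an assumption, since the paper does not state how the quartiles were computed.

```python
import statistics

def summarize_runs(runs):
    """Summarize iterates across seeds.

    runs[k][t] is a scalar iterate, e.g. Q_t(z, a) for one fixed (z, a), from seed k
    at time t. Returns one (lower quartile, median, upper quartile) tuple per time step.
    """
    summary = []
    for values_at_t in zip(*runs):  # group the seeds' values at each time step
        q1, med, q3 = statistics.quantiles(values_at_t, n=4, method="inclusive")
        summary.append((q1, med, q3))
    return summary

# Toy data: 5 seeds, 2 time steps (illustrative numbers only).
runs = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
summary = summarize_runs(runs)
```

Plotting the median as a line, the quartiles as a shaded band, and the theoretical limit as a horizontal line gives a figure of the kind described above.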

VI-C Regularized periodic agent-state-based Q-learning (RePASQL) experiment

Similar to the RASQL experiment, we fix the agent state to be the observation of the agent, i.e., $z_{t}=y_{t}$ . Consider the following fixed periodic exploration policy for period $L=2$ :

[TABLE]

Using $\mu^{\ell}$ , we run $25$ random seeds on the given POMDP and we perform the RePASQL update (V) with a regularization coefficient $(\beta)=1.0$ for $10^{5}$ timesteps/iterations. We plot the median and quartiles from $25$ seeds of the iterates $\{Q^{\ell}_{t}(z,a)\}_{t\geq 1}$ for each $(\ell,z,a)$ pair as well as their corresponding theoretical limits $Q^{\ell}_{\mu}(z,a)$ (computed using Theorem 2) are shown in Fig. 2. The salient features of these results are as follows:

• RePASQL converges to the theoretical limit predicted by Theorem 2.

• The limits $\{Q^{\ell}_{\mu}\}_{\ell\in[L]}$ depend on the periodic exploration policy $\{\mu^{\ell}\}_{\ell\in[L]}$.

Thus, this example shows that the limits of convergence of RePASQL can also be characterized precisely.
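A periodic variant can be sketched along the same lines. This is an illustration under assumptions rather than the authors' implementation: one Q-table is kept per phase $\ell=t \bmod L$ (phases indexed from $0$), the table of phase $\ell$ bootstraps from the table of phase $\llbracket\ell+1\rrbracket$, the regularizer is the $\beta$-scaled negative entropy, and the step size is one over the visit count. The model and the policies `mus` are hypothetical.

```python
import numpy as np

def soft_value(q, beta):
    """beta * log(sum_a exp(q_a / beta)), computed stably."""
    m = np.max(q)
    return m + beta * np.log(np.sum(np.exp((q - m) / beta)))

def run_repasql(P, O, R, mus, gamma=0.9, beta=1.0, steps=100_000, seed=0):
    """Periodic regularized Q-learning; mus[l][z, a] is used when t mod L == l."""
    rng = np.random.default_rng(seed)
    L = len(mus)
    n_s, n_a, _ = P.shape
    n_z = O.shape[1]
    Q = np.zeros((L, n_z, n_a))
    visits = np.zeros((L, n_z, n_a))
    s = rng.integers(n_s)
    z = rng.choice(n_z, p=O[s])
    for t in range(steps):
        l, l_next = t % L, (t + 1) % L
        a = rng.choice(n_a, p=mus[l][z])
        s_next = rng.choice(n_s, p=P[s, a])
        z_next = rng.choice(n_z, p=O[s_next])
        visits[l, z, a] += 1
        alpha = 1.0 / visits[l, z, a]  # assumed step-size schedule
        # Bootstrap from the Q-table of the next phase.
        target = R[s, a] + gamma * soft_value(Q[l_next, z_next], beta)
        Q[l, z, a] += alpha * (target - Q[l, z, a])
        s, z = s_next, z_next
    return Q

# Hypothetical model and period-2 exploration policy (not those of the paper).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
O = np.array([[0.8, 0.2], [0.3, 0.7]])
R = np.array([[1.0, 0.0], [0.0, 0.5]])
mus = np.array([[[0.6, 0.4], [0.3, 0.7]],
                [[0.2, 0.8], [0.5, 0.5]]])

Q = run_repasql(P, O, R, mus, steps=5_000)
print(np.round(Q, 3))
```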

VII Conclusions

In this work, we present theoretical results on the convergence of regularized agent-state-based Q-learning (RASQL) under some standard assumptions from the literature. In particular, we show that RASQL converges, and we characterize the solution that it converges to as a function of the model parameters and the choice of exploration policy. We illustrate these ideas on a small POMDP example and show that the Q-learning iterates of RASQL match the calculated theoretical limit. We also generalize these ideas to the periodic setting and demonstrate the theoretical and empirical convergence of RePASQL. In doing so, we are able to understand how regularization works when combined with Q-learning for POMDPs that have an agent state that is not an information state.

A noteworthy issue with RASQL/RePASQL is that they inherit the limitations of their predecessors, ASQL and PASQL. In particular, while we are able to prove convergence and characterize the converged solution of RASQL/RePASQL, we cannot guarantee convergence to the optimal agent-state-based solution; whether this happens largely depends on the choice of exploration policy and the POMDP dynamics. Even so, since regularization is an important component of several empirical works concerning POMDPs with agent states that are not information states, we find it useful to establish theoretical properties on the convergence of such algorithms.

-A Proof of Theorem 1

The proof of Theorem 1 is similar to the arguments given in [10, 11, 12, 13].

Define the error between the Q-learning iterate and its limit as $\Delta_{t+1}\coloneqq Q_{t+1}-Q_{\mu}$. Then, combine (III-B), (13) and (11) as follows for all $(z,a)$.

[TABLE]

where

[TABLE]

Note that we are adding the term $\gamma\Omega^{\star}(Q_{\mu}(Z_{t+1},\cdot))\mathds{1}_{\{Z_{t}=z,A_{t}=a\}}$ in $U^{1}_{t}(z,a)$ and subtracting it in $U^{2}_{t}(z,a)$ . We can now view (20) as a linear system with state $\Delta_{t}$ and three inputs $U^{0}_{t}(z,a),U^{1}_{t}(z,a)$ and $U^{2}_{t}(z,a)$ . Using the linearity, we can now split the state into three components $\Delta_{t+1}=X^{0}_{t+1}+X^{1}_{t+1}+X^{2}_{t+1}$ , where the components evolve as follows for $i\in\{0,1,2\}$ :

[TABLE]
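The splitting step relies only on superposition. The sketch below illustrates it on a scalar recursion, assuming that (20) has the standard stochastic-approximation form $\Delta_{t+1}=(1-\alpha_{t})\Delta_{t}+\alpha_{t}(U^{0}_{t}+U^{1}_{t}+U^{2}_{t})$ and that each component obeys the same recursion driven by its own input. The random inputs and the choice to assign the whole initial error to component $0$ are assumptions made for illustration only.

```python
import numpy as np

def split_linear_recursion(alphas, inputs, delta0):
    """Run the full recursion and its three per-input components side by side.

    alphas: shape (T,), step sizes in [0, 1]
    inputs: shape (T, 3), the three input sequences U0, U1, U2
    Returns the final Delta and the final components (X0, X1, X2).
    """
    delta = float(delta0)
    # Assumption: the whole initial error is assigned to component 0.
    comps = np.array([delta0, 0.0, 0.0], dtype=float)
    for a, u in zip(alphas, inputs):
        delta = (1 - a) * delta + a * u.sum()   # recursion driven by the summed input
        comps = (1 - a) * comps + a * u         # same recursion, one input per component
    return delta, comps

rng = np.random.default_rng(0)
T = 500
alphas = 1.0 / np.arange(2, T + 2)
inputs = rng.normal(size=(T, 3))
delta, comps = split_linear_recursion(alphas, inputs, delta0=2.0)
print(delta, comps.sum())
```

Because the recursion is linear in the state and the inputs, the sum of the components reproduces the full error at every step, which is what allows each component to be analyzed separately.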

We will now separately show each $\lVert X^{i}_{t}\rVert\to 0$ .

Convergence of component $X^{0}_{t}$

The proof for the convergence of component $X^{0}_{t}$ is similar to that given in [12].

Convergence of component $X^{1}_{t}$

The proof for the convergence of component $X^{1}_{t}$ is based on the argument given in [12]. Let $W_{t}$ denote the tuple $(S_{t},Z_{t},A_{t},S_{t+1},Z_{t+1},A_{t+1})$ . Note that $\{W_{t}\}_{t\geq 1}$ is a Markov chain and converges to a limiting distribution $\bar{\zeta}_{\mu}$ , where

[TABLE]

We use $\bar{\zeta}_{\mu}(s,z,a,\mathcal{S},\mathcal{Z},\mathcal{A})$ to denote the marginalization over the “future states” and a similar notation for other marginalizations. Note that $\bar{\zeta}_{\mu}(s,z,a,\mathcal{S},\mathcal{Z},\mathcal{A})=\zeta_{\mu}(s,z,a)$ .

Define $V_{t}$ as the value function associated with $Q_{t}$, i.e., $V_{t}(z)\coloneqq\Omega^{\star}(Q_{t}(z,\cdot))$. Fix $(z_{\circ},a_{\circ})\in\mathsf{Z}\times\mathsf{A}$ and define

[TABLE]

Then the process $\{X^{1}_{t}(z,a)\}_{t\geq 1}$ is given by the stochastic iteration

[TABLE]

As argued earlier, the process $\{W_{t}\}_{t\geq 1}$ is a Markov chain. Due to Assm. 1, the learning rate $\alpha_{t}(z_{\circ},a_{\circ})$ is measurable with respect to the sigma-algebra generated by $(Z_{1:t},A_{1:t})$ and is therefore also measurable with respect to the sigma-algebra generated by $W_{1:t}$ . Thus, the learning rates $\{\alpha_{t}(z_{\circ},a_{\circ})\}_{t\geq 1}$ satisfy the conditions of Theorem $2.7$ from [41]. Therefore, the theorem implies that $\{X^{1}_{t}(z_{\circ},a_{\circ})\}_{t\geq 1}$ converges a.s. to the following limit

[TABLE]

where the last step follows from the fact that $\bar{\zeta}_{\mu}(\mathsf{S},z_{\circ},a_{\circ},\mathsf{S},\mathcal{Z},\mathcal{A})=\zeta_{\mu}(z_{\circ},a_{\circ})$ and $\bar{\zeta}_{\mu}(\mathsf{S},z_{\circ},a_{\circ},\mathsf{S},z^{\prime},\mathsf{A})=\zeta_{\mu}(z_{\circ},a_{\circ})P_{\mu}(z^{\prime}|z_{\circ},a_{\circ})$ .
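The quantities $\zeta_{\mu}$ and $P_{\mu}$ appearing in this limit can be computed numerically for a small model. The sketch below assumes $z_{t}=y_{t}$, an entropy regularizer with $\Omega^{\star}(q)=\beta\log\sum_{a}\exp(q_{a}/\beta)$, and that the regularized MDP of Theorem 1 has reward $r_{\mu}(z,a)=\sum_{s}\zeta_{\mu}(s|z,a)\,r(s,a)$ and dynamics $P_{\mu}(z^{\prime}|z,a)=\sum_{s}\zeta_{\mu}(s|z,a)\,P(z^{\prime}|s,a)$. The function names and the toy model are hypothetical.

```python
import numpy as np

def stationary_distribution(T):
    """Stationary distribution of an irreducible row-stochastic matrix T."""
    n = T.shape[0]
    A = np.vstack([T.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def regularized_limit(P, O, R, mu, gamma=0.9, beta=1.0, iters=2000):
    n_s, n_a, _ = P.shape
    n_z = O.shape[1]
    # Chain on (s, z, a): the next tuple (s2, z2, a2) has probability
    # P[s, a, s2] * O[s2, z2] * mu[z2, a2], independent of the current z.
    step = np.einsum("sap,pz,zb->sapzb", P, O, mu)
    T = np.broadcast_to(step[:, None], (n_s, n_z, n_a, n_s, n_z, n_a))
    N = n_s * n_z * n_a
    zeta = stationary_distribution(T.reshape(N, N)).reshape(n_s, n_z, n_a)
    cond = zeta / zeta.sum(axis=0, keepdims=True)   # zeta(s | z, a)
    r_mu = np.einsum("sza,sa->za", cond, R)
    Pz = np.einsum("sap,pz->saz", P, O)             # P(z2 | s, a)
    P_mu = np.einsum("sza,say->zay", cond, Pz)      # P_mu(z2 | z, a)
    Q = np.zeros((n_z, n_a))
    for _ in range(iters):                          # regularized value iteration
        m = Q.max(axis=1)
        V = m + beta * np.log(np.exp((Q - m[:, None]) / beta).sum(axis=1))
        Q = r_mu + gamma * P_mu @ V
    return Q, r_mu, P_mu, zeta

# Hypothetical model and exploration policy (not those of the paper).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
O = np.array([[0.8, 0.2], [0.3, 0.7]])
R = np.array([[1.0, 0.0], [0.0, 0.5]])
mu = np.array([[0.6, 0.4], [0.3, 0.7]])

Q_mu, r_mu, P_mu, zeta = regularized_limit(P, O, R, mu)
print(np.round(Q_mu, 3))
```

Changing `mu` changes `zeta`, and through it `r_mu`, `P_mu` and the fixed point, which is the dependence on the exploration policy noted in the numerical example.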

Convergence of component $X^{2}_{t}$

The convergence of the $X^{2}_{t}$ component is based on [11, 12] but requires some additional considerations due to the regularization term. We start by defining:

[TABLE]

In the previous steps, we have shown that $\lVert X^{i}_{t}\rVert\to 0$ a.s., for $i\in\{0,1\}$. Thus, we have that $\lVert X^{0}_{t}+X^{1}_{t}\rVert\to 0$ a.s. Arbitrarily fix an $\epsilon>0$. Therefore, there exists a set $\Omega^{1}$ of measure one and a constant $T(\omega,\epsilon)$ such that for $\omega\in\Omega^{1}$, all $t>T(\omega,\epsilon)$, and $(z,a)\in\mathsf{Z}\times\mathsf{A}$, we have

[TABLE]

Now pick a constant $C$ such that

[TABLE]

Suppose for some $t>T(\omega,\epsilon)$ , $\lVert X^{2}_{t}\rVert>C\epsilon$ . Then, for $(z,a)\in\mathsf{Z}\times\mathsf{A}$ ,

[TABLE]

where $(a)$ follows from the fact that we replace the argmax $\pi^{\star}$ with a different argument $\pi_{t}$ in the second term, $(b)$ follows from maximizing over all realizations of $Z_{t+1}$ and $a\in\mathsf{A}$ , $(c)$ follows from (23), $(d)$ follows from $\lVert X^{2}_{t}\rVert>C\epsilon$ , $(e)$ follows from (24). Thus, for any $t>T(\omega,\epsilon)$ and $\lVert X^{2}_{t}\rVert>C\epsilon$ :

[TABLE]
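Steps $(a)$ and $(b)$ rest on a non-expansiveness property of the conjugate. For any $\pi\in\Delta$ we have $\Omega^{\star}(q^{\prime})\geq\langle\pi,q^{\prime}\rangle-\Omega(\pi)$, so taking $\pi=\nabla\Omega^{\star}(q)$ gives $\Omega^{\star}(q)-\Omega^{\star}(q^{\prime})\leq\langle\pi,q-q^{\prime}\rangle\leq\lVert q-q^{\prime}\rVert_{\infty}$. The following numerical check illustrates both inequalities, assuming the $\beta$-scaled negative-entropy regularizer, for which $\nabla\Omega^{\star}$ is the softmax. The function names are hypothetical.

```python
import numpy as np

def omega_star(q, beta=1.0):
    """beta * log(sum_a exp(q_a / beta)), computed stably."""
    m = np.max(q)
    return m + beta * np.log(np.sum(np.exp((q - m) / beta)))

def grad_omega_star(q, beta=1.0):
    """Softmax policy, the gradient of omega_star."""
    w = np.exp((q - np.max(q)) / beta)
    return w / w.sum()

def max_violation(n_actions=4, trials=1000, beta=1.0, seed=0):
    """Largest observed violation of the two inequalities (<= 0 means none)."""
    rng = np.random.default_rng(seed)
    worst = -np.inf
    for _ in range(trials):
        q1 = rng.normal(scale=5.0, size=n_actions)
        q2 = rng.normal(scale=5.0, size=n_actions)
        diff = omega_star(q1, beta) - omega_star(q2, beta)
        pi1 = grad_omega_star(q1, beta)
        sup = np.max(np.abs(q1 - q2))
        # First term: diff <= <pi1, q1 - q2>; second term: |diff| <= sup norm.
        worst = max(worst, diff - pi1 @ (q1 - q2), abs(diff) - sup)
    return worst

print(max_violation() <= 1e-9)
```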

Hence, when $\lVert X^{2}_{t}\rVert>C\epsilon$, it decreases monotonically with time. Hence, there are two possibilities: either

(i) $\lVert X^{2}_{t}\rVert$ always remains above $C\epsilon$; or

(ii) it goes below $C\epsilon$ at some stage.

We consider these two possibilities separately.

Possibility (i): $\lVert X^{2}_{t}\rVert$ always remains above $C\epsilon$

We will now prove that $\lVert X^{2}_{t}\rVert$ cannot remain above $C\epsilon$ forever. The proof is by contradiction. Suppose $\lVert X^{2}_{t}\rVert$ remains above $C\epsilon$ forever. As argued earlier, this implies that $\lVert X^{2}_{t}\rVert$ , $t\geq T(\omega,\epsilon)$ , is a strictly decreasing sequence, so it must be bounded from above. Let $B^{(0)}$ be such that $\lVert X^{2}_{t}\rVert\leq B^{(0)}$ for all $t\geq T(\omega,\epsilon)$ . Eq. (25i) implies that $\lVert U^{2}_{t}\rVert<\kappa B^{(0)}$ . Then, we have for all $(z,a)\in\mathsf{Z}\times\mathsf{A}$ that

[TABLE]

which implies that $\lVert X^{2}_{t}\rVert\leq\lVert M^{(0)}_{t}\rVert$ , where $\{M^{(0)}_{t}\}_{t\geq T(\omega,\epsilon)}$ is a sequence given by

[TABLE]

Theorem $2.4$ from [41] implies that $M^{(0)}_{t}(z,a)\to\kappa B^{(0)}$ and hence $\lVert M^{(0)}_{t}\rVert\to\kappa B^{(0)}$ . Now pick an arbitrary $\bar{\epsilon}\in(0,(1-\kappa)C\epsilon)$ . Thus, there exists a time $T^{(1)}=T^{(1)}(\omega,\epsilon,\bar{\epsilon})$ such that for all $t>T^{(1)}$ , $\lVert M^{(0)}_{t}\rVert\leq B^{(1)}\coloneqq\kappa B^{(0)}+\bar{\epsilon}$ . Since $\lVert X^{2}_{t}\rVert$ is bounded by $\lVert M^{(0)}_{t}\rVert$ , this implies that for all $t>T^{(1)}$ , $\lVert X^{2}_{t}\rVert\leq B^{(1)}$ and, by (25i), $\lVert U^{2}_{t}\rVert\leq\kappa B^{(1)}$ . By repeating the above argument, there exists a time $T^{(2)}$ such that for all $t\geq T^{(2)}$ ,

[TABLE]

and so on. By (24), $\kappa<1$, and $\bar{\epsilon}$ is chosen to be less than $(1-\kappa)C\epsilon$. So eventually, $B^{(m)}\coloneqq\kappa^{m}B^{(0)}+\kappa^{m-1}\bar{\epsilon}+\cdots+\bar{\epsilon}$ must get below $C\epsilon$ for some $m$, contradicting the assumption that $\lVert X^{2}_{t}\rVert$ remains above $C\epsilon$ forever.
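The claim that $B^{(m)}$ eventually drops below $C\epsilon$ can be verified by summing the geometric series, using only $\kappa<1$ and the choice $\bar{\epsilon}\in(0,(1-\kappa)C\epsilon)$ made above:

```latex
\begin{align*}
B^{(m)} &= \kappa^{m} B^{(0)} + \bar{\epsilon}\sum_{j=0}^{m-1}\kappa^{j}
         = \kappa^{m} B^{(0)} + \bar{\epsilon}\,\frac{1-\kappa^{m}}{1-\kappa}, \\
\lim_{m\to\infty} B^{(m)} &= \frac{\bar{\epsilon}}{1-\kappa}
         < \frac{(1-\kappa)\,C\epsilon}{1-\kappa} = C\epsilon .
\end{align*}
```

Since the limit is strictly below $C\epsilon$, there is a finite $m$ with $B^{(m)}<C\epsilon$.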

Possibility (ii): $\lVert X^{2}_{t}\rVert$ goes below $C\epsilon$ at some stage

Suppose that there is some $t>T(\omega,\epsilon)$ such that $\lVert X^{2}_{t}\rVert<C\epsilon$ . Then (25g), (25h) and (24) imply that

[TABLE]

Therefore,

[TABLE]

where the last inequality uses the fact that both $\lVert U^{2}_{t}\rVert$ and $\lVert X^{2}_{t}\rVert$ are below $C\epsilon$. Thus, we have that

[TABLE]

Hence, once $\lVert X^{2}_{t}\rVert$ goes below $C\epsilon$, it stays there.

Implication

We have shown that for sufficiently large $t>T(\omega,\epsilon)$, $\lVert X^{2}_{t}\rVert<C\epsilon$. Since $\epsilon$ is arbitrary, this means that for all realizations $\omega\in\Omega^{1}$, $\lVert X^{2}_{t}\rVert\to 0$. Thus,

[TABLE]

Putting everything together

Recall that we initially defined $\Delta_{t}=Q_{t}-Q_{\mu}$ and we split $\Delta_{t}=X^{0}_{t}+X^{1}_{t}+X^{2}_{t}$ . Steps $a)$ and $b)$ together show that $\lVert X^{0}_{t}+X^{1}_{t}\rVert\to 0$ , a.s. and Step $c)$ (31) shows us that $\lVert X^{2}_{t}\rVert\to 0$ , a.s. Thus, by the triangle inequality,

[TABLE]

which establishes that $Q_{t}\to Q_{\mu}$ , a.s.

-B Proof of Theorem 2

The proof follows a style similar to that used in [12]. Define the error between the Q-learning iterate and its limit as $\Delta^{\ell}_{t+1}\coloneqq Q^{\ell}_{t+1}-Q^{\ell}_{\mu}$. Then, combine (V), (19) and (15) as follows for all $(z,a)$.

[TABLE]

where

[TABLE]

Note that we are adding the term $\gamma\Omega^{\star}(Q^{\llbracket\ell+1\rrbracket}_{\mu}(Z_{t+1},\cdot))\mathds{1}_{\{Z_{t}=z,A_{t}=a\}}$ in $U^{\ell,1}_{t}(z,a)$ and subtracting it in $U^{\ell,2}_{t}(z,a)$ . We can now view (33) as a linear system with state $\Delta^{\ell}_{t}$ and three inputs $U^{\ell,0}_{t}(z,a),U^{\ell,1}_{t}(z,a)$ and $U^{\ell,2}_{t}(z,a)$ . Using the linearity, we can now split the state into three components $\Delta^{\ell}_{t+1}=X^{\ell,0}_{t+1}+X^{\ell,1}_{t+1}+X^{\ell,2}_{t+1}$ , where the components evolve as follows for $i\in\{0,1,2\}$ :

[TABLE]

We will now separately show each $\lVert X^{\ell,i}_{t}\rVert\to 0$ .

Convergence of component $X^{\ell,0}_{t}$

The proof for the convergence of component $X^{\ell,0}_{t}$ is similar to that given in [12]. The only difference from the RASQL proof of Theorem 1 is that the convergence $\lVert X^{\ell,0}_{t}\rVert\to 0$ has to be established for each $\ell\in[L]$. Note that this case is identical to the periodic case of [12], since the component $X^{\ell,0}_{t}$ does not involve any of the regularized terms.

The main result that is applied here is Proposition 4 from [12], which establishes the convergence of $\lVert X^{\ell,0}_{t}\rVert$ when the underlying Markov chain is periodic.

Convergence of component $X^{\ell,1}_{t}$

The proof for the convergence of component $X^{\ell,1}_{t}$ is based on the argument given in [12]. Let $W_{t}$ denote the tuple $(S_{t},Z_{t},A_{t},S_{t+1},Z_{t+1},A_{t+1})$ . Note that $\{W_{t}\}_{t\geq 1}$ is a periodic Markov chain and converges to a periodic limiting distribution $\bar{\zeta}^{\ell}_{\mu}$ , where

[TABLE]

We use $\bar{\zeta}^{\ell}_{\mu}(s,z,a,\mathcal{S},\mathcal{Z},\mathcal{A})$ to denote the marginalization over the “future states” and a similar notation for other marginalizations. Note that $\bar{\zeta}^{\ell}_{\mu}(s,z,a,\mathcal{S},\mathcal{Z},\mathcal{A})=\zeta_{\mu}^{\ell}(s,z,a)$ .
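The notion of a periodic limiting distribution can be illustrated on a generic chain whose transition matrix cycles with period $L$. In the sketch below, the distribution observed at times with phase $\ell$ converges to a phase-dependent limit, and consecutive limits are linked by the transition matrix of the current phase. The two-state matrices are hypothetical and stand in for the chain $\{W_{t}\}_{t\geq 1}$ induced by a periodic exploration policy.

```python
import numpy as np

def cyclic_limits(T_list, steps=2000):
    """Distributions at the last visited time of each phase, for d_{t+1} = d_t T_{t mod L}."""
    L = len(T_list)
    n = T_list[0].shape[0]
    d = np.full(n, 1.0 / n)          # arbitrary initial distribution
    limits = [None] * L
    for t in range(steps):
        limits[t % L] = d.copy()     # distribution at a time with phase t mod L
        d = d @ T_list[t % L]
    return limits

# Hypothetical strictly positive transition matrices for phases 0 and 1.
T0 = np.array([[0.9, 0.1], [0.2, 0.8]])
T1 = np.array([[0.3, 0.7], [0.6, 0.4]])

lim0, lim1 = cyclic_limits([T0, T1])
print(np.round(lim0, 4), np.round(lim1, 4))
```

In this toy case the phase-$0$ limit is the stationary distribution of the two-step matrix `T0 @ T1`, and the phase-$1$ limit is obtained from it by one step of `T0`.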

Define $V^{\llbracket\ell+1\rrbracket}_{t}$ as the value function associated with $Q^{\llbracket\ell+1\rrbracket}_{t}$, i.e., $V^{\llbracket\ell+1\rrbracket}_{t}(z)\coloneqq\Omega^{\star}(Q^{\llbracket\ell+1\rrbracket}_{t}(z,\cdot))$. Fix $(z_{\circ},a_{\circ})\in\mathsf{Z}\times\mathsf{A}$ and define

[TABLE]

Then the process $\{X^{\ell,1}_{t}(z,a)\}_{t\geq 1}$ is given by the stochastic iteration

[TABLE]

As mentioned earlier, the process $\{W_{t}\}_{t\geq 1}$ is a periodic Markov chain. By the periodic Markov chain result of Proposition 4 from [12], $\{X^{\ell,1}_{t}(z_{\circ},a_{\circ})\}_{t\geq 1}$ converges a.s. to the following periodic limits

[TABLE]

where the last step follows from the fact that $\bar{\zeta}^{\ell}_{\mu}(\mathsf{S},z_{\circ},a_{\circ},\mathsf{S},\mathcal{Z},\mathcal{A})=\zeta_{\mu}^{\ell}(z_{\circ},a_{\circ})$ and $\bar{\zeta}^{\ell}_{\mu}(\mathsf{S},z_{\circ},a_{\circ},\mathsf{S},z^{\prime},\mathsf{A})=\zeta_{\mu}^{\ell}(z_{\circ},a_{\circ})P^{\ell}_{\mu}(z^{\prime}|z_{\circ},a_{\circ})$ .

Convergence of component $X^{\ell,2}_{t}$

The convergence of the $X^{\ell,2}_{t}$ component is based on [11, 12] but requires some additional considerations due to the regularization term. We start by defining:

[TABLE]

In the previous steps, we have shown that $\lVert X^{\ell,i}_{t}\rVert\to 0$ a.s., for $i\in\{0,1\}$. Thus, we have that $\lVert X^{\ell,0}_{t}+X^{\ell,1}_{t}\rVert\to 0$ a.s. Arbitrarily fix an $\epsilon>0$. Therefore, there exists a set $\Omega^{1}$ of measure one and a constant $T(\omega,\epsilon)$ such that for $\omega\in\Omega^{1}$, all $t>T(\omega,\epsilon)$, and $(z,a)\in\mathsf{Z}\times\mathsf{A}$, we have

[TABLE]

Now pick a constant $C$ such that

[TABLE]

Suppose for some $t>T(\omega,\epsilon)$, $\lVert X^{\ell,2}_{t}\rVert>C\epsilon$. Then, for $(\ell,z,a)\in[L]\times\mathsf{Z}\times\mathsf{A}$,

[TABLE]

where $(a)$ follows from the fact that we replace the argmax $\pi^{\ell,\star}$ with a different argument $\pi^{\ell}_{t}$ in the second term, $(b)$ follows from maximizing over all realizations of $Z_{t+1}$ and $a\in\mathsf{A}$ , $(c)$ follows from (36), $(d)$ follows from $\lVert X^{\ell,2}_{t}\rVert>C\epsilon$ , $(e)$ follows from (37). Thus, for any $t>T(\omega,\epsilon)$ and $\lVert X^{\ell,2}_{t}\rVert>C\epsilon$ :

[TABLE]

Hence, when $\lVert X^{\ell,2}_{t}\rVert>C\epsilon$, it decreases monotonically with time. Hence, there are two possibilities: either

(i) $\lVert X^{\ell,2}_{t}\rVert$ always remains above $C\epsilon$; or

(ii) it goes below $C\epsilon$ at some stage.

These two cases must be considered separately. The proof follows exactly the same steps as the proof of Theorem 1 given in Appendix -A, which finally gives us:

[TABLE]

Putting everything together

Recall that we initially defined $\Delta^{\ell}_{t}=Q^{\ell}_{t}-Q^{\ell}_{\mu}$ and we split $\Delta^{\ell}_{t}=X^{\ell,0}_{t}+X^{\ell,1}_{t}+X^{\ell,2}_{t}$. Steps $a)$ and $b)$ together show that $\lVert X^{\ell,0}_{t}+X^{\ell,1}_{t}\rVert\to 0$, a.s. and Step $c)$ (39) shows us that $\lVert X^{\ell,2}_{t}\rVert\to 0$, a.s. Thus, by the triangle inequality,

[TABLE]

which establishes that $Q^{\ell}_{t}\to Q^{\ell}_{\mu}$ , a.s.

Bibliography

[1] K. J. Åström, “Optimal control of Markov processes with incomplete state information I,” Journal of Mathematical Analysis and Applications, vol. 10, pp. 174–205, 1965.
[2] R. D. Smallwood and E. J. Sondik, “The optimal control of partially observable Markov processes over a finite horizon,” Operations Research, vol. 21, no. 5, pp. 1071–1088, 1973.
[3] J. Subramanian, A. Sinha, R. Seraj, and A. Mahajan, “Approximate information state for approximate planning and reinforcement learning in partially observed systems,” J. Mach. Learn. Res., vol. 23, no. 12, pp. 1–83, 2022.
[4] M. J. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in AAAI Fall Symposia, vol. 45, 2015, p. 141.
[5] M. Igl, L. Zintgraf, T. A. Le, F. Wood, and S. Whiteson, “Deep variational reinforcement learning for POMDPs,” in Int. Conf. Mach. Learn. PMLR, 2018, pp. 2117–2126.
[6] P. Zhu, X. Li, P. Poupart, and G. Miao, “On improving deep reinforcement learning for POMDPs,” arXiv:1704.07978, 2017.
[7] L. Meng, R. Gorbet, and D. Kulić, “Memory-based deep reinforcement learning for POMDPs,” in Int. Conf. Intell. Robots Syst. IEEE, 2021, pp. 5619–5626.
[8] S. Dong, B. Van Roy, and Z. Zhou, “Simple agent, complex environment: Efficient reinforcement learning with agent states,” J. Mach. Learn. Res., vol. 23, no. 255, pp. 1–54, 2022.
