Policy Dispersion in Non-Markovian Environment

Bohao Qu; Xiaofeng Cao; Jielong Yang; Hechang Chen; Yi Chang; Ivor W. Tsang; Yew-Soon Ong

arXiv:2302.14509 · cs.LG · June 4, 2024

TL;DR

This paper introduces a transformer-based policy dispersion method for non-Markovian environments, enabling the learning of diverse, expressive policies that improve robustness and adaptability in reinforcement learning tasks.

Contribution

It proposes a novel policy dispersion scheme using transformer-based embeddings and a positive definite dispersion matrix to enhance policy diversity in non-Markovian settings.

Findings

01 Diverse policies lead to more robust performance.

02 The dispersion scheme outperforms recent baselines.

03 Positive definite dispersion matrix enlarges policy disagreements.

Tables

Table 1. The average number of times the five candidate policies of each method reach the target in the three versions of the FrozenLake-v1 environment.

Environment	D2PED-DQN	P3S-DQN	DvD-DQN	QD-DQN
FrozenLake-v1 $4 \times 4$ (4 holes)	2175	2034	1874	1582
FrozenLake-v1 $5 \times 5$ (10 holes)	1895	1798	1603	1489
FrozenLake-v1 $8 \times 8$ (28 holes)	1040	770	689	634

Table 2. The average reward of the five candidate policies of each method in two multi-mode environments.

Environment	Mode	D2PED-TD3	P3S-TD3	DvD-TD3	QD-TD3
HalfCheetah	Forward	5325	5014	-3591	4897
HalfCheetah	Backward	6716	-4452	6380	6045
Ant	Forward	4740	4454	4437	4033
Ant	Backward	4734	-3273	4090	4108

Table 3. The average number of times D2PED reaches the target when equipped with different policy embedding algorithms in the three versions of the FrozenLake-v1 environment.

Environment	Policy Representation Module	Auto-Encoder	Behavior Embedding
FrozenLake-v1 $4 \times 4$ (4 holes)	2175	1257	848
FrozenLake-v1 $5 \times 5$ (10 holes)	1895	808	653
FrozenLake-v1 $8 \times 8$ (28 holes)	1040	688	301

Table 4. The number of times D2PED reaches the target with different numbers of candidate policies $M$ in the three FrozenLake-v1 environments.

Environment	M = 3	M = 5	M = 7	M = 9
FrozenLake-v1 $4 \times 4$ (4 holes)	2243	2175	2120	2092
FrozenLake-v1 $5 \times 5$ (10 holes)	1812	1895	1862	1809
FrozenLake-v1 $8 \times 8$ (28 holes)	1021	1040	1012	995

Table 5. Policy embedding parameter configurations for every environment.

Hyperparameters	Point-v1	FrozenLake-v1	Humanoid-v2	HalfCheetah-v3	Hopper-v3
embedding dim	30	125	120	85	40
attention heads	4 / 8 / 4
learning rate	3e-4
attention depth	4 / 6 / 4
dropout	0.1

Table 6. Parameter configurations for the Point-v1 environment.

Hyperparameters	Point-v1
batch size	2048
minibatch size	256
λ	0.97
γ	0.995
learning rate	3e-4

Table 7. Parameter configurations for the three FrozenLake-v1 environments.

Hyperparameters	FrozenLake-v1
learning rate	0.001
γ	0.9
epsilon greedy	0.9
batch size	32

Table 8. Parameter configurations for the three MuJoCo environments.

Hyperparameters	Humanoid-v2	HalfCheetah-v3	Hopper-v3
learning rate	3e-4
γ	0.99
batch size	256 / 128
target noise	0.2
buffer size	1e6

Equations

$$J(\pi_{\theta})=\mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)].$$

$$J(\pi_{\theta})=(1-\lambda)\,\mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)]+\lambda\,\mathrm{Div}(\pi_{\theta})$$

$$i_{m}\sim\mathrm{Cat}\left(\left(\frac{\sigma(r_{i,m})^{\frac{1}{T}}}{\sum_{j=1}^{N}\sigma(r_{j,m})^{\frac{1}{T}}}\right)_{i=1}^{N}\right),$$

$$v_{i,m}=\mathrm{MSA}\big(\mathrm{LN}(\omega_{i,m}^{t})_{t=1}^{T}+E_{pos}\big)+(\omega_{i,m}^{t})_{t=1}^{T},$$

$$\xi_{i,m}=\mathrm{CL}(v_{i,m}),$$

$$L_{\phi}=-\frac{1}{N}\frac{1}{M}\sum_{i=1}^{N}\sum_{m=1}^{M}\big(y_{m}^{T}\log(\xi_{i,m})\big),$$

$$J(\Pi)=\left\{\sum_{m=1}^{M}\left[(1-\beta)J(\pi_{m})+\beta\,\mathrm{Div}(\{v_{m}\}_{m=1}^{M})\right]\right\}$$

$$0<\mathrm{Div}(\{v_{m}\}_{m=1}^{M})=\det(S)\leq\prod_{i=1}^{K}s_{ii}=\Lambda,$$

$$(1-\beta)\sum_{m=1}^{M}\tilde{J}(\tilde{\pi}_{m})+\beta\,\mathrm{Div}(\{\tilde{v}_{m}\}_{m=1}^{M})\leq(1-\beta)MR-(1-\beta)\Delta+\beta\Lambda.$$

$$(1-\beta)\sum_{m=1}^{M}\tilde{J}(\pi_{m})+\beta\,\mathrm{Div}(\{v_{m}\}_{m=1}^{M})\geq MR.$$

$$(1-\beta)\sum_{m=1}^{M}\tilde{J}(\tilde{\pi}_{m})+\beta\,\mathrm{Div}(\{\tilde{v}_{m}\}_{m=1}^{M})<(1-\beta)\sum_{m=1}^{M}\tilde{J}(\pi_{m})+\beta\,\mathrm{Div}(\{v_{m}\}_{m=1}^{M}).$$

$$V=\begin{bmatrix}v_{11}&v_{12}&\dots&v_{1K}\\ v_{21}&v_{22}&\dots&v_{2K}\\ \vdots&\vdots&\ddots&\vdots\\ v_{M1}&v_{M2}&\dots&v_{MK}\end{bmatrix}.$$

$$s_{ik}=\frac{1}{M-1}\sum_{m=1}^{M}(v_{mi}-\bar{v}_{i})(v_{mk}-\bar{v}_{k}),$$

$$S=\begin{bmatrix}s_{11}&s_{12}&\dots&s_{1K}\\ s_{12}&s_{22}&\dots&s_{2K}\\ \vdots&\vdots&\ddots&\vdots\\ s_{1K}&s_{2K}&\dots&s_{KK}\end{bmatrix}.$$

$$H=\frac{K}{2}(1+\ln(2\pi))+\frac{1}{2}\ln(\det(S)),$$

$$\{v:(v-\bar{v})^{\prime}S^{-1}(v-\bar{v})=c^{2}\},$$

$$\text{Volume of }\{v:(v-\bar{v})^{\prime}S^{-1}(v-\bar{v})\leq c^{2}\}=a_{x}(\det(S))^{\frac{1}{2}}c^{x}$$

Taxonomy

Topics: Reinforcement Learning in Robotics

Full text

Policy Dispersion in Non-Markovian Environment

Bohao Qu, Xiaofeng Cao, Jielong Yang, Hechang Chen, Yi Chang, Ivor W. Tsang, and Yew-Soon Ong

Bohao Qu, Xiaofeng Cao, Jielong Yang, Hechang Chen, and Yi Chang are with the School of Artificial Intelligence, Jilin University, Changchun, Jilin 130012, China, and also with the Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China. E-mail: [email protected], {xiaofengcao, chenhc, yichang}@jlu.edu.cn, [email protected].

Ivor W. Tsang is with the A*STAR Centre for Frontier AI Research, Singapore 138632, and also with the Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia. E-mail: ivor_[email protected].

Yew-Soon Ong is with Nanyang Technological University, Singapore. E-mail: [email protected]. He is also with A*STAR Centre for Frontier AI Research, Singapore. E-mail: Ong_Yew_[email protected].

Manuscript received April 19, 2005; revised August 26, 2015.

Abstract

A Markov Decision Process (MDP) provides a mathematical framework for formulating the learning processes of agents in reinforcement learning. An MDP is limited by the Markovian assumption that a reward depends only on the immediate state and action. However, a reward sometimes depends on the history of states and actions, in which case the decision process takes place in a non-Markovian environment. In such environments, agents receive rewards sparsely, for temporally-extended behaviors, and the learned policies may be similar. As a result, agents with similar policies generally overfit to the given task and cannot quickly adapt to perturbations of the environment. To resolve this problem, this paper learns diverse policies from the history of state-action pairs in a non-Markovian environment, using a policy dispersion scheme designed to seek diverse policy representations. Specifically, we first adopt a transformer-based method to learn policy embeddings. Then, we stack the policy embeddings to construct a dispersion matrix that induces a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings can effectively enlarge the disagreements across policies, yielding a diverse expression for the original policy embedding distribution. Experimental results show that this dispersion scheme obtains more expressive diverse policies, which in turn yield more robust performance than recent learning baselines across various learning environments.

Index Terms:

Policy diversity, non-Markovian environment, reinforcement learning, policy embedding, policy dispersion.

1 Introduction

Reinforcement Learning (RL) provides a mathematical formalism for an agent to learn a policy that maximizes the expected cumulative reward in a given environment. RL has achieved great success in learning efficient policies for given tasks, including board games [2, 3], poker games [4, 5], video games [6, 9, 10, 8, 7], autonomous control [11, 14, 12, 13], and robotic manipulation [15, 16, 17, 18]. An RL environment is typically formulated as a Markov decision process (MDP), which is limited by the Markovian assumption that a reward depends only on the immediate state and action. The policy is constantly updated during the learning process, so a policy updating trajectory is generated in the policy space.

Under an MDP, there are often many qualitatively different optimal or near-optimal policies that maximize the expected cumulative reward. Because prior knowledge of the environment is usually insufficient, learning diverse policies helps avoid converging to a single local solution and helps handle more irregular situations. A series of existing Quality-Diversity (QD) [19] related evolutionary algorithms have achieved good performance in exploring diverse behaviors and diverse policies [26, 20, 21, 22, 23, 25, 24]. The MAP-Elites algorithms [28, 27] solve this problem by discretizing the behavior description space into a grid of cells. Some theoretical results have also been developed to ensure that diverse policies are not obtained by sacrificing their effectiveness. The DvD [29] algorithm proves that, under a tabular MDP, multiple distinct optimal solutions can be obtained by maximizing the proposed loss function. The ridge rider algorithm [30] uses the eigenvectors of the Hessian matrix to discover diverse local optima with theoretical guarantees. [31] theoretically shows that maximizing the diversity metric based on the decision point process is guaranteed to enlarge the convex polytopes spanned by the policies of agents. However, the above methods neglect to capture reward-relevant historical information, which is essential in non-Markovian environments.

In most real-world problems, the rewards depend not only on the immediate state and the chosen action but also on the states the agent has visited and the actions it has performed. In such environments, the Markovian assumption (MA) does not hold, and the reward function has a temporal nature [32]: the agent receives its rewards sparsely, for complex, temporally-extended behaviors. For example, a robot should be rewarded for delivering coffee only if a user previously requested it. This example describes a non-Markovian reward of the kind recently studied in [33, 34, 35]. Non-Markovian rewards typically offer sparse supervision signals for the training process, meaning that the agent receives no reward, or zero reward, in most timesteps. This leads the agent to acquire similar policies. When the agent achieves the goal and gets a non-Markovian reward, it generally tends to memorize the action trajectory and overfit the given task, so it cannot quickly adapt to perturbations of the environment.

Proposal/Motivation. To resolve this problem, the key is to learn diverse policies from the history of state-action pairs in a non-Markovian environment. An inherent motivation is that maintaining a set of diverse policies leads to better exploration and quicker adaptation to perturbations of the environment [36]. In an RL task, the candidate policies of the agent usually have consistent initialization states, that is, their embeddings are aggregated in a single region of the policy embedding space. In this setting, learning diverse policies requires the policy embeddings to be sufficiently dispersed. During this dispersion process, the embeddings are repeatedly reconstructed as the policy update progresses, forming different dispersion trajectories (Fig. 1). To obtain diverse dispersion trajectories, maximizing the dispersion disagreements of the dynamic policy updating trajectories is a feasible and effective strategy.

Technical statement. Motivated by the above dispersion perspective, this paper presents a policy-efficient method, Discovering Diverse Policy via Embedding Dispersion (D2PED), which can efficiently learn high-quality policies with diverse behaviors in non-Markovian environments. Specifically, we design a policy dispersion scheme to disperse policy embeddings along different trajectories as the policy update progresses. The policy embeddings are then used to construct a dispersion matrix, which measures the diversity of the candidate policies and guides the update of the policies. Learning effective policy embeddings that capture the features of the non-Markovian environment is key to solving this problem. We use a Transformer-based architecture [37] to capture long-horizon dependencies in histories of state-action pairs, which include all reward-relevant historical information. Moreover, in non-Markovian environments, the agent receives its reward sparsely for complex actions over a long period of time [32], which is not conducive to training the Transformer-based policy representation model. We therefore construct a sample categorical distribution that samples trajectories with higher cumulative reward with higher probability. We also give the theoretical conditions under which policy diversity does not reduce policy effectiveness.

Contributions. Our contributions are summarized as follows:

$\bullet$ New Insight. We present a novel RL perspective that models the policy update as embedding dispersion, deriving different trajectories to explore policy diversity.

$\bullet$ New Scheme. Motivated by our new insight, we design a policy-efficient method, Discovering Diverse Policy via Embedding Dispersion (D2PED), which can efficiently learn high-quality policies with diverse behaviors in non-Markovian environments.

$\bullet$ New Technique. Under the new framework, we propose a policy dispersion scheme to disperse policy embeddings along different trajectories as the policy update progresses and use the embeddings to construct a dispersion matrix that guides the learning of diverse policies.

$\bullet$ Effective Analysis. To guarantee that the diversity of candidate policies can be achieved without sacrificing their effectiveness, we prove that if the dispersion matrix is positive definite, the optimal candidate policies correspond to effective disagreements across the policies’ dispersed embeddings.

$\bullet$ Efficient Performance. With our module design and theoretical analysis, our policy diversity learning scheme can be combined with any off-policy reinforcement learning algorithm, and experimental results in various environments show that it outperforms multiple recently proposed baselines in both non-Markovian and Markovian environments.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 introduces preliminary knowledge about the non-Markovian decision process and policy diversity learning. The methodology of policy dispersion is presented in Section 4, where Section 4.1 presents the policy representation module, which generates effective and diverse embeddings for different policies, and Section 4.2 describes the process of dispersing the dynamically updating trajectories over embedding reconstruction to learn diverse policies. In Section 5, we conduct various experiments to demonstrate the effectiveness and robustness of D2PED. We conclude this paper in Section 6.

2 Related Work

In non-Markovian environments, the reward function has a temporal nature, which means the agent receives its reward sparsely, for temporally-extended behaviors. This leads the agent to acquire similar policies and overfit the given task. To mitigate this problem, the agent needs to learn diverse policies, which in turn motivates policy representation learning. We therefore review related work along the following three lines.

Non-Markovian Environment. In the standard Reinforcement Learning (RL) setting of a Markov decision process (MDP), rewards depend only on the most recent state-action pair. In a non-Markovian reward decision process (NMRDP) [38], rewards depend on the history of state-action pairs. Building on temporal logics over finite traces, [35] adopt linear dynamic logic on finite traces to specify non-Markovian rewards and provide an automaton construction algorithm for building a Markovian model. The authors of [39] are concerned with both the specification and effective exploitation of non-Markovian rewards in the context of MDPs, and they specify non-Markovian reward-worthy behavior in LTL. Similarly, [40] use truncated LTL as a reward specification language, and [41] use ${\rm LTL}_{f}$ to specify desired complex behavior. Because temporal formulas are evaluated over an entire trace, it is difficult to guide the RL agent locally towards desirable behaviors. Data-driven approaches to learning NMRDP problems [42] often make use of domain-specific propositions and temporal logic operators [45, 43, 44].

Learning diverse policies via Reinforcement Learning. Our method belongs to this category. Several Reinforcement Learning (RL) based methods have been developed to explore diverse behaviors [46, 47]. The GEP-PG algorithm [48] uses Goal Exploration Processes [49] to generate diverse policies and combines them with the deep RL algorithm DDPG [8], which performs well in continuous control tasks. RSPO [50] explores diverse policies by solving a filtering-based objective, which restricts the RL policy to converge to a solution that differs from a set of locally optimal policies. The P3S-TD3 [51] method periodically determines the best policy among all learners and shares the best policy’s parameters with all learners, so that each learner can search for a better policy under the guidance of the best policy.

Policy representation learning. [52] propose a generative encoder-decoder method for modeling the agent’s policy. The encoder learns a point-based representation of different agent trajectories, and the decoder learns to reconstruct the modeled agent’s policy. Two meta-learning methods are proposed in [54, 53]; both regard the latent generative representation of the learning model’s parameters as the policy representation, and the method in [53] shows more stable performance. [55] propose relational forward models to model agents using graph neural networks. [56] use a VAE for agent modeling in fully-observable tasks. [57] propose the Theory of mind Network (TomNet), which learns embedding-based representations of modeled agents for meta-learning.

Learning diverse policies and effective policy representations is difficult with non-Markovian rewards, since the rewards depend on the history of states and actions and the supervision signals are usually sparse. Moreover, diverse policies should be obtained without sacrificing their effectiveness. To the best of our knowledge, ours is the first method to address policy diversity learning in non-Markovian environments.

3 Preliminaries

In this section, we present the necessary background relevant to the problem setting of this work.

A Markov Decision Process (MDP) is a mathematical framework for describing an environment in reinforcement learning. An MDP is a tuple $\mathcal{M}=\left\langle S,A,P,R,\gamma\right\rangle$ , where $S$ and $A$ represent the state space and action space, respectively. The state transition dynamics function is given by $P:S\times A\rightarrow S$ , which maps the current state $s\in S$ and action $a\in A$ to the next state $s^{\prime}\in S$ . The reward function is given by $R:S\times A\rightarrow\mathbb{R}$ , mapping state $s\in S$ and action $a\in A$ to a reward $r\in\mathbb{R}$ . $\gamma\in[0,1)$ denotes the discount factor.

A policy $\pi:S\rightarrow A$ is a mapping from $S$ to $A$ . A trajectory $\tau\in\scalerel*{\tau}{T}$ is a sequence of state-action pairs, $\tau=((s_{0},a_{0}),...,(s_{T-1},a_{T-1}))$ . In deep reinforcement learning, the policy $\pi$ is typically a neural network parameterized by a vector $\theta$ , and the goal is to optimize $\theta$ such that an agent equipped with policy $\pi_{\theta}$ in the environment described by a fixed MDP maximizes the expectation of $R(\tau)=\sum_{t=1}^{T}\gamma^{t-1}r_{t}$ , the cumulative discounted reward of a trajectory $\tau$ over an episode time-step horizon $T$ (assumed to be finite). The typical objective function of policy $\pi_{\theta}$ is as follows:

$$J(\pi_{\theta})=\mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)]. \tag{1}$$
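As a small illustration of this objective, the sketch below estimates the expected return by averaging discounted returns over sampled trajectories. The function names and the toy reward sequences are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum_{t=1}^{T} gamma^(t-1) * r_t for one trajectory's rewards."""
    total = 0.0
    # enumerate starts at 0, which matches the exponent t-1 for t = 1..T
    for exponent, reward in enumerate(rewards):
        total += (gamma ** exponent) * reward
    return total


def estimate_objective(reward_sequences, gamma):
    """Monte Carlo estimate of J: the mean discounted return over trajectories."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)


# Two toy rollouts with a single sparse reward each (illustrative data).
rollouts = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
j_hat = estimate_objective(rollouts, gamma=0.9)
```

With more rollouts from the same policy, the average would approach the expectation in the objective.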

A Non-Markovian Decision Process (NMDP) [38, 45, 58, 34] extends the MDP and assumes that the transition and reward functions depend on the history of state-action pairs. Formally, an NMDP is a tuple $M=\left\langle S,A,tr,R^{+},\gamma\right\rangle$ , where $S,A$ are as in an MDP, and $tr:S^{*}\times A\times S\rightarrow\Pi(S)$ is the transition function, i.e., $tr((s_{0},\ldots,s_{k}),a,s^{\prime})$ is the probability of reaching state $s^{\prime}$ when executing action $a$ given history $s_{0},...,s_{k}$ . Here $S^{*}$ is the set of finite, non-empty state sequences. A non-Markovian Reward (NMR) function is a mapping from the finite history of states and actions to a reward in $\mathbb{R}$ , denoted as $R^{+}:(S\times A)^{*}\rightarrow\mathbb{R}$ .
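To make the idea of a history-dependent reward concrete, here is a minimal sketch based on the coffee-delivery example from the introduction. The state and action names ("requested", "deliver", and so on) are hypothetical labels invented for this illustration.

```python
def coffee_reward(history):
    """Toy non-Markovian reward over a history of (state, action) pairs.

    Returns 1.0 only when the latest action is "deliver" and some earlier
    state in the history was "requested"; otherwise returns 0.0.
    """
    if not history:
        return 0.0
    *earlier, (_last_state, last_action) = history
    requested_before = any(state == "requested" for state, _ in earlier)
    return 1.0 if (last_action == "deliver" and requested_before) else 0.0


# Same final state-action pair, different histories, different rewards.
with_request = [("idle", "wait"), ("requested", "fetch"), ("kitchen", "deliver")]
without_request = [("idle", "wait"), ("idle", "fetch"), ("kitchen", "deliver")]
```

The two histories end in the same state-action pair, so a Markovian reward $R(s,a)$ could not tell them apart, while the history-based reward can.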

To improve exploration and robustness, novelty search methods [24, 59] and QD [19, 26] algorithms seek a set of policies with both high rewards and diverse behaviors, and explicitly augment the loss function with an additional term, as follows:

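$$J(\pi_{\theta})=(1-\lambda)\,\mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)]+\lambda\,\mathrm{Div}(\pi_{\theta}) \tag{2}$$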

where $\lambda\in(0,1)$ controls the trade-off between effectiveness (i.e., expected cumulative reward) and diversity, and $\mathrm{Div}(\pi_{\theta^{i}})=\frac{1}{M}\sum_{i,j\in M,i\neq j}{||b(\pi_{\theta^{i}})-b(\pi_{\theta^{j}})||}_{2}$ is the diversity of a particular policy $\pi_{\theta^{i}}$ , computed as the average Euclidean distance between $b(\pi_{\theta^{i}})$ and $\{b(\pi_{\theta^{j}})\}_{j\in M,j\neq i}$ , where $b(\pi_{\theta^{i}})$ denotes the behavior characterization of policy $\pi_{\theta^{i}}$ . In the following, for convenience, we write $\pi_{\theta^{i}}$ as $\pi_{i}$ .
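The diversity-augmented objective can be sketched in a few lines. The behavior characterizations below are made-up 2-D vectors, and the helper names are hypothetical; the code follows the prose reading of the diversity term (distances from policy $i$ to every other policy, scaled by $1/M$).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two behavior vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diversity(behaviors, i):
    """Div(pi_i): distances from b(pi_i) to all other behaviors, scaled by 1/M."""
    m = len(behaviors)
    return sum(euclidean(behaviors[i], behaviors[j])
               for j in range(m) if j != i) / m


def augmented_objective(expected_return, div, lam):
    """(1 - lambda) * expected return + lambda * diversity."""
    return (1.0 - lam) * expected_return + lam * div


# Illustrative behavior characterizations for M = 3 policies.
behaviors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
div_0 = diversity(behaviors, 0)          # (5 + 10) / 3
objective_0 = augmented_objective(10.0, div_0, lam=0.2)
```

A larger $\lambda$ shifts weight from the return toward the diversity term.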

4 Methodology

In this paper, we model the policy update as embedding dispersion, deriving different trajectories to explore policy diversity. In non-Markovian environments, it is difficult to learn effective policy embeddings: the rewards depend on the history of state-action pairs, so the embeddings need to capture reward-relevant historical information, and the supervision signals are usually sparse. We present a Transformer-based policy representation module that is capable of learning effective policy embeddings in non-Markovian environments. To mitigate the lack of supervision signals caused by non-Markovian rewards, we design a sample categorical distribution to sample trajectories for training the policy representation model. Further, we propose a policy dispersion scheme to disperse policy embeddings along different trajectories as the policy update progresses and use the embeddings to construct a dispersion matrix that guides the learning of diverse policies.

Our proposed method, D2PED (Discovering Diverse Policies via Embedding Dispersion), comprises two primary submodules: the policy representation module and the policy dispersion module. A schematic illustration of our method is shown in Fig. 2. In our method, $M$ parallel learners learn $M$ distinct policies and share a common replay buffer $D$ . The $M$ learners run in parallel in different copies of the same environment and employ a common base algorithm, which can be any off-policy RL algorithm.

4.1 Policy Representation Module

The policy representation module is employed to generate effective and diverse embeddings for different policies. Following [52], trajectories consisting of state-action pairs directly reflect the characteristics of a policy, so using trajectories to learn policy embeddings is an effective approach. In non-Markovian environments, rewards have a temporal nature in that each reward depends not only on the current state and action but also on the history of state-action pairs. The policy embeddings should therefore aggregate all the reward-relevant historical information that accumulates over an episode time-step horizon. For this reason, we learn the policy embeddings with a Transformer-based method, which is the current tool of choice for capturing long-horizon dependencies. Due to their temporal nature, non-Markovian rewards are usually sparse. To train the policy representation module with sparse supervision signals, we design a sample categorical distribution based on the cumulative rewards of the trajectories to select samples for training the policy representation module.

The representation module is shown in Fig. 2(b), which is utilized to obtain effective policy embeddings. Specifically, for the $m$ -th ( $m\in\{1,...,M\}$ ) policy $\pi_{m}$ , the representation module inputs a trajectory collected by $\pi_{m}$ and outputs a policy embedding $v_{m}$ for $\pi_{m}$ .

4.1.1 Sample Trajectories with Categorical Distribution

In most non-Markovian environments, the supervision signals for policy training are usually sparse. To learn effective policy embeddings, it is necessary to capture all the reward-relevant historical information that accumulates over an episode time-step horizon. If the rewards are sparse, very few rewards are available to capture this information, which is not conducive to learning policy embeddings. Thus, we build a sample categorical distribution from the cumulative rewards of trajectories to select high-quality (relatively high-reward) trajectories for learning policy embeddings.

Let $\tau_{i,m}$ be the $i$ -th trajectory collected by policy $\pi_{m}$ , and $r_{i,m}$ denotes the accumulated reward of $\tau_{i,m}$ . We regard $r_{i,m}$ as a parameter of the categorical distribution and use this distribution to sample the trajectories for training policy $\pi_{m}$ . Specifically, we construct the sample categorical distribution as follows:

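$$i_{m}\sim\mathrm{Cat}\left(\left(\frac{\sigma(r_{i,m})^{\frac{1}{T}}}{\sum_{j=1}^{N}\sigma(r_{j,m})^{\frac{1}{T}}}\right)_{i=1}^{N}\right), \tag{3}$$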

where $\sigma$ is the sigmoid function and $N$ is the number of trajectories sampled from replay buffer $D$ for policy $\pi_{m}$ ; $N$ is the same for every policy. The hyperparameter $T$ adjusts the “temperature” of the sampling distribution. The trajectory index $i_{m}$ is sampled from the categorical distribution for policy $\pi_{m}$ .

With the sample categorical distribution, trajectories with higher cumulative rewards are sampled more frequently to train the model, and the distribution is sampled multiple times for each of the $M$ policies to obtain samples for learning the $M$ policy embeddings. We regard the trajectory indexes sampled by Eq. (3) as the classification labels, which are used to train the representation model by minimizing the loss in Eq. (6) in Section 4.1.2.
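The reward-weighted sampling step can be sketched as follows, assuming the probabilities are the sigmoid of each trajectory's accumulated reward, raised to the power $1/T$ and normalized over the batch. The function names, the toy returns, and the temperature value are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sampling_probabilities(returns, temperature):
    """Normalize sigmoid(r_i)^(1/T) over the N trajectories of one policy."""
    weights = [sigmoid(r) ** (1.0 / temperature) for r in returns]
    total = sum(weights)
    return [w / total for w in weights]


def sample_trajectory_indices(returns, temperature, num_samples, rng):
    """Draw trajectory indices i_m from the categorical distribution."""
    probs = sampling_probabilities(returns, temperature)
    return rng.choices(range(len(returns)), weights=probs, k=num_samples)


# Accumulated rewards of N = 4 trajectories from one policy (toy values).
returns = [0.0, 0.0, 1.0, 5.0]
probs = sampling_probabilities(returns, temperature=0.5)
indices = sample_trajectory_indices(returns, 0.5, 8, random.Random(0))
```

Lowering the temperature sharpens the distribution toward the highest-return trajectories, while a large temperature flattens it toward uniform sampling.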

4.1.2 Generative Policy Representations via Capturing Reward-Relevant Historical Information

Non-Markovian rewards depend on the history of states and actions rather than solely on the immediate state and action. Thus, it is essential to understand and mine the complex characteristics of the trajectories and capture all reward-relevant historical information. We therefore use a Transformer-based architecture to capture long-horizon dependencies between states and actions via self-attention mechanisms.

To train the Transformer-based policy representation module, we model the problem as a trajectory classification task. First, we sample $N$ trajectories for each of the $M$ policies and shuffle all the trajectories. Then the trajectories are used to train the policy representation module, which outputs policy embeddings. Finally, we regard the policy index of each trajectory (i.e., which policy collected the trajectory) as its class label, and we assign the class labels to the policy embeddings.

Specifically, we design the policy representation module as a mapping function $\scalerel*{\tau}{T}\rightarrow\mathcal{V}\rightarrow Y$ parameterized by $\phi$ , which is the combination of Eq. (4) and Eq. (5), where $\scalerel*{\tau}{T}={\{\tau_{i,m}\}_{i=1,m=1}^{N,M}}$ denotes the set of trajectories collected by policies $\{\pi_{m}\}_{m=1}^{M}$ , $\mathcal{V}={\{v_{m}\}_{m=1}^{M}}$ is the set of policy embeddings, and $Y={\{y_{m}\}_{m=1}^{M}}$ is the set of policy indexes. The module takes trajectories as input and outputs policy embeddings, which are the essential ingredients for learning diverse policies. The embeddings are then classified into policy indexes to train this module.

The Transformer architecture can efficiently model sequential data. The model consists of stacked self-attention layers with residual connections. Each self-attention layer receives token embeddings and outputs embeddings that preserve the input dimensions. For the $i$ -th trajectory of horizon $T$ collected by policy $\pi_{m}$ , $\tau_{i,m}=(s_{i,m}^{1},a_{i,m}^{1},...,s_{i,m}^{T},a_{i,m}^{T})$ , we treat $\omega_{i,m}^{t}=(s_{i,m}^{t},a_{i,m}^{t})$ for each time step $t\in\{1,...,T\}$ as a token. We then regard $(\omega_{i,m}^{t})_{t=1}^{T}$ as a sequence and input it to the Transformer encoder (i.e., layer normalization (LN), a multi-headed self-attention (MSA) layer, and residual connections [60, 61]) to obtain the policy embedding:

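$$v_{i,m}=\mathrm{MSA}\big(\mathrm{LN}(\omega_{i,m}^{t})_{t=1}^{T}+E_{pos}\big)+(\omega_{i,m}^{t})_{t=1}^{T}, \tag{4}$$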

where $v_{i,m}\in\mathbb{R}^{K}$ is the $i$ -th embedding for policy $\pi_{m}$ and $E_{pos}$ is the positional encoding.
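A stripped-down sketch of one such encoder step is given below. It makes several simplifying assumptions that are not stated in the paper: a single attention head instead of multi-headed attention, tokens that already live in $\mathbb{R}^{K}$, random untrained projection matrices, a toy positional encoding, and mean pooling over time to turn the output sequence into one vector. All names are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (assumed simplification)."""
    q = [matvec(Wq, t) for t in tokens]
    k = [matvec(Wk, t) for t in tokens]
    v = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        attn = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(attn, v))
                    for d in range(len(v[0]))])
    return out


def encode_trajectory(tokens, pos, Wq, Wk, Wv):
    """Attention over LN(tokens) + pos, residual to the raw tokens, mean-pool."""
    normed = [[a + p for a, p in zip(layer_norm(t), e)]
              for t, e in zip(tokens, pos)]
    attended = self_attention(normed, Wq, Wk, Wv)
    residual = [[a + t for a, t in zip(row, tok)]
                for row, tok in zip(attended, tokens)]
    # Mean pooling over time is an assumption made to obtain one K-dim vector.
    return [sum(row[d] for row in residual) / len(residual)
            for d in range(len(tokens[0]))]


rng = random.Random(0)
T, K = 5, 4  # illustrative horizon and embedding size
tokens = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(T)]
pos = [[math.sin(t / (10 ** (d / K))) for d in range(K)] for t in range(T)]
Wq, Wk, Wv = ([[rng.gauss(0, 0.5) for _ in range(K)] for _ in range(K)]
              for _ in range(3))
v = encode_trajectory(tokens, pos, Wq, Wk, Wv)
```

A trained module would learn the projection matrices and would likely stack several such layers; the sketch only shows the data flow of a single step.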

To train the policy representation module with the policy indexes $Y={\{y_{m}\}_{m=1}^{M}}$ , each $v_{i,m}$ is processed by a classification layer (CL), formulated as:

[TABLE]

where $\xi_{i,m}$ is a vector in which the $j$ -th element ( $j\in\{1,2,...,M\}$ ) represents the probability that the policy embedding $v_{i,m}$ is assigned to the $j$ -th policy. Then we train the policy representation module and learn the policy embeddings by minimizing the loss function:

[TABLE]

where $y_{m}$ is the policy index (classification label) of $\tau_{i,m}$ obtained in Section 4.1.1.
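As an illustration of this training signal, the sketch below assumes the classification layer is a linear map followed by a softmax and that the loss is the standard cross-entropy over policy indexes. Both are assumptions, and the names `classification_probs`, `W`, and `b` are hypothetical.

```python
import numpy as np

def classification_probs(v, W, b):
    # Linear layer followed by a numerically stable softmax over the M policies.
    logits = v @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-probability assigned to the true policy index.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Toy check with M = 3 policies whose embeddings are the standard basis vectors.
embeddings = np.eye(3)
labels = np.array([0, 1, 2])
bias = np.zeros(3)
probs_good = classification_probs(embeddings, 5.0 * np.eye(3), bias)
loss_good = cross_entropy(probs_good, labels)
loss_uniform = cross_entropy(classification_probs(embeddings, np.zeros((3, 3)), bias), labels)
```

A classifier that cannot tell the policies apart has a loss of $\ln M$; embeddings that separate the policies drive the loss toward zero.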

We use the policy representation module to obtain $N$ policy embeddings for each of the $M$ policies. We then construct the dispersion matrix from the policy embeddings; it is used to measure the diversity of the policy set and to guide the policy dispersion in the embedding space.

4.2 Policy Dispersion Module

The candidate policies of the agent usually have consistent initialization states, and their embeddings are concentrated in a unified region. Subsequently, the embeddings are repeatedly constructed as the policy update progresses, forming different dispersion trajectories. To obtain diverse dispersion trajectories, maximizing the dispersion disagreements of dynamically updating trajectories is effective. In this setting, those diverse policies over varying dispersion trajectories could provide various decisions for the agent.

The policy dispersion module is shown in Fig. 2(c), which aims to disperse the dynamically updating trajectories over embedding reconstruction, that is, learning diverse policies for agent decisions. The key to learning diverse policies is to measure the diversity of the candidate policies, for which we construct a dispersion matrix using the policy embeddings.

In this setting, the dispersion matrix stacks the embeddings of the candidate policies and models how their trajectories disperse, together with the directions of dispersion. Following Determinantal Point Processes (DPPs) [62], the determinant of the dispersion matrix, which is proportional to the volume spanned by the geometric region associated with the matrix, can serve as a quantification of the disagreement of dispersion trajectories, that is, as a measure of the diversity of the candidate policies.

To construct the dispersion matrix, we randomly select one embedding for each of the $M$ policies from $\{v_{i,m}\}_{i=1,m=1}^{N,M}$ , constituting $\{v_{m}\}_{m=1}^{M}$ , and stack the $M$ selected embeddings. We then use the following definitions to construct the dispersion matrix and compute the policy diversity measure:
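The selection step can be sketched as follows, assuming the $N$ embeddings of each policy are stored in an array of shape $(M, N, K)$. The function name and the toy data are hypothetical.

```python
import numpy as np

def stack_one_embedding_per_policy(embeddings, rng):
    """embeddings has shape (M, N, K): N embeddings of dimension K for each of M policies."""
    M, N, _ = embeddings.shape
    picks = rng.integers(0, N, size=M)        # one random index per policy
    return embeddings[np.arange(M), picks]    # stacked matrix of shape (M, K)

# Toy data: every embedding of policy m is filled with the value m.
M, N, K = 5, 8, 3
embeddings = np.broadcast_to(np.arange(M, dtype=float)[:, None, None], (M, N, K))
V = stack_one_embedding_per_policy(embeddings, np.random.default_rng(0))
```

Row $m$ of the result is always one of the embeddings produced by policy $\pi_{m}$, so the stacked matrix keeps the policy order.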

Definition 4.1.

(Dispersion Matrix) Consider $M$ policies $\{\pi_{m}\}_{m=1}^{M}$ with embeddings denoted as $\{v_{m}\}_{m=1}^{M}$ , where $v_{m}\in\mathbb{R}^{K}$ . Let $V\triangleq[v_{1},...,v_{M}]$ , where $[\cdot]$ is the operator that stacks vectors into a matrix. We define the Dispersion Matrix of policies as $\mathbf{S}\triangleq F(V)$ , where $F(\cdot)$ is a function that transforms the $M\times K$ matrix $V$ to a $K\times K$ positive-definite matrix.

Definition 4.2.

(Policy Diversity) We define the diversity of policies as the determinant of their dispersion matrix, denoted as $\rm Div(\{v_{m}\}_{m=1}^{M})=\det(\mathbf{S})$ .

The matrix $V$ constructed by Definition 4.1 is an $M\times K$ matrix. If the number of candidate policies $M$ equals the dimension $K$ of the policy embeddings, i.e., $M=K$ , then $V$ is square and its determinant, which represents the dispersion disagreement of dynamically updating trajectories over embedding reconstruction, is a feasible measurement of the diversity of the $M$ policies. For scenarios with $M\neq K$ , the function $F(\cdot)$ transforms the polyhedron spanned by the $M$ $K$ -dimensional embeddings into a parallelepiped spanned by a $K$ -by- $K$ positive-definite matrix in another embedding space. The determinant of this square matrix can then still measure the diversity of the $M$ candidate policies.
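A small numerical illustration of the $M=K$ case: the absolute determinant of $V$ is the volume of the parallelepiped spanned by the embeddings, so nearly identical embeddings give a volume close to zero. The two matrices below are made-up examples.

```python
import numpy as np

# Three mutually orthogonal embeddings: the spanned parallelepiped is the unit cube.
V_diverse = np.eye(3)

# Three nearly identical embeddings: the spanned parallelepiped is almost flat.
V_similar = np.array([
    [1.0, 0.00, 0.00],
    [1.0, 0.01, 0.00],
    [1.0, 0.00, 0.01],
])

volume_diverse = abs(np.linalg.det(V_diverse))
volume_similar = abs(np.linalg.det(V_similar))
```

Here `volume_diverse` is 1 while `volume_similar` is four orders of magnitude smaller, which is the behavior a determinant-based diversity measure relies on.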

Now consider the total objective function ${J}(\Pi)$ of the candidate policies $\Pi=(\pi_{1},...,\pi_{M})$ , which is the sum of two terms: the first term is the same objective function $J(\pi_{m})$ as in the base algorithm, and the second term is the policy diversity $\rm Div(\{v_{m}\}_{m=1}^{M})$ . Specifically, the D2PED objective function is as follows:

[TABLE]

where $\beta\in(0,1)$ controls the trade-off between $J(\pi_{m})$ and $\rm Div(\{v_{m}\}_{m=1}^{M})$ .
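One way such a trade-off could be computed is sketched below. The weighting is an assumption: the code uses a convex combination, $(1-\beta)\sum_{m}J(\pi_{m})+\beta\,\mathrm{Div}$, chosen only because it matches the role of $\beta$ described here. The function name and the inputs are hypothetical.

```python
import numpy as np

def d2ped_objective(policy_returns, dispersion, beta):
    """Assumed form: (1 - beta) * sum of per-policy objectives + beta * det(dispersion)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in the open interval (0, 1)")
    quality = float(np.sum(policy_returns))
    diversity = float(np.linalg.det(dispersion))
    return (1.0 - beta) * quality + beta * diversity

# Toy numbers: three policies with returns 1, 2, 3 and a 2x2 dispersion matrix with det 4.
value = d2ped_objective([1.0, 2.0, 3.0], 2.0 * np.eye(2), beta=0.25)
```

With these numbers the quality term contributes $0.75\times 6$ and the diversity term $0.25\times 4$. A larger $\beta$ shifts weight from return toward diversity.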

Next, we give a theorem showing that our method yields policies that are both diverse and of high quality.

Theorem 4.1.

Consider $M$ policies for an environment characterized by a finite NMDP. Suppose an optimal policy $\pi_{m}$ achieves a cumulative reward of $R$ and a suboptimal policy $\tilde{\pi}$ achieves a cumulative reward of $R(\tilde{\pi})$ with $R(\tilde{\pi})+\Delta<R$ for some $\Delta>0$ . If the dispersion matrix $\mathbf{S}$ is positive definite, and $\Lambda\triangleq\prod_{i=1}^{K}s_{ii}<\frac{(1-\beta)}{\beta}\Delta$ , then the objective in Eq. (9) can only be maximized when all policies are optimal.

The proof of Theorem 4.1 is in Appendix A.1.
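The quantity $\Lambda$ in the theorem is the product of the diagonal entries of $\mathbf{S}$, which by Hadamard's inequality upper-bounds $\det(\mathbf{S})$ for any positive definite matrix. The sketch below checks this bound numerically and evaluates the theorem's condition for made-up values of $\beta$ and $\Delta$; the helper names are hypothetical.

```python
import numpy as np

def hadamard_bound(S):
    # Lambda in Theorem 4.1: product of the diagonal entries of S.
    return float(np.prod(np.diag(S)))

def theorem_condition_holds(S, beta, delta):
    # Checks Lambda < (1 - beta) / beta * delta.
    return hadamard_bound(S) < (1.0 - beta) / beta * delta

# A random positive definite matrix: A A^T plus a small multiple of the identity.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S_random = A @ A.T + 0.1 * np.eye(4)
det_random = float(np.linalg.det(S_random))

# A small dispersion matrix with Lambda = 0.1 ** 3.
S_small = 0.1 * np.eye(3)
```

For `S_small` with $\beta=0.5$, the condition holds when $\Delta=0.01$ and fails when $\Delta=0.0005$, so a small reward gap requires a correspondingly small $\Lambda$ or a smaller $\beta$.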

A method to construct a dispersion matrix. The theorem holds if the dispersion matrix is positive definite, so we give a method to construct a positive definite dispersion matrix. In general, we compute the covariance matrix of the policy embeddings and then regard the determinant of the covariance matrix as the diversity of policies in Definition 4.2. Specifically, we construct the covariance matrix $\mathbf{S}$ of $(v_{m})_{m=1}^{M}$ , where $v_{m}\triangleq(v_{m,k})_{k=1}^{K}$ . The $(i,k)$ -th element of $\mathbf{S}$ is ${s_{ik}=\frac{1}{M-1}\sum_{m=1}^{M}(v_{mi}-\bar{v}_{i})(v_{mk}-\bar{v}_{k})},$ where $M$ is the number of policies, and $\bar{v}_{i}\triangleq\frac{1}{M}\sum_{j=1}^{M}v_{ji}$ . However, a covariance matrix is only guaranteed to be positive semi-definite, so we use the following method to make it positive definite. If $\det(\textbf{S})=0$ , we replace $\textbf{S}$ with $\tilde{\textbf{S}}$ , whose $(i,i)$ -th element is $\tilde{s}_{ii}=s_{ii}+\sum_{j\neq i}|s_{ij}|$ and whose off-diagonal elements are unchanged. By the Gershgorin circle theorem [63], every eigenvalue of $\tilde{\textbf{S}}$ lies within $\sum_{j\neq i}|s_{ij}|$ of some $\tilde{s}_{ii}$ and is therefore at least $\min_{i}s_{ii}$ , so $\tilde{\textbf{S}}$ is positive definite whenever every embedding coordinate has nonzero variance.

In statistical analysis, the determinant of the covariance matrix is termed the generalized variance [64], which is proportional to the sum of squares of the volumes of all the different parallelotopes formed by using the policy embeddings as principal edges [65, 66]. Note that the volume of the geometric region enclosed by the policy embeddings can also reflect the policy diversity; a detailed explanation is presented in Appendix A.2. Therefore, $\det(\textbf{S})$ can be taken as the quantification of the disagreement of dispersion trajectories, that is, the diversity measure of the candidate policies.
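The construction can be sketched with NumPy as follows. The covariance and the diagonal correction follow the formulas in the text; the tolerance `tol` used to decide that the determinant is zero in floating point, and the function names, are assumptions.

```python
import numpy as np

def dispersion_matrix(V, tol=1e-12):
    """V is the (M, K) matrix of stacked embeddings; returns a (K, K) matrix."""
    # Sample covariance with the 1 / (M - 1) normalization; columns are the K coordinates.
    S = np.atleast_2d(np.cov(V, rowvar=False, ddof=1))
    if abs(np.linalg.det(S)) <= tol:
        # Add the absolute off-diagonal row sums to the diagonal.
        off_diagonal = np.abs(S).sum(axis=1) - np.abs(np.diag(S))
        S = S + np.diag(off_diagonal)
    return S

def diversity(V):
    return float(np.linalg.det(dispersion_matrix(V)))

# Regular case (M = 4 > K = 2): the covariance is already positive definite.
V_square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Degenerate case (M = 2 < K = 3): the covariance has rank one, so the correction applies.
V_degenerate = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
S_fixed = dispersion_matrix(V_degenerate)
```

In the regular case, doubling the spread of the embeddings multiplies the diversity by $2^{2K}$, so policies that move apart in embedding space are rewarded strongly. In the degenerate case, the corrected matrix is strictly diagonally dominant and all of its eigenvalues are positive.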

The D2PED algorithm is summarized in Alg. 1. D2PED employs a dispersion matrix to measure the diversity of the candidate policies and adds the determinant of the dispersion matrix to the policy objective function. To improve the efficiency of the algorithm, $M$ learners (i.e., $M$ diverse policies) run in parallel in different copies of the same environment and share one replay buffer.

5 Experiments

In this section, we present our experimental results and discuss their implications. First, we provide an overview of the experimental setup and the environments used to evaluate our models in Section 5.1. Second, in Sections 5.2 and 5.3, we report results in several different environments. Then, we provide an ablation analysis of the proposed methodology in Section 5.4. Finally, we summarize the hyperparameters used in the experiments in Section 5.5.

5.1 Experimental Setup

We conduct experiments to evaluate the performance of our method in three settings: continuous action space, discrete action space, and robustness. In non-Markovian environments, rewards have a temporal nature in that each reward depends not only on the current state and action but also on the history of state-action pairs. In Markovian environments, the reward depends only on the immediate state and action. Therefore, in a sense, the Markovian environment is a special case of the non-Markovian environment in which the history that the reward depends on reduces to the current state-action pair. To evaluate the robustness of our method, we also conduct experiments in Markovian environments.

5.1.1 Environments

Point. To explicitly examine whether our method can efficiently find high-quality policies with diverse behaviors in the non-Markovian environment, we create the Point-v1 environment with non-Markovian rewards and multiple solutions, which is modified from Point-v0 [29] and has continuous state and action spaces. An agent and a target (i.e., green cuboid) are separated by three U-shaped walls (see Fig. 3). When an episode starts, the target randomly sends a signal to the agent. When the agent receives the signal and reaches the target, it gets an event reward of 100. If the agent starts to move when there is no signal at the target, or if it hits a wall while moving, it gets an event reward of -100. The current episode ends immediately if the agent hits a wall or reaches the target. The agent only obtains a total reward at the end of each episode. The total reward is the sum of the event reward and the distance reward, which is the negative distance between the agent and the target at the end of the episode.
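The episode reward described above can be sketched as a small function. This is an illustration of the stated rules, not the environment's code: the function and argument names are hypothetical, the distance is assumed to be Euclidean, and an event reward of 0 for episodes that end without any of the listed events is an assumption.

```python
import math

def point_episode_reward(agent_xy, target_xy, signal_received,
                         reached_target, hit_wall, moved_without_signal):
    # Event reward: -100 for hitting a wall or moving without a signal,
    # +100 for reaching the target after receiving the signal.
    if hit_wall or moved_without_signal:
        event_reward = -100.0
    elif signal_received and reached_target:
        event_reward = 100.0
    else:
        event_reward = 0.0  # assumed value when no event occurs
    # Distance reward: negative distance to the target at the end of the episode.
    distance_reward = -math.dist(agent_xy, target_xy)
    return event_reward + distance_reward

success = point_episode_reward((0.0, 0.0), (0.0, 0.0), True, True, False, False)
crash = point_episode_reward((0.0, 0.0), (3.0, 4.0), True, False, True, False)
```

Because the reward depends on whether the signal arrived before the agent moved, it cannot be computed from the final state and action alone, which is what makes the task non-Markovian.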

FrozenLake. For discrete control tasks, we perform experiments in the FrozenLake-v1 environments with non-Markovian rewards. FrozenLake is a grid-world game in which some grids are walkable and some grids are holes that make the agent fall into the water (see Fig. 4). At the beginning of an episode, the agent waits in the upper left corner of the map, with the target grid in the lower right corner. The task randomly sends a start signal to the agent. If the agent receives the signal and reaches the goal grid, a reward of 1 is obtained. If the agent starts moving without receiving the signal, a reward of 0 is obtained. If the target is not reached within a finite number of time steps after receiving the signal, or the agent falls into a hole, the obtained reward is still 0. As the size of the grid and the density of holes increase, it becomes increasingly difficult for agents to reach the target grid.
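The reward rule can be written as a function of the episode history. The sketch below is an illustration only: the function name, the step-index arguments, and the outcome strings are hypothetical, and moving on the same step the signal arrives is assumed to be allowed.

```python
def frozenlake_episode_reward(signal_step, first_move_step, outcome):
    """signal_step / first_move_step are step indexes or None; outcome is 'goal', 'hole', or 'timeout'."""
    # Moving before the start signal (or when no signal was ever sent) yields no reward.
    moved_early = first_move_step is not None and (
        signal_step is None or first_move_step < signal_step
    )
    if moved_early:
        return 0.0
    # Otherwise only reaching the goal is rewarded; holes and timeouts give 0.
    return 1.0 if outcome == "goal" else 0.0

reward_ok = frozenlake_episode_reward(signal_step=2, first_move_step=3, outcome="goal")
reward_early = frozenlake_episode_reward(signal_step=5, first_move_step=1, outcome="goal")
reward_hole = frozenlake_episode_reward(signal_step=2, first_move_step=3, outcome="hole")
```

Two episodes that end in the same goal state can receive different rewards depending on when the agent first moved, so the reward depends on the history rather than on the final transition.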

MuJoCo. To examine the robustness of our method in environments with Markovian rewards, we conduct experiments in various tasks (Hopper (Fig. 5(a)), HalfCheetah (Fig. 5(b)), Humanoid (Fig. 5(c))) from the OpenAI Gym library [67]. In each task, the agent takes as input a vector of physical states and generates a vector of action values to manipulate the robots in the environment. We also create two multi-modal environments based on HalfCheetah and Ant, where we assign rewards for both Forward and Backward tasks to examine our method’s effectiveness in finding high-quality and diverse policies in Markovian reward environments.

5.1.2 Baseline Methods

The baseline methods (or simply “baselines”) adopted for comparison are the same in all environments. We select DvD [29], QD-RL [68], and P3S [51] as the baselines. For a fair comparison, we combine our method and the baselines with the same base algorithm (i.e., DQN, PPO, or TD3) and use the same hyperparameters for each environment. The number of candidate policies $M$ of these algorithms is always set to 5. We report the mean and standard deviation across the same five seeds for all algorithms and all tasks. The experiments are performed using the Ray [69] library for multi-process parallel computation on a computer with 16 cores.

5.2 Learning Diversity Policies in Non-Markovian Reward Environments

In this experiment, we consider two environments, Point-v1 and FrozenLake-v1, with non-Markovian rewards and multi-solutions to explicitly examine whether D2PED can efficiently find high-quality policies with diverse behaviors.

Continuous Control. In the Point-v1 environment, we combine our method and baselines with PPO [12]. The results are shown in Fig. 6. DvD-PPO achieves the second-best performance but is still 20% worse than our method. DvD-PPO neglects the long-horizon dependencies on the history of states and actions in non-Markovian reward environments, leading to imperfect policy embeddings and dispersion trajectories that cannot provide effective diversity guidance. QD-PPO shows unstable performance and unsatisfactory policy diversity, which is due to its instability in handling sparse rewards. P3S-PPO eventually gets trapped in a less attractive local optimum: it searches policies only according to the previous best policy, which causes a rapid improvement of a particular policy but may also confine the search to a small region of the policy space. The results imply that D2PED-PPO can achieve better performance with far fewer training iterations, which shows the advantage of the policy dispersion scheme of D2PED in non-Markovian reward environments.

To intuitively measure the diversity of different methods, we visualize the paths of five candidate policies of each method which are learned after 500 training iterations in the Point-v1 environment in Fig. 7 for our method and baselines. The lines represent the paths of five candidate policies, the orange point and the green square represent the start position and target position, respectively, and the black polylines represent the walls that separate the agent and the target.

Overall, all policies of our method reach the target successfully in the Point-v1 environment. Moreover, our method manages to bypass the three walls over the upper wall, beneath the bottom wall, and through two gaps between walls, i.e., the gap between the middle wall and the upper wall and the gap between the middle wall and the bottom wall, which demonstrates the excellence of our method in learning diverse policies. Besides, the paths of our method from the same side are also different from each other.

In comparison, the paths of P3S-PPO overlap over a long period, and most policies get stuck in front of the walls, which demonstrates that P3S-PPO cannot work well in the non-Markovian environment. For QD-PPO, only two policies can bypass the wall through the gap between the middle wall and the upper wall, and the others are stuck in front of the walls. One of the policies hits the edge of the upper wall and stumbles to the target. DvD-PPO shows similar results.

Only a fraction of the candidate policies of DvD-PPO, QD-PPO, and P3S-PPO can reach the target, and these policies can only move through the two gaps. In comparison, all of the policies of our method can reach the target along different paths, including the gaps between walls and the spaces over or beneath the walls, demonstrating that the policy dispersion scheme of D2PED achieves better performance in learning diverse policies.

Discrete Control. We conduct experiments in three versions of the FrozenLake-v1 environment, with $4\times 4$ , $5\times 5$ , and $8\times 8$ grids, and we combine our method D2PED and the baselines with DQN [70]. For each policy of each method, we first train for 4000 episodes; Table I then reports the average number of times the five candidate policies of each method reach the target in the three versions of FrozenLake-v1. D2PED-DQN reaches the target more times than DvD-DQN, P3S-DQN, and QD-DQN in all three environments. The experimental results show that the policy dispersion scheme of D2PED provides better exploration performance in non-Markovian reward environments and enables agents to learn effective diverse policies.

We visualize the paths of the candidate policies of each method which are learned after 4000 training episodes in one of the FrozenLake-v1 environments (see Fig. 8). The environment is an $8\times 8$ grid with $28$ holes, where square S in the upper left corner and square G in the lower right corner represent the start position and target position, respectively. Squares marked with H represent holes, and blank squares represent ice. We use different colors of polylines to represent paths of different policies.

As shown in Fig. 8, all policies of our method reach the target. Besides, our D2PED method learns three types of routes to reach the target, including the most tortuous route in the middle. P3S-DQN only learns one type of route to the target, and one of its policies falls into a hole. This demonstrates that P3S-DQN can only learn similar policies and cannot properly handle non-Markovian rewards. QD-DQN also learns only one type of route to the target. Although QD-DQN explores the other two kinds of paths, all these policies fail to reach the target and fall into holes. DvD-DQN learns two types of routes to the target, and one policy still drives the agent into a hole. Only a part of the candidate policies of DvD-DQN, QD-DQN, and P3S-DQN can reach the target, and these policies can only generate one or two types of routes, while D2PED-DQN can explore three types of routes, which demonstrates the effectiveness of the policy dispersion scheme of D2PED in learning diverse policies.

5.3 Performance Comparison in Markovian Reward Environments

Single modal. To examine the robustness of D2PED in environments with Markovian reward, we conduct experiments in three standard MuJoCo environments. We combine our method D2PED and baselines with TD3 [13]. The accumulated rewards versus the number of training iterations are shown in Fig. 9.

In the three environments, the convergence speed of D2PED-TD3 is faster than DvD-TD3, QD-TD3, and P3S-TD3. We can observe that D2PED-TD3 can always achieve the best final performance. Although P3S-TD3 can sometimes learn faster at the early stage of optimization, e.g., in the Humanoid-v2 environment, its final performance is worse than the other methods. The results show our method can achieve better performance than baselines in environments with Markovian rewards. Moreover, the results also reflect that during the training process, policy diversity learning by the policy dispersion scheme of D2PED is not achieved by sacrificing effectiveness.

Multi modal. To examine the effectiveness of D2PED in learning diverse and high-quality policies in Markovian environments, we change the original single-mode environments HalfCheetah and Ant into two-mode environments by assigning rewards for both Forward and Backward tasks. All the methods are trained in the two-mode environments and tested in the single-mode environments.

To compare different algorithms more clearly, we organize the experimental results into Table II. We observe that P3S-TD3 can learn to handle the Forward tasks but fails in the Backward tasks of both environments. DvD-TD3 can learn to handle both tasks of the Ant environment and the Backward task of the HalfCheetah environment. Only QD-TD3 and our method can handle both tasks in both environments, and our method obtains larger cumulative rewards than QD-TD3. This demonstrates that D2PED can learn high-quality and diverse policies in Markovian reward environments.

The single-modal and multi-modal results show that our method outperforms the baselines, so it retains good performance in Markovian environments, demonstrating its strong robustness.

5.4 Ablation Studies

In this section, we conduct ablation studies to analyze the relative contribution of our policy representation module and the sensitivity of the number of candidate policies $M$ .

The contribution of our policy representation module. One of the crucial parts of this work is the policy representation module. To demonstrate the importance of our policy representation module, we replace it with two policy embedding methods: auto-encoder and behavioral embedding, respectively. The auto-encoder method takes the trajectories generated by $\pi_{m}$ as input and outputs the embedding for $\pi_{m}$ . The behavioral embedding method is an action-based behavior characterization method. Following the implementation in [29], we randomly select 20 states and concatenate the actions of the selected states as the policy embedding.
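The behavioral embedding baseline can be sketched as follows. This is a minimal illustration of "concatenate the actions taken at 20 selected states", not the implementation from [29]: the probe-state distribution, the state and action dimensions, and the two toy policies are assumptions.

```python
import numpy as np

def behavioral_embedding(policy, probe_states):
    # Concatenate the policy's actions on a fixed set of probe states.
    return np.concatenate([np.atleast_1d(policy(s)) for s in probe_states])

rng = np.random.default_rng(0)
probe_states = rng.normal(size=(20, 4))   # 20 randomly selected states (hypothetical 4-D states)

def policy_a(state):
    return np.tanh(state[:2])             # toy deterministic policy with 2-D actions

def policy_b(state):
    return -np.tanh(state[:2])            # toy policy that always acts in the opposite direction

emb_a = behavioral_embedding(policy_a, probe_states)
emb_b = behavioral_embedding(policy_b, probe_states)
```

Each action is queried at an isolated state, so the embedding carries no information about the order in which states and actions occur along a trajectory.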

The experimental results on $7$ environments are shown in Fig. 10 and Table III. Using the policy embeddings generated by the auto-encoder method or the behavioral embedding method significantly reduces the performance of D2PED. The behavioral embedding method cannot extract the mutual dependencies between states and actions, and the auto-encoder cannot properly characterize the long-horizon dependencies of states and actions.

The sensitivity of the number of candidate policies. The number $M$ of candidate policies directly affects the number of policy embeddings used to construct the dispersion matrix. If the number of candidate policies is too small, it is less likely to find most of the optimal policies with diverse behaviors. If the number of candidate policies is too large, some policies with similar behavior will be selected and updated, harming the efficiency of D2PED.

Fig. 11 and Table IV show the performance of D2PED with different numbers of candidate policies. We report the results under the same number of iterations. D2PED achieves better performance than the baseline methods when $M$ is $3$ , $5$ , or $7$ , which shows that our method is not very sensitive to the policy set size.

5.5 Hyperparameters

Our method is combined with different reinforcement learning methods (PPO, DQN, TD3) in the experiments. For all methods, Adam is used as the gradient optimizer. In every experiment, our method trains policy embeddings. The hyperparameters and details are reported below.

Policy embedding. In Table V, we show the hyperparameters of the policy representation module of D2PED. The three versions of the FrozenLake-v1 environment use the same settings. Since the state dimensions of the Point-v1 environment and the Hopper environment are smaller than those of the other environments, the embedding dimensions and attention depths for these two environments are also set to be smaller.

D2PED-PPO. In Table VI, we show the hyperparameters used in PPO when combined with D2PED. The networks for the policy and the critic have two hidden layers of dimension $64$ . The non-linearity of the hidden layers is the hyperbolic tangent (tanh).

D2PED-DQN. In Table VII, we show the hyperparameters used in DQN when combined with D2PED. The three FrozenLake-v1 environments use the same settings. The network has one hidden layer of dimension 10, and the non-linearity of the hidden layer is ReLU.

D2PED-TD3. In Table VIII, we show the hyperparameters used in TD3 when combined with D2PED. The networks for the two Q-functions and the policy have 2 hidden layers. Both the first and second layers have a dimension of 256. The non-linearity of the hidden layers is ReLU. The last layer of the actor network has a hyperbolic tangent (tanh) activation function, whereas the last layer of the critic networks has a linear activation function.

6 Conclusion

In this paper, we introduced D2PED, a method for learning effective diverse policies for control tasks in non-Markovian environments. D2PED designs a policy dispersion scheme that repeatedly constructs policy embeddings as the policy update progresses, forming different dispersion trajectories and maximizing the dispersion disagreements of dynamically updating trajectories. To support the proposed method, we show analytically that the diversity of candidate policies can be achieved without sacrificing their effectiveness, and the experimental results in non-Markovian environments provide evidence for this analysis. Moreover, the experimental results show that D2PED provides more effective policy exploration and more diverse behaviors than the selected baselines. We further examine the performance of D2PED in Markovian environments, which verifies its robustness.

Acknowledgments

The authors would like to thank…

Appendix A

A.1 Theoretical Results

A.1.1 Proof of Theorem 4.1

Proof.

Recall that $\rm Div(\{v_{m}\}_{m=1}^{M})=\det(\mathbf{S})$ . Since $\mathbf{S}$ is positive definite, according to Hadamard’s inequality [71], the following bounds hold:

$$0<\det(\mathbf{S})\leq\prod_{i=1}^{K}s_{ii}=\Lambda,$$

where $s_{ii}$ is the $(i,i)$ -th element of $\mathbf{S}$ . Let $\{\tilde{\pi}_{m}\}_{m=1}^{M}$ be a set of policies with at least one suboptimal policy and $\{\pi_{m}\}_{m=1}^{M}$ be a set of optimal policies. Suppose an optimal policy $\pi_{m}$ achieves a cumulative reward of $R$ and a suboptimal policy $\tilde{\pi}$ achieves a cumulative reward of $R(\tilde{\pi})$ with $R(\tilde{\pi})+\Delta<R$ for some $\Delta>0$ . Then the following holds:

[TABLE]

For the set of optimal policies $\{\pi_{m}\}_{m=1}^{M}$ , we have:

[TABLE]

From (9) and (10), if $\Delta>\frac{\beta}{1-\beta}\Lambda$ , we have:

[TABLE]

Since $\rm Div(\{v_{m}\}_{m=1}^{M})>0$ , there exist at least $M$ distinct solutions. We therefore conclude that whenever $\Delta>\frac{\beta}{1-\beta}\Lambda$ , the objective in (9) can only be maximized when all policies are optimal.

∎

A.2 Explanation for Generalized Variance

Consider $M$ policy embeddings $v=(v_{m})_{m=1}^{M}$ with dimension $K$ . We stack these vectors as a matrix:

[TABLE]

We then construct the covariance matrix $\mathbf{S}$ of $(v_{m})_{m=1}^{M}$ , whose $(i,k)$ -th element is:

$$s_{ik}=\frac{1}{M-1}\sum_{m=1}^{M}(v_{mi}-\bar{v}_{i})(v_{mk}-\bar{v}_{k}),$$

where $\bar{v}_{i}\triangleq\frac{1}{M}\sum_{m=1}^{M}v_{mi},1\leq i\leq K$ . The covariance matrix $\mathbf{S}$ is denoted as:

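$$\mathbf{S}=\begin{bmatrix}s_{11}&s_{12}&\cdots&s_{1K}\\ s_{21}&s_{22}&\cdots&s_{2K}\\ \vdots&\vdots&\ddots&\vdots\\ s_{K1}&s_{K2}&\cdots&s_{KK}\end{bmatrix}.$$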

The determinant of $\mathbf{S}$ is called the generalized variance [64], which measures the degree of scatter of the sample data. A large $\det(\textbf{S})$ corresponds to sufficiently dispersed data points. Next, we explain it in terms of both entropy and geometry.

From the perspective of entropy, we assume that the policy embeddings follow a Gaussian distribution, and their differential entropy is formulated as:

$$H=\frac{K}{2}\left(1+\ln(2\pi)\right)+\frac{1}{2}\ln\det(\mathbf{S}),$$

where $K$ is the dimension of the space. We can observe that the entropy of the policy embeddings monotonically increases with the determinant of $\mathbf{S}$ . According to the nature of entropy, a large determinant of $\mathbf{S}$ corresponds to a large $H$ , which means that the policy embeddings are sufficiently dispersed in the policy space.
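The monotone relation between entropy and $\det(\mathbf{S})$ can be checked numerically with the standard differential entropy of a $K$-dimensional Gaussian, $\frac{K}{2}(1+\ln 2\pi)+\frac{1}{2}\ln\det(\mathbf{S})$, measured in nats. The function name below is hypothetical.

```python
import numpy as np

def gaussian_entropy(S):
    """Differential entropy (in nats) of a Gaussian with covariance matrix S."""
    K = S.shape[0]
    sign, logdet = np.linalg.slogdet(S)
    if sign <= 0:
        raise ValueError("S must be positive definite")
    return 0.5 * K * (1.0 + np.log(2.0 * np.pi)) + 0.5 * logdet

entropy_unit = gaussian_entropy(np.eye(2))         # det(S) = 1
entropy_wide = gaussian_entropy(4.0 * np.eye(2))   # det(S) = 16
```

Scaling the covariance from the identity to $4\mathbf{I}$ multiplies the determinant by 16 and raises the entropy by $\frac{1}{2}\ln 16=\ln 4$.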

From a geometric perspective, the generalized variance of the policy embeddings is proportional to the sum of squares of the volumes of all the different parallelotopes formed by using the policy embeddings as principal edges [65, 66]. Thus, $\det(\textbf{S})$ can be taken as the quantification of the disagreement of dispersion trajectories, that is, the diversity measure of the candidate policies. Recall that $\bar{v}=(\bar{v}_{1},\bar{v}_{2},\cdots,\bar{v}_{K})$ are the means of the rows of $\mathbf{V}$ . Then a hyperellipsoid (an ellipse if $K=2$ ) centered at $\bar{v}$ is defined as

[TABLE]

the volume of this hyperellipsoid is expressed as:

[TABLE]

where $a_{x}$ is a constant scalar. We can observe that the volume of the hyperellipsoid monotonically increases with $\det(\mathbf{S})$ . A large volume corresponds to a large generalized variance and hence a large disagreement of dispersion trajectories.
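As a numerical illustration, the sketch below assumes the hyperellipsoid is the set $\{x:(x-\bar{v})^{\top}\mathbf{S}^{-1}(x-\bar{v})\leq c^{2}\}$ and uses the standard volume formula $\frac{\pi^{K/2}}{\Gamma(K/2+1)}\,c^{K}\sqrt{\det(\mathbf{S})}$. The radius parameter `c` and the function name are hypothetical and need not match the constant $a_{x}$ above.

```python
import math
import numpy as np

def ellipsoid_volume(S, c=1.0):
    """Volume of {x : (x - mean)^T S^{-1} (x - mean) <= c^2} for a positive definite S."""
    K = S.shape[0]
    # Volume of the K-dimensional unit ball.
    unit_ball = math.pi ** (K / 2) / math.gamma(K / 2 + 1)
    return unit_ball * c ** K * math.sqrt(np.linalg.det(S))

area_circle = ellipsoid_volume(np.eye(2))                # unit disc
area_ellipse = ellipsoid_volume(np.diag([4.0, 1.0]))     # semi-axes 2 and 1
```

For fixed $K$ and `c`, the volume grows with $\sqrt{\det(\mathbf{S})}$: the unit disc has area $\pi$ and the ellipse with semi-axes 2 and 1 has area $2\pi$.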

Bibliography


[1]
[2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[3] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[4] D. Zha, J. Xie, W. Ma, S. Zhang, X. Lian, X. Hu, and J. Liu, “Douzero: Mastering doudizhu with self-play deep reinforcement learning,” in International Conference on Machine Learning. PMLR, 2021, pp. 12333–12344.
[5] E. Zhao, R. Yan, J. Li, K. Li, and J. Xing, “Alphaholdem: High-performance artificial intelligence for heads-up no-limit poker via end-to-end reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 4, 2022, pp. 4689–4697.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations, 2016.
