# Policy Dispersion in Non-Markovian Environment

**Authors:** Bohao Qu, Xiaofeng Cao, Jielong Yang, Hechang Chen, Chang Yi, Ivor W. Tsang, Yew-Soon Ong

arXiv: 2302.14509 · 2024-06-04

## TL;DR

This paper introduces a transformer-based policy dispersion method for non-Markovian environments, enabling the learning of diverse, expressive policies that improve robustness and adaptability in reinforcement learning tasks.

## Contribution

It proposes a novel policy dispersion scheme using transformer-based embeddings and a positive definite dispersion matrix to enhance policy diversity in non-Markovian settings.
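
To make the positive-definiteness condition concrete, here is a minimal sketch in plain Python. This summary does not give the paper's actual construction of the dispersion matrix, so the sketch assumes a stand-in: the Gram matrix of row-stacked embeddings. The embeddings are hand-written toy vectors rather than transformer outputs, and the helper names `gram_matrix` and `is_positive_definite` are hypothetical. It relies on a standard linear-algebra fact: a Gram matrix is positive definite exactly when the stacked vectors are linearly independent, so a collapsed (redundant) set of embeddings fails the check.

```python
import math


def gram_matrix(embeddings):
    """Return G with G[i][j] = <e_i, e_j> for row-stacked embeddings.

    Assumption: the Gram matrix is an illustrative stand-in, not the
    paper's dispersion matrix.
    """
    return [[sum(a * b for a, b in zip(ei, ej)) for ej in embeddings]
            for ei in embeddings]


def is_positive_definite(matrix, tol=1e-10):
    """Try a Cholesky factorization of a symmetric matrix.

    The factorization succeeds with strictly positive pivots exactly
    when the matrix is positive definite.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= tol:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True


# Toy embeddings (hypothetical values): three linearly independent rows.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
# The second row is a multiple of the first, so two "policies" coincide
# up to scale and the Gram matrix is singular.
collapsed = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(is_positive_definite(gram_matrix(diverse)))    # True
print(is_positive_definite(gram_matrix(collapsed)))  # False
```

The Cholesky test is used here because it needs no external library; with a numerical package available, checking that all eigenvalues are positive would serve the same purpose.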

## Key findings

- Diverse policies lead to more robust performance.
- The dispersion scheme outperforms recent baselines.
- If the dispersion matrix is positive definite, the dispersed embeddings enlarge the disagreements across policies.

## Abstract

The Markov Decision Process (MDP) provides a mathematical framework for formulating the learning processes of agents in reinforcement learning. The MDP is limited by the Markovian assumption that a reward depends only on the immediate state and action. However, a reward sometimes depends on the history of states and actions, in which case the decision process takes place in a non-Markovian environment. In such environments, agents receive rewards sparsely, through temporally extended behaviors, and the learned policies may be similar. Agents with similar policies generally overfit to the given task and cannot quickly adapt to perturbations of the environment. To resolve this problem, this paper learns diverse policies from the history of state-action pairs in a non-Markovian environment, using a policy dispersion scheme designed to seek diverse policy representations. Specifically, we first adopt a transformer-based method to learn policy embeddings. Then, we stack the policy embeddings to construct a dispersion matrix that induces a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings can effectively enlarge the disagreements across policies, yielding a diverse expression of the original policy embedding distribution. Experimental results show that this dispersion scheme obtains more expressive and diverse policies, which in turn deliver more robust performance than recent learning baselines across various learning environments.
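
The distinction the abstract draws between Markovian and history-dependent rewards can be illustrated with a toy example. The sketch below is not taken from the paper: the key-and-door task, the state and action names, and the function `history_reward` are all hypothetical. It shows two trajectories that end in the same state-action pair yet receive different rewards, which a reward function of the immediate state and action alone could not express.

```python
def history_reward(history):
    """Toy non-Markovian reward over a list of (state, action) pairs.

    Returns 1.0 only if the latest state is 'door' AND 'key' was
    visited at some earlier step; otherwise returns 0.0.
    """
    if not history:
        return 0.0
    state, _action = history[-1]
    seen_key = any(s == "key" for s, _ in history[:-1])
    return 1.0 if state == "door" and seen_key else 0.0


# Both trajectories end with the same final pair ("door", "open").
with_key = [("start", "right"), ("key", "right"), ("door", "open")]
without_key = [("start", "right"), ("hall", "right"), ("door", "open")]

print(history_reward(with_key))     # 1.0
print(history_reward(without_key))  # 0.0
```

Because the reward arrives only at the end of a multi-step behavior, this toy task is also sparse in the sense the abstract describes.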

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.14509/full.md

## Figures

38 figures with captions in the complete paper: https://tomesphere.com/paper/2302.14509/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/2302.14509/full.md

---
Source: https://tomesphere.com/paper/2302.14509