Object Exchangeability in Reinforcement Learning: Extended Abstract

John Mern; Dorsa Sadigh; Mykel Kochenderfer

arXiv:1905.02698·cs.LG·May 8, 2019


TL;DR

This paper introduces an attention-based input representation method for deep reinforcement learning that is invariant to input ordering, significantly improving sample efficiency and enabling solutions to previously intractable problems.

Contribution

The paper proposes a novel attention-based representation technique that reduces the search space and enhances sample efficiency in reinforcement learning tasks.

Findings

1. Improved sample efficiency in policy gradient methods.

2. Reduced search space by a factor of m! for m objects.

3. Ability to solve complex problems previously intractable.

Abstract

Although deep reinforcement learning has advanced significantly over the past several years, sample efficiency remains a major challenge. Careful choice of input representations can help improve efficiency depending on the structure present in the problem. In this work, we present an attention-based method to project inputs into an efficient representation space that is invariant under changes to input ordering. We show that our proposed representation results in a search space that is a factor of m! smaller for inputs of m objects. Our experiments demonstrate improvements in sample efficiency for policy gradient methods on a variety of tasks. We show that our representation allows us to solve problems that are otherwise intractable when using naive approaches.

Equations

$$|\mathcal{S}| = \frac{n!}{(n-m)!}, \qquad |\hat{\mathcal{S}}| = \frac{n!}{m!\,(n-m)!}$$

$$f(\mathcal{X}) = \rho\Big(\sum_{x\in\mathcal{X}}\phi(x)\Big)$$


Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Evolutionary Algorithms and Applications

Full text

In Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), May 13–17, 2019, Montreal, Canada. N. Agmon, M. E. Taylor, E. Elkind, M. Veloso (eds.).

Stanford University, Stanford, California 94305



Key words and phrases:

Knowledge Representation; Reasoning; Reinforcement Learning

1. Introduction

Deep reinforcement learning (RL) has achieved state-of-the-art performance across a variety of tasks Mnih et al. (2013); Silver et al. (2017). However, successful deep RL training requires large amounts of sample data. Various learning methods have been proposed to improve sample efficiency, such as model-based learning and incorporation of Bayesian priors Gu et al. (2016); Spector and Belongie (2018).

The key insight of this paper is that we can significantly improve efficiency by leveraging the exchangeable structure inherent in many reinforcement learning problems. That is, for a state space that can be factored into sets of sub-states, presenting the factored state in a way that does not rely on a particular ordering of the sub-states can lead to a significant reduction in the search space.

In this work, we propose an attention mechanism as a means to leverage object exchangeability. The mechanism is permutation invariant in that it produces the same output for any permutation of the items in the input set. We show that this representation reduces the input search space by a factor of up to $m!$, where $m$ is the number of exchangeable objects.

2. Background and Related Work

Deep RL is a class of methods to solve Markov Decision Processes (MDPs) using deep neural networks. Solving an MDP requires finding a policy $\pi$ that maps each state in a state space $\mathcal{S}$ to an action so as to maximize total accumulated reward.

Formally, we can define an object in an MDP to be a subset of the state space that defines the state of a single entity in the problem environment. In an aircraft collision avoidance problem, an object could be defined by the values associated with a single aircraft. It is well known that as the number of objects grows, the size of the MDP search space grows exponentially Robbel et al. (2016).

For MDPs with exchangeable objects, an optimal policy should provide the same action for any permutation of the input. When states are represented as ordered sets, as is common, this must be learned by the policy during training.

Many methods have instead been proposed to enforce this invariance through permutation invariant input representations. The Object Oriented MDP (OO-MDP) framework uses object-class exchangeability to represent states in an order-invariant space for discrete spaces Diuk et al. (2008). Approximately Optimal State Abstractions Abel et al. (2016) proposes a theoretical approximation to extend OO-MDP to continuous domains. Object-Focused Q-learning Cobo et al. (2013) uses object classes to decompose the Q-function output space, though it does not address the input.

Deep Sets Zaheer et al. (2017) proposes a permutation invariant abstraction method to produce input vectors from exchangeable sets. The proposed method produces a static mapping. That is, each input object is weighted equally during the mapping, regardless of its value.

Our method improves upon Deep Sets by introducing an attention mechanism to dynamically map the inputs to the permutation-invariant space. Attention mechanisms are used in various deep learning tasks to dynamically filter the input to a downstream neural network so as to emphasize the most important parts of the original input Xu et al. (2015); Luong et al. (2015); Jaderberg et al. (2015). We adapt a mechanism from recent work in natural language processing, which uses a dot-product neural layer to efficiently apply dynamic attention Vaswani et al. (2017).

3. Problem Formalization

Our objective is to propose an attention mechanism that will take sets of objects $S_{i}$ as inputs and produce abstractions $S_{i}^{*}$ such that the mapping is permutation invariant. This output can then be used as an input to the policy neural network in an RL problem (e.g. a deep Q-Network or action policy). Our hypothesis is that learning using this abstract representation will be more sample efficient than learning on the original object set.

We propose the attention network architecture shown in fig. 1, which is a permutation invariant implementation of dot-product attention. For a single input set $S_{i}\leftarrow\{s_{i}^{(1)},\ldots,s_{i}^{(m)}\}$ , the $m$ object state vectors are individually passed through feed-forward neural networks $\pi_{\text{filter}}$ and $\pi_{\text{inputs}}$ . The scalar outputs of the filter graph are concatenated into a single vector $y_{i}\in\mathbb{R}^{m}$ and the softmax operation is applied. These outputs are then multiplied element-wise by the concatenated outputs of the network $\pi_{\text{inputs}}$ . In this way, the output of $\pi_{\text{filter}}$ acts as the attention filter, weighting the inputs by importance prior to summation. The elements of the weighted vector $z_{i}$ are then summed over the $m$ different objects, resulting in a single vector $s^{*}_{i}\in\mathbb{R}^{k}$ . This $s^{*}_{i}$ vector is then used as the input to the policy neural network.
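The pooling step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the single tanh layers standing in for $\pi_{\text{filter}}$ and $\pi_{\text{inputs}}$, the helper `random_layer`, and the dimensions `d`, `k`, and `m` are all assumptions chosen for brevity.

```python
import math
import random


def linear(weights, bias, x):
    """Apply an affine map and return W x + b as a list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(objects, filter_params, input_params):
    """Map a set of m object vectors to one k-dimensional vector s*.

    filter_params stands in for pi_filter (one scalar score per object) and
    input_params for pi_inputs (one k-vector per object). Both are single
    tanh layers here, which is an assumption made for brevity.
    """
    w_f, b_f = filter_params
    w_in, b_in = input_params
    scores = [math.tanh(linear(w_f, b_f, s)[0]) for s in objects]
    weights = softmax(scores)  # attention filter over the m objects
    features = [[math.tanh(v) for v in linear(w_in, b_in, s)] for s in objects]
    k = len(b_in)
    # Weight each object's features, then sum over the objects.
    return [sum(a * f[j] for a, f in zip(weights, features)) for j in range(k)]


def random_layer(rng, n_out, n_in):
    """Hypothetical helper: random weights and biases for one layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, bias


rng = random.Random(0)
d, k, m = 2, 3, 4  # object dimension, output dimension, number of objects
filter_params = random_layer(rng, 1, d)
input_params = random_layer(rng, k, d)
objects = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(m)]

pooled = attention_pool(objects, filter_params, input_params)
shuffled = attention_pool(objects[::-1], filter_params, input_params)
# The two outputs agree up to floating-point summation order.
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(pooled, shuffled))
```

Because the same two networks are applied to every object and the weighted features are summed, reordering the objects only reorders the terms of the sum, which is why the final assertion holds.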

We can now define bounds on the sample efficiency benefits of an invariant mapping. Define a state space $\mathcal{S}$ such that $\{s_{1},\ldots,s_{m}\}\in\mathcal{S}$ , where $m$ is the number of objects. Let each object $s_{i}$ take on $n$ unique values. Representing the states as ordered sets of $s_{i}$ results in a state-space size $|\mathcal{S}|$ that can be calculated from the expression for $m$ permutations of $n$ values. If all objects are exchangeable, there exists an abstraction that is permutation invariant. Since the order does not matter, the size of this abstract state $|\hat{\mathcal{S}}|$ can then be calculated from the expression for $m$ combinations of $n$ values.

$$|\mathcal{S}| = \frac{n!}{(n-m)!}, \qquad |\hat{\mathcal{S}}| = \frac{n!}{m!\,(n-m)!}$$

Using a permutation invariant representation reduces the input space by a factor of $m!$ compared to the ordered set representation, since $\frac{|\hat{\mathcal{S}}|}{|\mathcal{S}|}=\frac{1}{m!}$.
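The two counts can be checked numerically. The sketch below uses the permutation and combination expressions named above and verifies that the ordered count is $m!$ times the unordered count. The values `n = 10` and `m = 3` are illustrative and not taken from the paper.

```python
import math
from itertools import combinations, permutations


def ordered_state_count(n, m):
    """Number of ordered arrangements: n! / (n - m)!."""
    return math.perm(n, m)


def unordered_state_count(n, m):
    """Number of unordered sets: n! / (m! (n - m)!)."""
    return math.comb(n, m)


n, m = 10, 3  # illustrative values, not taken from the paper
ordered = ordered_state_count(n, m)
unordered = unordered_state_count(n, m)
assert ordered == unordered * math.factorial(m)

# Brute-force check by enumeration on the same small case.
assert ordered == sum(1 for _ in permutations(range(n), m))
assert unordered == sum(1 for _ in combinations(range(n), m))
print(ordered, unordered, ordered // unordered)
```

For these values the script prints `720 120 6`, so the invariant representation is $3! = 6$ times smaller.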

It can be shown that a mapping $f$ is permutation invariant on all countable sets $\mathcal{X}$ if and only if it can be decomposed, using vector-valued transformations $\phi$ and $\rho$, into the form Zaheer et al. (2017):

$$f(\mathcal{X}) = \rho\Big(\sum_{x\in\mathcal{X}}\phi(x)\Big)$$

It can be shown that the proposed attention mechanism can be decomposed into the above form, which proves that it is permutation invariant. For problems with multiple classes of exchangeable objects, a separate attention mechanism can be deployed for each class.
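One possible decomposition is sketched below. It is not necessarily the authors' derivation, and it assumes the attention output is exactly the softmax-weighted sum described in this section. The shorthand $g=\pi_{\text{filter}}$ and $h=\pi_{\text{inputs}}$ is introduced here for readability.

```latex
\begin{align}
s^{*} &= \sum_{j=1}^{m}
  \frac{\exp\big(g(s^{(j)})\big)}{\sum_{l=1}^{m}\exp\big(g(s^{(l)})\big)}\,
  h\big(s^{(j)}\big) \\
\phi(s) &= \begin{bmatrix} \exp\big(g(s)\big)\,h(s) \\ \exp\big(g(s)\big) \end{bmatrix}
  \in \mathbb{R}^{k+1},
\qquad
\rho\left(\begin{bmatrix} u \\ v \end{bmatrix}\right) = \frac{u}{v},
  \quad u\in\mathbb{R}^{k},\; v>0 \\
\rho\Big(\sum_{j=1}^{m}\phi\big(s^{(j)}\big)\Big)
  &= \frac{\sum_{j=1}^{m}\exp\big(g(s^{(j)})\big)\,h\big(s^{(j)}\big)}
          {\sum_{l=1}^{m}\exp\big(g(s^{(l)})\big)} = s^{*}
\end{align}
```

The extra coordinate of $\phi$ carries the softmax normalizer through the sum, so $\rho$ only needs to divide the first $k$ coordinates by the last one.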

4. Experiments and Results

We conducted a series of experiments to validate the effectiveness of our proposed abstraction. The first two tasks are simple MDPs in which a scavenger agent navigates a continuous two-dimensional world to find food particles.

In the scavenger tasks, the state space contains vectors $s\in\mathbb{R}^{2m+2}$ , where $m$ is the number of target objects. The vector contains the relative position of each food particle as well as the ego position of the agent. The agent receives a reward of $+1.0$ when reaching a food particle, and a reward of $-0.05$ for every time-step otherwise. The episode terminates upon reaching a food particle or when the number of time-steps exceeds a limit. Scavenger Task 2 introduced poison particles in addition to the food particles (one poison for each food particle). If an agent reaches a poison particle, a reward of $-1.0$ is given and the episode terminates.
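The reward structure of the two scavenger tasks can be summarized in a small function. This is a sketch under stated assumptions rather than the authors' environment code: the contact radius, the step limit, and the choice to check poison before food are not given in the text and are placeholders.

```python
import math

FOOD_REWARD = 1.0      # from the task description
POISON_REWARD = -1.0   # Scavenger Task 2 only
STEP_REWARD = -0.05    # applied on every step without contact
CONTACT_RADIUS = 0.1   # assumption: the text does not state a radius
MAX_STEPS = 200        # assumption: the text does not state the limit


def reached(rel_pos, radius=CONTACT_RADIUS):
    """True when a particle at relative position (dx, dy) is within reach."""
    return math.hypot(rel_pos[0], rel_pos[1]) <= radius


def scavenger_step_outcome(food_rel, poison_rel, step):
    """Return (reward, done) for one time-step.

    food_rel and poison_rel are lists of (dx, dy) positions relative to the
    agent; poison_rel is empty for Scavenger Task 1.
    """
    # Assumption: poison takes precedence if both are reached at once.
    if any(reached(p) for p in poison_rel):
        return POISON_REWARD, True
    if any(reached(f) for f in food_rel):
        return FOOD_REWARD, True
    # No contact: pay the step penalty and stop only at the time limit.
    return STEP_REWARD, step >= MAX_STEPS
```

Note that the function never depends on the order of `food_rel` or `poison_rel`, which is the exchangeability property the attention mechanism is meant to exploit.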

The third task is a convoy protection task with variable numbers of objects. The task requires a defender agent to protect a convoy that follows a predetermined path through a 2D environment. Attackers are spawned at the periphery of the environment during the episode, and the defender must block their attempts to approach the convoy. The state space is the space of vectors representing the state of each non-ego object in the environment. The episode terminates when all convoy members either reach the goal position or are reached by an attacker.

For each task, we trained a set of policies with the attention mechanism as well as baseline policies that use a standard ordered set to represent the input space. Each policy was trained with Proximal Policy Optimization (PPO) Schulman et al. (2017), a policy-gradient algorithm.

For each scavenger task, we trained policies on tasks having one to five food particles. The baseline policies were unable to achieve optimal performance for tasks with more than two food particles in either scavenger task. The policy trained with our attention mechanism was able to learn an optimal policy for all cases with no increase in the number of required training samples. For the convoy task, the abstracted policy approached optimal behavior after approximately 2,500 epochs, whereas the baseline policy showed no improvement after 10,000 epochs, as shown in fig. 2.

These experiments demonstrate the effectiveness of the proposed approach in enhancing the scalability of the PPO policy gradient learning algorithm. Together, they validate our hypothesis that leveraging object exchangeability for input representation can improve the efficiency of deep reinforcement learning.

Bibliography

David Abel, D. Ellis Hershkowitz, and Michael L. Littman. 2016. Near Optimal Behavior via Approximate State Abstraction. In International Conference on Machine Learning (ICML).
Luis C. Cobo, Charles L. Isbell, and Andrea L. Thomaz. 2013. Object Focused Q-learning for Autonomous Agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
Carlos Diuk, Andre Cohen, and Michael L. Littman. 2008. An Object-oriented Representation for Efficient Reinforcement Learning. In International Conference on Machine Learning (ICML).
Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. 2016. Continuous Deep Q-learning with Model-based Acceleration. In International Conference on Machine Learning (ICML).
Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems (NIPS).
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. Nature 518 (2013), 529–533.