Deep Reinforcement Learning using Genetic Algorithm for Parameter Optimization

Adarsh Sehgal; Hung Manh La; Sushil J. Louis; Hai Nguyen

arXiv:1905.04100·cs.NE·May 13, 2019


TL;DR

This paper introduces a genetic algorithm-based method to optimize parameters in deep reinforcement learning, specifically for DDPG with HER, resulting in faster and better performance in robotic manipulation tasks.

Contribution

It presents a novel approach using genetic algorithms to optimize RL parameters, improving learning speed and effectiveness in complex robotic tasks.

Findings

1. Faster learning compared to original algorithms
2. Improved task performance in robotic manipulation
3. Effective parameter optimization using GA

Abstract

Reinforcement learning (RL) enables agents to make decisions based on a reward function. However, in the process of learning, the choice of values for learning algorithm parameters can significantly impact the overall learning process. In this paper, we use a genetic algorithm (GA) to find the values of parameters used in Deep Deterministic Policy Gradient (DDPG) combined with Hindsight Experience Replay (HER), to help speed up the learning agent. We used this method on the fetch-reach, slide, push, pick and place, and door opening robotic manipulation tasks. Our experimental evaluation shows that our method leads to better performance, faster than the original algorithm.

Tables

Table I: Original vs. optimal values of parameters

Parameter	Original	Optimal
$\gamma$	0.98	0.88
$\tau$	0.95	0.184
$\alpha_{actor}$	0.001	0.001
$\alpha_{critic}$	0.001	0.001
$\epsilon$	0.3	0.055
$\eta$	0.2	0.774

Equations

$$Q^{*}(s,a)=E_{s^{\prime}\sim p(\cdot|s,a)}\left[r(s,a)+\gamma\max_{a^{\prime}\in A}Q^{*}(s^{\prime},a^{\prime})\right]. \tag{1}$$

$$\theta^{Q^{\prime}}\leftarrow\tau\theta^{Q}+(1-\tau)\theta^{Q^{\prime}},\qquad\theta^{\mu^{\prime}}\leftarrow\tau\theta^{\mu}+(1-\tau)\theta^{\mu^{\prime}}. \tag{2}$$

$$y_{i}=r_{i}+\gamma Q^{\prime}(s_{i+1},\mu^{\prime}(s_{i+1}|\theta^{\mu^{\prime}})|\theta^{Q^{\prime}}), \tag{3}$$

$$Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha\left[r_{t+1}+\gamma Q(s_{t+1},a_{t+1})-Q(s_{t},a_{t})\right]. \tag{4}$$

$$a_{t}=\begin{cases}a_{t}^{*}&\text{with probability }1-\epsilon,\\ \text{random action}&\text{with probability }\epsilon.\end{cases} \tag{5}$$


Taxonomy

Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Robotic Path Planning Algorithms

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Experience Replay

Full text

Deep Reinforcement Learning using Genetic Algorithm for Parameter Optimization

Adarsh Sehgal, Hung Manh La, Sushil J. Louis, Hai Nguyen

Adarsh Sehgal, Hai Nguyen and Dr. Hung La are with the Advanced Robotics and Automation (ARA) Laboratory. Dr. Sushil Louis is a professor in the Department of Computer Science and Engineering, University of Nevada, Reno, NV 89557, USA. Corresponding author: Hung La, email: [email protected]

This material is based upon work supported by the National Aeronautics and Space Administration (NASA) Grant No. NNX15AI02H issued through the NVSGC-RI program under sub-award No. 19-21, and the RID program under sub-award No. 19-29, and the NVSGC-CD program under sub-award No. 18-54. This work is also partially supported by the Office of Naval Research under Grant N00014-17-1-2558.


I INTRODUCTION

Q-learning methods have been applied to a variety of tasks by autonomous robots [1], and much research has been done in this field over many years [2], with some work specific to continuous action spaces [3, 4, 5, 6] and other work on discrete action spaces [7]. Reinforcement Learning (RL) has been applied to locomotion [8] [9] and also to manipulation [10, 11].

Much work specific to robotic manipulators also exists [12, 13]. Some of this work used fuzzy wavelet networks [14], others used neural networks to accomplish their tasks [15] [16]. Off-policy algorithms such as the Deep Deterministic Policy Gradient algorithm (DDPG) [17] and Normalized Advantage Function algorithm (NAF) [18] are helpful for real robot systems. A complete review of recent deep reinforcement learning methods for robot manipulation is given in [19]. We are specifically using DDPG combined with Hindsight Experience Replay (HER) [20] for our experiments. Recent work on using experience ranking to improve the learning speed of DDPG + HER was reported in [21].

The main contribution of this paper is a demonstration of better final performance at several manipulation tasks using a Genetic Algorithm (GA) to find DDPG and HER parameter values that lead more quickly to better performance at these tasks. Our experiments revealed that learning algorithm parameters are non-linearly related to task performance and learning speed: the success rate can vary significantly with the values of the parameters used in RL. In the following sections, we describe the manipulation tasks, the DDPG + HER algorithms, and the parameters that affect performance for these algorithms. Initial experimental results showing performance and speed gains then provide evidence that GAs find good parameter values leading to better task performance, faster.

The paper is organized as follows: Section II presents related work. Section III provides background on RL, DQN, DDPG, HER, and GAs. Section IV describes the GA used to find the parameter values for DDPG + HER. Section V describes our learning tasks, experiments, and experimental results. The last section provides conclusions and possible future research.

II RELATED WORK

RL has been widely used in training/teaching both a single robot [22, 23] and a multi-robot system [24, 25, 26, 27, 28]. Previous work has also been done on both model-based and model-free learning algorithms. Applying model-based learning algorithms to real-world scenarios relies significantly on a model-based teacher to train deep network policies.

Similarly, there is also much work on GAs [29] [30] and the GA operators of crossover and mutation [31], applied to a variety of problems. GAs have also been applied specifically to a variety of RL problems [32, 33, 34, 31].

In this paper, we use model-free RL with continuous action spaces and deep neural networks. Our work builds on existing work using the same techniques applied to robotic manipulators [17] [20]. Specifically, we use a GA to search for good DDPG + HER algorithm parameters and compare the resulting success rates with those obtained using the original parameter values [35]. DDPG + HER, an RL algorithm using deep neural networks in continuous action spaces, has been successfully used for robotic manipulation tasks, and our GA improves on this work by finding learning algorithm parameters that need fewer epochs (one epoch is a single pass through the full training set) to learn better task performance.

III BACKGROUND

III-A Reinforcement Learning

Consider a standard RL setup consisting of a learning agent, which interacts with an environment. An environment can be described by a set of variables where $S$ is the set of states, $A$ is the set of actions, $p(s_{0})$ is a distribution of initial states, $r:S\times A\rightarrow R$ is a reward function, $p(s_{t+1}|s_{t},a_{t})$ are transition probabilities, and $\gamma\in[0,1]$ is a discount factor.

A deterministic policy maps from states to actions: $\pi:S\rightarrow A$. The beginning of every episode is marked by sampling an initial state $s_{0}$. At each timestep $t$, the agent performs an action based on the current state: $a_{t}=\pi(s_{t})$. The performed action receives a reward $r_{t}=r(s_{t},a_{t})$, and the environment's new state is sampled from the distribution $p(\cdot|s_{t},a_{t})$. The total return is $R_{t}=\sum_{i=t}^{\infty}\gamma^{i-t}r_{i}$. The agent's goal is to maximize its expected return $E[R_{t}|s_{t},a_{t}]$, and an optimal policy $\pi^{*}$ can be defined as any policy such that $Q^{\pi^{*}}(s,a)\geq Q^{\pi}(s,a)$ for every $s\in S$, $a\in A$ and any policy $\pi$. All optimal policies share the same Q-function, called the optimal Q-function $Q^{*}$, which satisfies the Bellman equation:

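$$Q^{*}(s,a)=E_{s^{\prime}\sim p(\cdot|s,a)}\left[r(s,a)+\gamma\max_{a^{\prime}\in A}Q^{*}(s^{\prime},a^{\prime})\right]. \tag{1}$$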
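To make the role of the discount factor $\gamma$ concrete, the following is a minimal Python sketch of the return $R_{t}$ for a finite episode. The function name `discounted_return` and the finite-horizon truncation are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma, t=0):
    """Return R_t = sum over i >= t of gamma**(i - t) * r_i for a finite reward list.

    Hypothetical helper for illustration; the paper defines the infinite-horizon sum.
    """
    total = 0.0
    # Accumulate from the last reward backwards: R_i = r_i + gamma * R_{i+1}.
    for reward in reversed(rewards[t:]):
        total = reward + gamma * total
    return total


# Sparse-reward style episode: -1 at every step until the goal is reached.
episode_rewards = [-1.0, -1.0, -1.0, 0.0]
return_original = discounted_return(episode_rewards, gamma=0.98)
return_ga_found = discounted_return(episode_rewards, gamma=0.88)
```

A smaller $\gamma$ shrinks the contribution of distant rewards, which is one way the parameter can change what the agent learns to prioritize.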

III-B Deep Q-Networks (DQN)

A Deep Q-Network (DQN) [36] is a model-free reinforcement learner designed for discrete action spaces. In a DQN, a neural network $Q$ is maintained, which approximates $Q^{*}$. $\pi_{Q}(s)=\arg\max_{a\in A}Q(s,a)$ denotes the greedy policy w.r.t. $Q$. An $\epsilon$-greedy policy takes a random action with probability $\epsilon$ and the action $\pi_{Q}(s)$ with probability $1-\epsilon$.

Episodes are generated during training using an $\epsilon$-greedy policy. A replay buffer stores the transition tuples $(s_{t},a_{t},r_{t},s_{t+1})$ experienced during training. Neural network training is interleaved with the generation of new episodes. The network is trained by minimizing the loss $\mathcal{L}=E(Q(s_{t},a_{t})-y_{t})^{2}$, where $y_{t}=r_{t}+\gamma\max_{a^{\prime}\in A}Q(s_{t+1},a^{\prime})$ and the tuples $(s_{t},a_{t},r_{t},s_{t+1})$ are sampled from the replay buffer.
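The $\epsilon$-greedy policy and the DQN loss above can be sketched with a small lookup table standing in for the neural network $Q$. This is only an illustration under that assumption; the names `epsilon_greedy`, `td_target`, and `squared_td_loss` are hypothetical.

```python
import random


def greedy_action(q_values):
    """Index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def td_target(reward, next_q_values, gamma):
    """y_t = r_t + gamma * max over a' of Q(s_{t+1}, a')."""
    return reward + gamma * max(next_q_values)


def squared_td_loss(batch, q, gamma):
    """Mean of (Q(s_t, a_t) - y_t)**2 over sampled (s, a, r, s_next) tuples.

    q is a dict mapping each state to a list of action values (a stand-in
    for the network; an assumption made for this sketch).
    """
    errors = [(q[s][a] - td_target(r, q[s_next], gamma)) ** 2
              for s, a, r, s_next in batch]
    return sum(errors) / len(errors)
```

In a real DQN the table lookup is replaced by a forward pass and the loss is minimized by gradient descent on the network weights.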

The targets $y_{t}$ are computed using a target network, which changes at a slower pace than the main network. The weights of the target network can be set periodically to the current weights of the main network [36]. Polyak-averaged parameters [37] can also be used.

III-C Deep Deterministic Policy Gradients (DDPG)

In Deep Deterministic Policy Gradients (DDPG), there are two neural networks: an Actor and a Critic. The actor neural network is a target policy $\pi:S\rightarrow A$, and the critic neural network is an action-value function approximator $Q:S\times A\rightarrow R$. The critic network $Q(s,a|\theta^{Q})$ and actor network $\mu(s|\theta^{\mu})$ are randomly initialized with weights $\theta^{Q}$ and $\theta^{\mu}$.

Episodes are generated using a behavioral policy, which is a noisy variant of the target policy, $\pi_{b}(s)=\pi(s)+\mathcal{N}(0,1)$. The critic neural network is trained like the Q-function in DQN, except that the target $y_{t}$ is computed as $y_{t}=r_{t}+\gamma Q(s_{t+1},\pi(s_{t+1}))$, where $\gamma$ is the discounting factor. The loss $\mathcal{L}_{a}=-E_{s}Q(s,\pi(s))$ is used to train the actor network.
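As a rough illustration of the two DDPG training signals described above, the sketch below computes the critic target, the critic loss, and the actor loss with plain Python callables standing in for the actor and critic networks. All function names are hypothetical, and no gradient step is shown.

```python
def critic_target(reward, next_state, actor, critic, gamma):
    """y_t = r_t + gamma * Q(s_{t+1}, pi(s_{t+1}))."""
    return reward + gamma * critic(next_state, actor(next_state))


def critic_loss(batch, actor, critic, gamma):
    """Mean squared error between Q(s_t, a_t) and the target y_t."""
    errors = [(critic(s, a) - critic_target(r, s_next, actor, critic, gamma)) ** 2
              for s, a, r, s_next in batch]
    return sum(errors) / len(errors)


def actor_loss(states, actor, critic):
    """L_a = -E[Q(s, pi(s))]: the actor is pushed toward actions the critic rates highly."""
    return -sum(critic(s, actor(s)) for s in states) / len(states)


# Toy stand-ins for the networks (assumptions for illustration only).
toy_actor = lambda s: 0.5 * s
toy_critic = lambda s, a: -(a - s) ** 2
```

With real networks, the critic loss is minimized with learning rate $\alpha_{critic}$ and the actor loss with learning rate $\alpha_{actor}$.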

III-D Hindsight Experience Replay (HER)

Hindsight Experience Replay (HER) tries to mimic the human ability to learn from failures. The agent learns from all episodes, even when it does not reach the original goal. Whatever state the agent reaches, HER considers that as the modified goal. Standard experience replay only stores the transition $(s_{t}||g,a_{t},r_{t},s_{t+1}||g)$ with the original goal $g$. HER additionally stores the transition $(s_{t}||g^{\prime},a_{t},r^{\prime}_{t},s_{t+1}||g^{\prime})$ with the modified goal $g^{\prime}$. HER works well with extremely sparse rewards and performs significantly better with sparse rewards than with shaped ones.
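The relabeling idea can be sketched as follows, assuming the final state reached in an episode is used as the modified goal $g^{\prime}$ and that the reward is 0 when the goal is reached within a tolerance and -1 otherwise. The tolerance value, the reward convention, and the function names are assumptions made for this illustration.

```python
def sparse_reward(achieved, goal, tol=0.05):
    """0 if the achieved position is within tol of the goal, else -1 (assumed convention)."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
    return 0.0 if dist <= tol else -1.0


def her_relabel(episode, goal):
    """Store each transition twice: once with the original goal g and once
    with the modified goal g' (the final state actually reached).

    episode is a list of (state, action, next_state) with states as tuples.
    Returns a list of (state, goal, action, reward, next_state).
    """
    modified_goal = episode[-1][2]
    transitions = []
    for state, action, next_state in episode:
        # Standard replay entry with the original goal.
        transitions.append(
            (state, goal, action, sparse_reward(next_state, goal), next_state))
        # Hindsight entry: pretend the reached state was the goal all along.
        transitions.append(
            (state, modified_goal, action,
             sparse_reward(next_state, modified_goal), next_state))
    return transitions
```

Even when the original goal is never reached, the hindsight entries contain at least one transition with a non-negative reward, which gives the critic a useful learning signal.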

III-E Genetic Algorithm (GA)

Genetic Algorithms (GAs) [29, 38, 39] were designed to search poorly-understood spaces, where exhaustive search may not be feasible, and where other search approaches perform poorly. When used as function optimizers, GAs try to maximize a fitness tied to the optimization objective. Evolutionary computing algorithms in general and GAs specifically have had much empirical success on a variety of difficult design and optimization problems. They start with a randomly initialized population of candidate solutions, each typically encoded in a string (chromosome). A selection operator focuses search on promising areas of the search space while crossover and mutation operators generate new candidate solutions. We explain our specific GA in the next section.

IV DDPG + HER and GA

In this section, we present the primary contribution of our paper: The genetic algorithm searches through the space of parameter values used in DDPG + HER for values that maximize task performance and minimize the number of training epochs. We target the following parameters: discounting factor $\gamma$ ; polyak-averaging coefficient $\tau$ [37]; learning rate for critic network $\alpha_{critic}$ ; learning rate for actor network $\alpha_{actor}$ ; percent of times a random action is taken $\epsilon$ ; and standard deviation of Gaussian noise added to not completely random actions as a percentage of maximum absolute value of actions on different coordinates $\eta$ . The range of all the parameters is 0-1, which can be justified using the equations following in this section.

Our experiments show that adjusting the values of parameters did not increase or decrease the agent’s learning in a linear or easily discernible pattern. So, a simple hill climber will probably not do well in finding optimized parameters. Since GAs were designed for such poorly understood problems, we use our GA to optimize these parameter values.

Specifically, we use the polyak-averaging coefficient $\tau$ to illustrate how non-linearly performance depends on a parameter value. $\tau$ is used in the algorithm as shown in Equation (2):

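$$\theta^{Q^{\prime}}\leftarrow\tau\theta^{Q}+(1-\tau)\theta^{Q^{\prime}},\qquad\theta^{\mu^{\prime}}\leftarrow\tau\theta^{\mu}+(1-\tau)\theta^{\mu^{\prime}}. \tag{2}$$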
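A minimal sketch of the Polyak-averaged update in Equation (2), with network weights represented as flat Python lists. The function name `soft_update` is hypothetical, and the sketch follows the equation as written in this paper, where $\tau$ weights the main network; some implementations apply the coefficient to the target weights instead.

```python
def soft_update(target_weights, main_weights, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * main + (1.0 - tau) * target
            for target, main in zip(target_weights, main_weights)]


# The same rule is applied to both the critic target and the actor target.
critic_target_weights = soft_update([0.0, 2.0], [1.0, 4.0], tau=0.25)
```

In this form, $\tau=1$ copies the main network into the target and $\tau=0$ leaves the target unchanged, so intermediate values control how quickly the target tracks the main network.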

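Equation (3) shows how $\gamma$ is used in the DDPG + HER algorithm, while Equation (4) describes the Q-Learning update. $\alpha$ denotes the learning rate. Networks are trained based on this update equation.

$$y_{i}=r_{i}+\gamma Q^{\prime}(s_{i+1},\mu^{\prime}(s_{i+1}|\theta^{\mu^{\prime}})|\theta^{Q^{\prime}}), \tag{3}$$

$$Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha\left[r_{t+1}+\gamma Q(s_{t+1},a_{t+1})-Q(s_{t},a_{t})\right]. \tag{4}$$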
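The update in Equation (4) can be illustrated on a small table of Q-values. This is a sketch for intuition only: the dictionary representation and the name `td_update` are assumptions, and the actual algorithm updates neural network weights rather than table entries.

```python
def td_update(q, s, a, r_next, s_next, a_next, alpha, gamma):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)] in place.

    q maps (state, action) pairs to values. Returns the temporal-difference error.
    """
    td_error = r_next + gamma * q[(s_next, a_next)] - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error


q_table = {("s0", "a0"): 1.0, ("s1", "a1"): 2.0}
error = td_update(q_table, "s0", "a0", 1.0, "s1", "a1", alpha=0.5, gamma=0.5)
```

The learning rate $\alpha$ scales how far each estimate moves toward its target, which is why both $\alpha$ and $\gamma$ must lie in the range 0-1.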

Since we have two kinds of networks, we will need two learning rates, one for the actor network ( $\alpha_{actor}$ ), another for the critic network ( $\alpha_{critic}$ ). Equation (5) explains the use of percent of times that a random action is taken, $\epsilon$ .

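$$a_{t}=\begin{cases}a_{t}^{*}&\text{with probability }1-\epsilon,\\ \text{random action}&\text{with probability }\epsilon.\end{cases} \tag{5}$$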
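One way to picture how $\epsilon$ and $\eta$ act together during exploration is the sketch below. It assumes actions are bounded in $[-max\_u, max\_u]$ per coordinate, that random actions are drawn uniformly, and that noisy actions are clipped to the bounds; these details and the name `explore` are assumptions for illustration, not a description of the released code.

```python
import random


def explore(policy_action, epsilon, eta, max_u=1.0, rng=random):
    """Pick the action to execute, following the structure of Equation (5).

    With probability epsilon a uniformly random action is taken. Otherwise
    Gaussian noise with standard deviation eta * max_u is added to the
    policy action and the result is clipped to [-max_u, max_u].
    """
    if rng.random() < epsilon:
        return [rng.uniform(-max_u, max_u) for _ in policy_action]
    noisy = [a + rng.gauss(0.0, eta * max_u) for a in policy_action]
    return [min(max_u, max(-max_u, a)) for a in noisy]
```

With `epsilon=0.0` and `eta=0.0` the policy action is returned unchanged, so the two parameters together determine how much the agent deviates from its current policy.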

Figure 1 shows that the agent's learning changes when the value of $\tau$ is modified, further emphasizing the need to use a GA. The original (untuned) value of $\tau$ in DDPG was set to 0.95, and we are using 4 CPUs. All values of $\tau$ are considered up to two decimal places in order to see how the success rate changes with the value of the parameter. The plots clearly show considerable scope for improvement over the original success rate.

Algorithm 1 explains the integration of DDPG + HER with a GA, which uses a population size of 30 over 30 generations. We use ranking selection [40] to select parents: parents are chosen probabilistically based on rank, which is in turn decided by relative fitness (performance). Children are then generated using uniform crossover [41]. We also use flip mutation [39] with a mutation probability of 0.1. We use a binary encoding for each parameter and concatenate the bits to form a chromosome for the GA. The six parameters are arranged in the order: polyak-averaging coefficient; discounting factor; learning rate for critic network; learning rate for actor network; percent of times a random action is taken; and standard deviation of Gaussian noise added to not completely random actions as a percentage of maximum absolute value of actions on different coordinates. Since each parameter requires 11 bits to be represented to three decimal places, we need 66 bits for 6 parameters. These string chromosomes then enable domain-independent crossover and mutation string operators to generate new parameter values. We consider parameter values up to three decimal places because small changes in parameter values cause considerable changes in success rate; a step size of 0.001 is therefore considered the best fit for our problem.
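The chromosome layout and the string operators described above can be sketched as follows. The paper does not spell out how an 11-bit gene maps to a value in 0-1, so the decoding used here (integer value capped at 1000, then divided by 1000) is an assumption, as are all function names. The parameter keys follow the code names listed in the appendix.

```python
import random

BITS_PER_PARAM = 11
# Order given in the text: polyak, discount, critic LR, actor LR, random eps, noise eps.
PARAM_ORDER = ["polyak", "gamma", "Q_lr", "pi_lr", "random_eps", "noise_eps"]


def encode(params):
    """Concatenate one 11-bit gene per parameter into a 66-bit chromosome."""
    bits = []
    for name in PARAM_ORDER:
        value = round(params[name] * 1000)  # three decimal places
        bits.extend(int(b) for b in format(value, "011b"))
    return bits


def decode(chromosome):
    """Map each 11-bit gene back to a value in [0, 1] (assumed capping rule)."""
    params = {}
    for i, name in enumerate(PARAM_ORDER):
        gene = chromosome[i * BITS_PER_PARAM:(i + 1) * BITS_PER_PARAM]
        value = int("".join(str(b) for b in gene), 2)
        params[name] = min(value, 1000) / 1000.0
    return params


def uniform_crossover(parent1, parent2, rng):
    """Swap each bit position between the two children with probability 0.5."""
    child1, child2 = [], []
    for a, b in zip(parent1, parent2):
        if rng.random() < 0.5:
            a, b = b, a
        child1.append(a)
        child2.append(b)
    return child1, child2


def flip_mutation(chromosome, rate, rng):
    """Flip each bit independently with the given mutation probability."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]
```

Because the operators act on bits rather than on parameter values, the same crossover and mutation code works regardless of what each gene represents.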

The fitness for each chromosome (set of parameter values) is defined as the inverse of the number of epochs it takes for the learning agent to reach close to the maximum success rate ( $\geq 0.85$ ) for the very first time. Fitness is the inverse of the number of epochs because a GA always maximizes the objective function, and this converts our minimization of the number of epochs into a maximization problem. Since each fitness evaluation takes significant time, an exhaustive search of the $2^{66}$ size search space is not possible, and we thus use GA search.
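The fitness definition and rank-based parent selection can be sketched as below. Assigning zero fitness to a run that never reaches the threshold and using linear rank weights are assumptions made for this illustration; the paper specifies only that selection is probabilistic and based on rank. The function names are hypothetical.

```python
import random


def fitness(success_rates, threshold=0.85):
    """Inverse of the first (1-based) epoch whose success rate reaches the threshold.

    Returns 0.0 if the threshold is never reached (assumed convention).
    """
    for epoch, rate in enumerate(success_rates, start=1):
        if rate >= threshold:
            return 1.0 / epoch
    return 0.0


def rank_select(population, fitnesses, rng=random):
    """Pick one parent with probability proportional to its fitness rank.

    The worst individual gets rank 1 and the best gets rank len(population).
    """
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    weights = [0] * len(population)
    for rank, index in enumerate(order, start=1):
        weights[index] = rank
    return rng.choices(population, weights=weights, k=1)[0]
```

Selecting on rank rather than raw fitness keeps selection pressure steady even when a few chromosomes have much larger fitness values than the rest.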

V EXPERIMENT and RESULTS

Figure 4 shows the environments used to test robot learning on five different tasks: FetchPick&Place-v1, FetchPush-v1, FetchReach-v1, FetchSlide-v1, and DoorOpening. We ran the GA separately on these environments to check the effectiveness of our algorithm and compared performance with the original values of the parameters. Figure 2 (a) shows the result of our experiment with FetchPush-v1, while Figure 3 (a) shows the results with FetchSlide-v1. We let the system run with the GA to find the optimal parameters $\tau$ and $\gamma$. Since the GA is probabilistic, we show results from 10 runs of the GA, and the results show that the optimized parameters found by the GA can lead to better performance. The learning agent learns faster and reaches the maximum success rate sooner. In Figure 2 (b), we show one learning run for the original parameter set and the average learning over these 10 different runs of the GA.

Figure 3 (b) compares one run with the original parameters against the average of 2 runs with optimized $\tau$ and $\gamma$. For this task, we performed only 2 runs because each run can take a few hours. The results in Figures 2 and 3 reflect optimizing only two parameters, as we were testing and debugging the genetic algorithm, but they show the potential for performance improvement. Our results from optimizing all six parameters justify this optimism and are described next.

The GA was then run to optimize all parameters, and these results were plotted in Figure 4 for all the tasks. Table I compares the GA-found parameters with the original parameters used in the RL algorithm. Though the learning rates $\alpha_{actor}$ and $\alpha_{critic}$ are the same as their original values, the other four parameters have different values than the original. The plots in Figure 4 show that the GA-found parameters outperformed the original parameters, indicating that the learning agent was able to learn faster. All the plots in this figure are averaged over 10 runs.

VI DISCUSSION and FUTURE WORK

In this paper, we showed initial results demonstrating that a genetic algorithm can tune reinforcement learning algorithm parameters to achieve better performance, faster, at five manipulation tasks. We discussed existing work in reinforcement learning in robotics, presented an algorithm that integrates DDPG + HER with a GA to optimize the number of epochs required to achieve maximal performance, and explained why a GA might be suitable for such optimization. Initial results bore out the assumption that GAs are a good fit for such parameter optimization, and our results on the five manipulation tasks show that the GA can find parameter values that lead to faster learning and better (or equal) performance at our chosen tasks. We thus provide further evidence that heuristic search, as performed by genetic and other similar evolutionary computing algorithms, is a viable computational tool for optimizing reinforcement learning performance in multiple domains.

APPENDIX

We have the code for this paper on GitHub: https://github.com/aralab-unr/ReinforcementLearningWithGA. The parameters used in this paper can be found in the baselines.her.experiment.config module. The parameters (discounting factor; polyak-averaging coefficient; learning rate for critic network; learning rate for actor network; percent of times a random action is taken; and standard deviation of Gaussian noise added to not completely random actions as a percentage of maximum absolute value of actions on different coordinates) correspond to gamma, polyak, Q_lr, pi_lr, random_eps, and noise_eps, respectively, in the code.
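For convenience, the values from Table I can be written against the code names listed above. The sketch below uses plain dictionaries and a hypothetical helper `changed_parameters`; it does not import or reproduce the actual configuration module, whose structure should be checked in the repository.

```python
# Values from Table I, keyed by the parameter names used in the code.
ORIGINAL = {"gamma": 0.98, "polyak": 0.95, "Q_lr": 0.001,
            "pi_lr": 0.001, "random_eps": 0.3, "noise_eps": 0.2}
GA_FOUND = {"gamma": 0.88, "polyak": 0.184, "Q_lr": 0.001,
            "pi_lr": 0.001, "random_eps": 0.055, "noise_eps": 0.774}


def changed_parameters(original, tuned):
    """Return {name: (original_value, tuned_value)} for parameters that differ."""
    return {name: (original[name], tuned[name])
            for name in original if original[name] != tuned[name]}


differences = changed_parameters(ORIGINAL, GA_FOUND)
```

Here `differences` contains the four parameters the GA changed, while the two learning rates are left out because their values match the originals.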

Bibliography


[1] H. M. La, R. Lim, and W. Sheng, “Multirobot cooperative learning for predator avoidance,” IEEE Transactions on Control Systems Technology, vol. 23, no. 1, pp. 52–63, Jan 2015.
[2] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[3] C. Gaskett, D. Wettergreen, and A. Zelinsky, “Q-learning in continuous state and action spaces,” in Australasian Joint Conference on Artificial Intelligence. Springer, 1999, pp. 417–428.
[4] K. Doya, “Reinforcement learning in continuous time and space,” Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
[5] H. V. Hasselt and M. A. Wiering, “Reinforcement learning in continuous action spaces,” 2007.
[6] L. C. Baird, “Reinforcement learning in continuous time: Advantage updating,” in Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference on, vol. 4. IEEE, 1994, pp. 2448–2453.
[7] Q. Wei, F. L. Lewis, Q. Sun, P. Yan, and R. Song, “Discrete-time deterministic Q-learning: A novel convergence analysis,” IEEE Transactions on Cybernetics, vol. 47, no. 5, pp. 1224–1237, 2017.
[8] N. Kohl and P. Stone, “Policy gradient reinforcement learning for fast quadrupedal locomotion,” in Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, vol. 3. IEEE, 2004, pp. 2619–2624.