Robotic Navigation using Entropy-Based Exploration
Muhammad Usama, Dong Eui Chang

TL;DR
This paper explores entropy-based exploration in reinforcement learning for robotic navigation, demonstrating its effectiveness over epsilon-greedy strategies in simulation and real-world tests on a TurtleBot3.
Contribution
It introduces and evaluates entropy-based exploration as a novel strategy for improving reinforcement learning in robotic navigation tasks.
Findings
EBE outperforms epsilon-greedy in simulation
Policies trained with EBE generalize better to new environments
Real-world tests confirm the effectiveness of EBE without fine-tuning
Table 1: Correspondence between actions and the angular velocity of the robot.
| Action | Angular Velocity of Robot (rad/s) |
|---|---|
| 0 | -1.5 |
| 1 | -0.75 |
| 2 | 0.0 |
| 3 | 0.75 |
| 4 | 1.5 |
Table 2: Architecture of the deep neural network.
| Layer | Input size | No. of neurons | Activation | Output size |
|---|---|---|---|---|
| FC 1 | | | ReLU | |
| FC 2 | | | ReLU | |
| Dropout | | | - | |
| FC 3 (output layer) | | | linear | |
Table 3: Hyperparameters used in the training process.
| Hyperparameter | Value |
|---|---|
| maximum episodes | 500 |
| maximum episode steps | 500 |
| discount factor | 0.9 |
| learning rate | 0.00025 |
| target network update frequency (in steps) | 2000 |
| batch size | 64 |
| replay memory size | |
Table 4: Test scores in environments with no obstacles.
| Environment | EBE | $\epsilon$-greedy |
|---|---|---|
| 4x4 Square | 3731.92 ± 368.35 | 2202.93 ± 425.89 |
| 8x8 Square | 2638.52 ± 319.07 | 1662.57 ± 402.94 |
| 2x2 Square | 4369.64 ± 935.56 | 5556.55 ± 377.02 |
| Triangle | 3655.37 ± 915.96 | 1678.71 ± 1257.50 |
| Pentagon | 2794.04 ± 544.07 | 2065.00 ± 882.77 |
Table 5: Test scores in environments having obstacles.
| Environment | EBE | $\epsilon$-greedy |
|---|---|---|
| 4x4 Square | 1125.45 ± 775.82 | 450.19 ± 760.14 |
| Pentagon | 890.99 ± 641.19 | 432.28 ± 210.37 |
School of Electrical Engineering, KAIST,
Daejeon, 34141, Republic of Korea
{usama, dechang}@kaist.ac.kr
∗ Corresponding author: Dong Eui Chang
Abstract
Robotic navigation is the task in which a robot must find a safe and feasible path and traverse between two points in a complex environment. We approach the problem of robotic navigation using reinforcement learning and use deep Q-networks to train agents to solve the task. We compare Entropy-Based Exploration (EBE) with the widely used $\epsilon$-greedy exploration strategy by training agents with both strategies in simulation. The trained agents are then tested on different versions of the environment to assess the generalization ability of the learnt policies. We also deploy the learnt policies on a real robot in a complex real environment without any fine-tuning and compare the effectiveness of the two exploration strategies in the real-world setting. A video showing experiments on the TurtleBot3 platform is available at https://youtu.be/NHT-EiN_4n8.
keywords:
Deep Learning, Deep Neural Networks, Reinforcement Learning, Exploration, Entropy-Based Exploration, EBE.
1 Introduction
The presence of various kinds of mobile robots, such as walkers and manipulators, is rapidly increasing in the industrial and service sectors. Mobile robots have the advantages of simple manufacturing and mobility in complex environments. Owing to the growing use of robots in real-world environments, the problem of autonomous robotic navigation has attracted increasing interest from the research community in recent years. Navigation can be roughly described as the task of finding a feasible path between two points in the surrounding environment [1]. In robotic navigation, a robot is required to find a safe, collision-free path from its current location to some goal location in an unknown, and sometimes dynamic, environment. Since the robot's surroundings may contain several static and dynamic obstacles, it is important for the robot to actively seek its goal location while safely avoiding obstacles and any potentially dangerous or undesirable objects. Solving the complex problem of autonomous robot navigation involves dealing with issues of varied nature, such as acquisition and processing of sensory data, decision making, trajectory generation and trajectory tracking, among others.
In this paper, we present a solution to the autonomous robotic navigation problem using deep reinforcement learning. We use deep Q-learning to train agents for this task. Moreover, we adopt entropy-based exploration (EBE) [2], an exploration strategy that is able to explore the state space effectively and efficiently, resulting in better learning than the well-known $\epsilon$-greedy exploration heuristic that is widely used in the robotics community. We carry out experiments under diverse conditions to compare the two exploration strategies.
2 Preliminaries
2.1 Reinforcement Learning
Reinforcement learning is a sequential decision making process in which an agent interacts with an environment over discrete time steps; see [3] for an introduction. While in state $s_t$ at time step $t$, the agent chooses an action $a_t$ from a discrete set of possible actions, i.e. $a_t \in \mathcal{A}$, following a policy $\pi(s)$ and gets feedback in the form of a scalar reward $r_t$ given by a scalar reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. As a result, the environment transitions into the next state $s_{t+1}$ according to the transition probability distribution $\mathcal{P}$. We denote by $\gamma \in (0, 1)$ the discount factor and by $\rho_0$ the initial state distribution.
The goal of any RL algorithm is to maximize the expected discounted return $R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$ over a policy $\pi$. The policy $\pi$ gives a distribution over actions in a state.
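Following a stochastic policy $\pi$, the state dependent action value function and the state value function are defined as
$$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a,\ \pi \right]$$
and
$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\left[ Q^{\pi}(s, a) \right].$$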
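As an illustration of the discounted return that these value functions are built on, the following minimal Python sketch sums a finite list of rewards weighted by powers of the discount factor. The function name `discounted_return` and the sample rewards are hypothetical; the default discount factor of 0.9 is the value listed in the hyperparameter table, and truncating the infinite sum to a finite episode is an assumption.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards so that each step
    # applies exactly one multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode with a reward of 1.0 at every step:
# 1.0 + 0.9 * 1.0 + 0.81 * 1.0
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```

With a discount factor below one, rewards received later contribute less to the return, which is why the same reward sequence yields a smaller value when it is delayed.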
2.2 Deep Q-Networks in Reinforcement Learning
To approximate the high-dimensional action value function given in the preceding section, we can use a deep Q-network (DQN) $Q(s, a; \theta)$ with trainable parameters $\theta$. To train this network, we minimize the expected squared error between the target $y_i = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$ and the current network prediction $Q(s, a; \theta_i)$ at iteration $i$. The loss function to minimize is given as
$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta_i) \right)^{2} \right],$$
where $\theta^{-}$ represents the parameters of a separate target network that greatly improves the stability of the algorithm, as shown in [4]. We refer the reader to [5] for a formal introduction to deep neural networks.
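The one-step target and the squared error inside the loss can be sketched for a single transition as follows. This is a minimal pure-Python illustration, not the authors' training code: the function names, the `done` flag for terminal transitions and the sample numbers are assumptions, and the target-network Q-values for the next state are passed in as a plain list.

```python
def td_target(reward, next_q_target, gamma=0.9, done=False):
    """One-step target: reward + gamma * max over next-state target Q-values."""
    if done:
        # Assumption: no bootstrapping from a terminal state.
        return reward
    return reward + gamma * max(next_q_target)


def squared_td_error(q_pred, target):
    """Squared difference between the target and the online prediction."""
    return (target - q_pred) ** 2


def batch_loss(q_preds, targets):
    """Mean squared error over a mini-batch, a sample estimate of the loss."""
    errors = [squared_td_error(q, y) for q, y in zip(q_preds, targets)]
    return sum(errors) / len(errors)


# Hypothetical transition: reward 1.0 and target-network Q-values for s'.
y = td_target(1.0, [0.5, 2.0, 1.0], gamma=0.9)
loss = batch_loss([2.0], [y])
```

Keeping the target Q-values separate from the online prediction mirrors the role of the target network: the target stays fixed while the online parameters are updated.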
2.3 Entropy
Let $X$ be a discrete random variable that is completely defined by the set $\mathcal{X}$ of values that it takes and its probability distribution $p_X(x)$. Here we assume that $\mathcal{X}$ is a finite set, thus the random variable can only have finitely many realizations. The value $p_X(x)$ is the probability that the random variable $X$ takes the value $x$. The probability distribution $p_X : \mathcal{X} \to [0, 1]$ must satisfy the following condition
$$\sum_{x \in \mathcal{X}} p_X(x) = 1.$$
The entropy of a discrete random variable $X$ with probability distribution $p_X(x)$ is defined as
$$H_X = -\sum_{x \in \mathcal{X}} p_X(x) \log_b p_X(x),$$
where the logarithm is taken to the base $b$ and we define by continuity that $0 \log_b 0 = 0$.
Intuitively, entropy quantifies the uncertainty associated with a random variable. The greater the entropy, the greater is the surprise associated with realization of a random variable.
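A short Python sketch can make this intuition concrete. It computes the entropy of a finite distribution and treats zero-probability outcomes as contributing nothing, following the continuity convention. The function name `entropy` and the example distributions are hypothetical.

```python
import math


def entropy(probs, base=2.0):
    """Entropy of a finite discrete distribution, in units set by `base`."""
    total = 0.0
    for p in probs:
        if p > 0.0:
            # Outcomes with zero probability are skipped: 0 * log(0) := 0.
            total -= p * math.log(p, base)
    return total


# Hypothetical five-outcome distributions.
uniform_entropy = entropy([0.2] * 5, base=5.0)        # maximal uncertainty
certain_entropy = entropy([1.0, 0.0, 0.0, 0.0, 0.0])  # no uncertainty
```

A uniform distribution gives the largest entropy for a given number of outcomes, while a distribution concentrated on a single outcome gives zero.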
3 Approach
In reinforcement learning, an agent is trained to perform a task by maximizing the cumulative reward that it receives as feedback for its interactions with the environment. We use deep Q-learning for autonomous robotic navigation, where the state observation consists of a 360° LiDAR scan and the distance between the current robot position and the desired goal position. The LiDAR scan is generated by the 360° Laser Distance Sensor (LDS) mounted on the robot. The robot is given a constant linear velocity of 0.15 m/s in the forward direction. In each state, the agent decides the angular velocity of the robot by choosing an action from a set of five possible actions, $\mathcal{A} = \{0, 1, 2, 3, 4\}$. Each action corresponds to a specific angular velocity of the robot, and this correspondence is given in Table 1. The details about the reward function are given in Section 4.
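The observation and action interface described here can be sketched in a few lines of Python. The angular velocities are those of Table 1 and the linear velocity is the stated 0.15 m/s; the function names, the plain-list observation layout and the three-sample scan in the example are hypothetical, since the number of LiDAR samples is not specified.

```python
LINEAR_VELOCITY = 0.15  # m/s, constant forward speed
ANGULAR_VELOCITIES = (-1.5, -0.75, 0.0, 0.75, 1.5)  # rad/s, indexed by action


def action_to_command(action):
    """Map a discrete action index to (linear, angular) velocity."""
    if not 0 <= action < len(ANGULAR_VELOCITIES):
        raise ValueError(f"unknown action: {action}")
    return LINEAR_VELOCITY, ANGULAR_VELOCITIES[action]


def build_observation(lidar_scan, goal_distance):
    """Concatenate the LiDAR ranges with the distance to the goal.

    Assumption: the observation is a flat list with the distance last.
    """
    return list(lidar_scan) + [goal_distance]


# Hypothetical usage: action 2 drives straight ahead.
command = action_to_command(2)
observation = build_observation([1.2, 0.8, 3.5], goal_distance=2.0)
```

Because the linear velocity is fixed, the policy only has to choose how sharply to turn, which keeps the action space small.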
Efficient exploration is crucial for effective learning in reinforcement learning. The need for a good exploration strategy is even greater in deep reinforcement learning, where we have to deal with high-dimensional state and action spaces and complex (possibly nonlinear) function approximation architectures. To explore the state space efficiently, we employ the entropy-based exploration strategy, which quantifies the agent's learning in a state based on the difference between the state dependent action values. EBE devises a probability distribution over actions in the state from the state dependent action values and uses the entropy of this distribution to decide how much to explore in that state. For states where some actions are decisively better than others, the probability distribution of a learnt policy is highly skewed towards the better actions, thereby reducing the entropy in that state. Details about the EBE exploration strategy are given in Section 5.
4 About the Reward Function
The reward function is based on the notion that the agent should receive a positive reward for moving towards the goal position and a negative reward for moving away from the goal position. Therefore, we define the reward function as
[TABLE]
where and . Here, is the action applied by the agent and is the heading angle. Heading angle gives the angle between the direction the robot is travelling and the goal location. It is given by
[TABLE]
where is the goal angle between the goal position and the horizontal axis and is the yaw angle of robot with respect to horizontal axis. See Figure 1 for a visual explanation of heading angle .
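The heading angle described above can be computed from the robot pose and the goal position. The sketch below is one plausible reading rather than the authors' implementation: it assumes the heading is the goal angle minus the yaw angle, that both angles are measured from the horizontal axis in radians, and that the result is wrapped to the interval [-π, π). The function name and argument layout are hypothetical.

```python
import math


def heading_angle(robot_xy, goal_xy, yaw):
    """Angle between the robot's travel direction and the goal direction."""
    # Goal angle: direction from the robot to the goal, measured from
    # the horizontal axis.
    goal_angle = math.atan2(goal_xy[1] - robot_xy[1], goal_xy[0] - robot_xy[0])
    # Assumed sign convention: goal angle minus yaw.
    heading = goal_angle - yaw
    # Wrap to [-pi, pi) so that small misalignments stay small numbers.
    return (heading + math.pi) % (2.0 * math.pi) - math.pi


# Hypothetical poses: goal straight ahead, then goal directly to the left.
aligned = heading_angle((0.0, 0.0), (1.0, 0.0), yaw=0.0)
left = heading_angle((0.0, 0.0), (0.0, 1.0), yaw=0.0)
```

A heading of zero means the robot is pointing directly at the goal; the magnitude grows as the robot turns away from it.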
gives the distance reward and is given by where is the current distance of the robot from the goal location. The function is visualized in the Figure 2. The agent is given a large positive reward for reaching the goal location while it is given a reward of for colliding into an obstacle or a wall.
5 Entropy-Based Exploration (EBE) for Robotic Navigation
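Entropy-based exploration (EBE) [2] uses the difference between Q-values in a state as an estimate of the agent's learning progress in that state. Defining a probability distribution over the actions in a state $s$, we have
$$p_s(a) = \frac{e^{Q_0(s, a)}}{\sum_{a' \in \mathcal{A}} e^{Q_0(s, a')}},$$
where $Q_0(s, a) = Q(s, a) - \max_{a' \in \mathcal{A}} Q(s, a')$. We then use $p_s$ to obtain the state dependent entropy, $H(s)$, as follows
$$H(s) = -\sum_{a \in \mathcal{A}} p_s(a) \log_b p_s(a),$$
where $b$ is the base of the logarithm. We note that $H(s)$ may be greater than 1 when $b < |\mathcal{A}|$; therefore, we normalize $H(s)$ between 0 and 1. The maximum value the entropy can take is $\log_b |\mathcal{A}|$, so we define a scaled entropy $\bar{H}(s) \in [0, 1]$ as follows:
$$\bar{H}(s) = \frac{H(s)}{\log_b |\mathcal{A}|}. \tag{3}$$
Here $|\mathcal{A}| = 5$; therefore, we have $\bar{H}(s) = H(s) / \log_b 5$. Given $\bar{H}(s)$ in a state $s$ from equation (3), the agent explores with probability $\bar{H}(s)$, i.e. it behaves randomly with that probability and acts greedily otherwise. In practice, entropy-based exploration is similar to the well-known $\epsilon$-greedy exploration method with $\epsilon$ replaced by the state dependent $\bar{H}(s)$.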
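The procedure of this section can be sketched as a single action-selection step. This is a hedged illustration rather than the authors' code: it assumes the action distribution is a softmax over the Q-values shifted by their maximum and that the entropy is normalized by its largest possible value for the number of actions. The function names, the `rng` argument and the example Q-values are hypothetical.

```python
import math
import random


def scaled_entropy(q_values, base=2.0):
    """Entropy of the softmax over Q-values, normalized to [0, 1].

    Requires at least two actions so that the normalizer is non-zero.
    """
    q_max = max(q_values)
    # Shifting by the maximum leaves the softmax unchanged and avoids overflow.
    exps = [math.exp(q - q_max) for q in q_values]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    h = -sum(p * math.log(p, base) for p in probs if p > 0.0)
    return h / math.log(len(q_values), base)


def ebe_action(q_values, rng=random):
    """Explore with probability equal to the scaled entropy, else act greedily."""
    if rng.random() < scaled_entropy(q_values):
        return rng.randrange(len(q_values))  # uniformly random action
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical states: one undecided, one with a clearly best action.
undecided = scaled_entropy([1.0, 1.0, 1.0, 1.0, 1.0])
decided = scaled_entropy([0.0, 0.0, 0.0, 0.0, 50.0])
greedy_choice = ebe_action([0.0, 0.0, 0.0, 0.0, 50.0], rng=random.Random(0))
```

When all Q-values are equal the scaled entropy is close to one and the agent almost always explores; when one action clearly dominates it is close to zero and the agent almost always exploits.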
6 Experiments
In this section, we share details about the experiments performed on the TurtleBot3 platform. Experiments were performed in different environment configurations, as detailed below.
We use a deep neural network to approximate the high-dimensional state dependent action value function. The architecture of the network is given in Table 2.
6.1 Environments with no obstacles
In this configuration, the environment consists of a square maze in which the robot is tasked to reach randomly generated goal positions without colliding with the maze walls. The episode ends when the robot collides with any of the maze walls or has consumed 500 time steps. The reward function is the same as described in Section 4.
We compare the Entropy-Based Exploration (EBE) strategy with the well-known $\epsilon$-greedy exploration heuristic. For the baseline $\epsilon$-greedy exploration heuristic, the exploration fraction in each episode, $\epsilon_i$, is given for episode $i$ as
[TABLE]
for where is the total number of episodes. Here, , and .
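For comparison with EBE, the baseline action-selection rule can be sketched as below. This is a generic $\epsilon$-greedy step for a given exploration fraction; the per-episode schedule and its parameters are not reproduced here, and the function name and the `rng` argument are hypothetical.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for three actions.
greedy = epsilon_greedy_action([0.1, 0.9, 0.3], epsilon=0.0)
explored = epsilon_greedy_action([0.1, 0.9, 0.3], epsilon=1.0,
                                 rng=random.Random(0))
```

Unlike EBE, the exploration probability here is the same in every state within an episode, regardless of how decisive the Q-values are.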
The agents are trained on a 4x4 square maze shown in Figure 3(a), and the training progress is shown in Figure 3(b). Both agents are trained using the deep Q-learning algorithm for 500 episodes. The hyperparameters used in the training process are given in Table 3.
We see in Figure 3(b) that the EBE strategy performs better both in terms of the reward of the learnt policy and in terms of learning speed, indicating efficient exploration. The policy learnt using the $\epsilon$-greedy exploration strategy settles at a much lower score.
To test the robustness and generalization ability of the learnt policies, we test the agents trained on the 4x4 square maze on 2x2, 4x4 and 8x8 square mazes. We also test both agents on mazes of different shapes, including a triangular maze shown in Figure 4(a) and a pentagonal maze shown in Figure 4(b). Note that the mazes shown in Figure 4 were never shown to the agents during the training process. The results of these test experiments are reported in Table 4. We see in Table 4 that the policy learnt using EBE performs better on the 4x4 and 8x8 square mazes. It, however, lags behind the policy learnt using $\epsilon$-greedy exploration on the 2x2 square maze. The low score of the EBE policy on the 2x2 maze can be explained by noting that the EBE policy learns to take turns of larger radius than the $\epsilon$-greedy policy. Since the 2x2 maze is smaller than the 4x4 and 8x8 mazes, for goal positions located close to the maze walls the possibility of collision with the walls is greater for the EBE policy than for the $\epsilon$-greedy policy, resulting in a lower score for the EBE policy. Table 4 also shows the results of experiments with the triangular and pentagonal mazes, where the EBE policy obtains better scores than the $\epsilon$-greedy policy. Note that these agents were trained only on the 4x4 maze shown in Figure 3(a).
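The comparison in Table 4 can be summarized with a little arithmetic. The sketch below computes the relative difference between the two strategies per environment, assuming the first number in each table cell is the mean test score; the variable and function names are hypothetical, and the spread reported in the table is ignored.

```python
# Mean test scores from Table 4 as (EBE, epsilon-greedy) pairs.
TABLE_4_MEANS = {
    "4x4 Square": (3731.92, 2202.93),
    "8x8 Square": (2638.52, 1662.57),
    "2x2 Square": (4369.64, 5556.55),
    "Triangle": (3655.37, 1678.71),
    "Pentagon": (2794.04, 2065.00),
}


def relative_gain(ebe, baseline):
    """Relative difference of the EBE score with respect to the baseline."""
    return (ebe - baseline) / baseline


gains = {env: relative_gain(*scores) for env, scores in TABLE_4_MEANS.items()}
# Environments in which the EBE policy has the higher mean score.
ebe_wins = [env for env, gain in gains.items() if gain > 0]

for env, gain in gains.items():
    print(f"{env}: {gain:+.1%}")
```

Only the 2x2 square maze yields a negative value, which matches the exception discussed in the text.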
6.2 Environments having obstacles
Here, the robot again has to reach randomly generated goal positions, but in the presence of additional static obstacles. The episode ends when the robot collides with any of these obstacles or any of the maze walls.
The training conditions, hyperparameters, baseline $\epsilon$-greedy exploration schedule and reward function are the same as explained in Section 6.1. The training environment is shown in Figure 5(a), while the training progress comparing the EBE and $\epsilon$-greedy exploration strategies is shown in Figure 5(c). It can be seen in Figure 5(c) that EBE shows better learning in terms of both the learnt policy and learning speed. Both policies are tested on a 4x4 square maze (Figure 5(a)) and a pentagonal maze (Figure 5(b)).
The experiment results are given in Table 5. We see that the EBE exploration strategy outperforms the $\epsilon$-greedy exploration strategy on both environments by a considerable margin, indicating effective exploration.
7 Conclusion
We consider the problem of autonomous navigation for robotics and apply deep reinforcement learning to solve it. We use the entropy-based exploration strategy and the widely used $\epsilon$-greedy exploration heuristic to train agents with the deep Q-learning algorithm to achieve autonomous robotic navigation. We performed experiments under various environmental conditions to test the effectiveness of both exploration strategies.
8 Acknowledgement
This research has been in part supported by the ICT R&D program of MSIP/IITP [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].
References
[1] H. Choset, K. Lynch, S. Hutchinson, G. Kantor, W. Burgard, L. Kavraki, and S. Thrun, Principles of Robot Motion: Theory, Algorithms, and Implementations. MIT Press, May 2005.
[2] M. Usama and D. E. Chang, “Learning-driven exploration for reinforcement learning,” CoRR, vol. submit/2733234, 2019.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning, MIT Press, 1998.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015.
[5] A. L. Caterini and D. E. Chang, Deep Neural Networks in a Mathematical Framework. Springer Publishing Company, Incorporated, 1st ed., 2018.
