Autonomous Exploration and Mapping for Mobile Robots via Cumulative Curriculum Reinforcement Learning
Zhi Li, Jinghao Xin, and Ning Li

TL;DR
This paper introduces a novel curriculum reinforcement learning framework with a new state representation and a lightweight simulator to enhance autonomous exploration and mapping efficiency for mobile robots, addressing sample efficiency and adaptability issues.
Contribution
It proposes the Cumulative Curriculum Reinforcement Learning (CCRL) framework, a new state representation, and a lightweight simulator to improve DRL-based robot exploration and mapping.
Findings
CCRL mitigates catastrophic forgetting in DRL models.
CCRL improves sample efficiency and generalization.
The lightweight simulator accelerates training significantly.
| | Algorithm 1 Cumulative Curriculum Reinforcement Learning (CCRL) |
|---|---|
| 1: | Initialize the DRL model |
| 2: | Let $E = \{e_1, e_2, \dots, e_M\}$ be a sequence of progressively more difficult environments |
| 3: | Vectorize each environment $e_i$ to get $N$ parallel copies $E_i$ (this step is not necessary; in practice, we find that vectorizing each map can make full use of computing resources and speed up sampling) |
| 4: | Initialize the vectorized environments $E_{train} = \emptyset$ |
| 5: | for $i = 1, \dots, M$ do |
| 6: | $E_{train} \leftarrow E_{train} \cup E_i$ ($\cup$ means integrating previous and new tasks into the vectorized environments) |
| 7: | while a predefined performance criterion is not satisfied do |
| 8: | Collect transitions on $E_{train}$ and optimize the DRL model |
| 9: | end while |
| 10: | end for |

TABLE I: Sample efficiency (number of sampled transitions) of CCPPO and PPO on each level.

| Algorithm | Level-1 | Level-2 | Level-3 | Level-4 | Level-5 |
|---|---|---|---|---|---|
| CCPPO | 170.4 ± 12.5 | 13.6 ± 16.9 | 159.2 ± 25.0 | 301.6 ± 92.7 | 68.0 ± 62.2 |
| PPO | 144.8 ± 17.0 | 139.2 ± 7.8 | 167.2 ± 7.8 | 413.6 ± 47.9 | 241.6 ± 45.0 |
*This work is supported by National Nature Science Foundation under Grant (62273230). Zhi Li, Jinghao Xin and Ning Li are with Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, P.R. China, and also with Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China, and also with Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai 200240, China (E-mail: lizhibeaman, xjhzsj2019, [email protected])
Abstract
Deep reinforcement learning (DRL) has been widely applied in autonomous exploration and mapping tasks, but often struggles with the challenges of sampling efficiency, poor adaptability to unknown map sizes, and slow simulation speed. To speed up convergence, we combine curriculum learning (CL) with DRL, and first propose a Cumulative Curriculum Reinforcement Learning (CCRL) training framework to alleviate the issue of catastrophic forgetting faced by general CL. Besides, we present a novel state representation, which considers a local egocentric map and a global exploration map resized to the fixed dimension, so as to flexibly adapt to environments with various sizes and shapes. Additionally, for facilitating the fast training of DRL models, we develop a lightweight grid-based simulator, which can substantially accelerate simulation compared to popular robot simulation platforms such as Gazebo. Based on the customized simulator, comprehensive experiments have been conducted, and the results show that the CCRL framework not only mitigates the catastrophic forgetting problem, but also improves the sample efficiency and generalization of DRL models, compared to general CL as well as without a curriculum. Our code is available at https://github.com/BeamanLi/CCRL_Exploration.
I Introduction
Autonomous exploration and mapping means that mobile robots actively explore a priori unknown environments without collisions while constructing a map of the surroundings as completely as possible [1]. It has been widely applied to military reconnaissance [2], search and rescue [8], planetary exploration [3], and other fields.
Traditional autonomous exploration methods mainly include frontier-based [4] and information-based[5] strategies. The former determines the robot’s moving target according to frontiers, which are defined as the boundary regions between the free and unknown space. The latter constructs complicated optimization problems based on mutual information. In general, computational complexity and reliance on handcrafted expert features limit the applications of traditional exploration methods in the real world[2].
Thanks to breakthroughs in deep reinforcement learning (DRL)[6, 7] in the last decade, some researchers have applied DRL to autonomous exploration tasks. Most previous DRL-based exploration models[8, 9, 10, 11] need to be combined with traditional exploration or navigation algorithms, referred to as 2-stage strategies [2], hence still suffer from the model complexity issue mentioned above. In our previous work[1], we proposed an end-to-end DRL-based exploration model that directly outputs discrete control commands, but with a major limitation of poor adaptability to diverse environment sizes. In this paper, we introduce an improved map representation that considers a local egocentric map and a global exploration map resized to the fixed dimension, allowing flexible adaptation to maps with varied sizes and shapes.
One of the critical challenges when applying DRL to robotic systems, especially in the end-to-end training paradigm, is sample efficiency. Due to complicated dynamics and high-dimensional image input, training a reinforcement learning (RL) agent to learn an optimal policy may require millions of interactions with the environment[12]. This issue can be aggravated by the slow simulation speed when using popular robot simulation platforms such as Gazebo[13]. The enormous sample steps and slow simulation speed make the training time of DRL models in robotic applications prohibitive. To address the above issues, we propose the following solutions:
On the one hand, we apply curriculum learning (CL) [14], which starts learning from simple tasks and gradually increases the difficulty of the tasks [12], to improve sample efficiency and speed up convergence. However, when combining CL with RL, especially in the context of modern DRL, a crucial dilemma is catastrophic forgetting [15]: the knowledge learned from previous tasks may be gradually lost when training on new tasks. To alleviate this issue, we propose, for the first time, a Cumulative Curriculum Reinforcement Learning (CCRL) training framework. Instead of being directly transferred to the next, more difficult environment at a curriculum task switch, the agent interacts with vectorized environments composed of the previous and new tasks with the aid of AsyncVectorEnv in OpenAI Gym [16].
On the other hand, we design a lightweight grid-based autonomous exploration simulator specifically for end-to-end training and CL, which contains a series of training maps with progressively increasing difficulty based on four typical map features adapted from [17]. Our grid-based simulator supports customized maps and can significantly accelerate the simulation compared to Gazebo.
The main contributions of this paper are summarized here:
(1) We propose a Cumulative Curriculum Reinforcement Learning (CCRL) training framework to moderate the catastrophic forgetting issue faced by general CL while improving the sample efficiency and generalization of DRL models.
(2) We present an end-to-end DRL-based autonomous exploration and mapping model with a size-adaptive map representation, which can flexibly adapt to environments with different sizes and shapes.
(3) We customize a concise grid-based autonomous exploration simulator specifically for end-to-end training and curriculum learning, facilitating fast implementation, verification, and comparison of DRL algorithms.
II Related Work
Traditional Exploration Methods. Frontier-based exploration is the most widespread of the traditional exploration methods. It was first proposed by Yamauchi [4], where the robot always naively navigated to the nearest frontier. In the following decades, various improvements to the frontier-based strategy have been developed, mainly focusing on how to select the most promising frontier, including path cost [18], information gain [19], potential field [20], etc. Traditional exploration methods often rely on handcrafted expert features and strong assumptions about specific tasks, which reduces their capacity to adapt to diverse unknown environments. Besides, as the map size and robot action space expand, the computational complexity and decision time of traditional methods grow substantially [21].
DRL-based Exploration Methods. Niroui et al.[8] combined DRL with the traditional frontier-based methods, where the A3C[22] model output the weight parameters of each frontier. The frontier with the lowest cost calculated by the predefined cost function would be assigned to the robot. In [9], an A3C policy selected one of the six sector subregions centered on the robot as the next visiting direction, and the robot navigated to the target in this candidate subregion determined by the next-best-view algorithm [23]. Wang et al. [11] presented an autonomous exploration method based on spatial action maps, where action commands could be represented as pixels on the map. The DDQN[24] algorithm was employed to encode the Q-value of each pixel, and the robot chose to move to the target point with the highest Q-value. Since the above DRL-based models need to be combined with traditional exploration or navigation algorithms, the problem of computational burden described above still exists. In addition, most previous DRL models use global or local grid maps with fixed dimensions as the state space, which are less adaptable to varying environment sizes. In this paper, we propose a size-adaptive end-to-end DRL-based exploration model, which directly outputs discrete control commands and flexibly adapts to diverse maps.
Curriculum Learning. To speed up convergence and improve training performance or sample efficiency, curriculum learning techniques have been widely used in DRL-based mobile robot navigation (e.g., gradually increasing the number of obstacles and the distance between the robot and the target[25]) and exploration (e.g., training an agent to explore environments with gradually growing sizes[26, 27]) tasks. To address the catastrophic forgetting problem existing in the general CL, Rusu et al.[15] proposed a “progressive neural network”, which trained a new network “column” for each new task. When training subsequent columns, parameters from previous columns would be frozen. The main limitation of this method is that the number of model parameters and the inference time will increase with the number of tasks. In contrast, our CCRL framework only increases the number of vectorized environments during training, and the neural network structure is consistently fixed.
III Methods
III-A End-to-end DRL-based Autonomous Exploration and Mapping
III-A1 Problem Formulation
We formulate the autonomous exploration and mapping task as a Markov decision process (MDP). At timestep $t$, the agent observes the state of the environment $s_t \in \mathcal{S}$, takes the action $a_t \in \mathcal{A}$ according to the policy $\pi(a_t \mid s_t)$, receives the reward $r_t$, and then transitions to the next state $s_{t+1}$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. The goal of the RL agent is to learn an optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward $\mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right]$, where $\gamma$ is the discount factor. In this paper, we implement Proximal Policy Optimization (PPO) [28] as the underlying DRL algorithm, a popular and powerful on-policy DRL algorithm that has been widely applied in locomotion control [29], video games [30], robot navigation [31], etc. An overview of our end-to-end DRL-based autonomous exploration and mapping model is shown in Fig. 1.
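As a small worked example of the objective above, the discounted cumulative reward of one episode can be computed as follows. This is only an illustrative sketch: the function name and the default discount factor of 0.99 are assumptions, not values taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * r_t for one episode's reward sequence.

    gamma=0.99 is an assumed default; the paper's value is not given here.
    """
    g = 0.0
    # Accumulate backwards so that each earlier step discounts the tail once.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

PPO maximizes the expectation of this quantity over episodes generated by the policy.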
III-A2 State Space
We propose a novel map representation that combines local and global map information, as well as the agent’s location, which can adapt to different sizes of environments.
Local Egocentric Map (LEM). At timestep $t$, we employ a SLAM (Simultaneous Localization and Mapping) module to construct the 2D occupancy grid map and estimate the agent’s location and orientation. A local egocentric map (LEM) with a fixed dimension can be extracted from this grid map, where the pixel values of the occupied, unknown, and free grids are 255, 128, and 0, respectively. The LEM contains only local information within a limited field of view centered on the agent.
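The LEM extraction can be sketched as a fixed-size crop around the agent. The pixel values come from the text; everything else is an assumption for illustration: the function name, the window size passed by the caller, the choice to treat cells outside the map as unknown, and the choice not to rotate the window with the agent's heading.

```python
UNKNOWN, FREE, OCCUPIED = 128, 0, 255  # pixel values stated in the text


def local_egocentric_map(grid, row, col, size):
    """Crop a size x size window of `grid` centered on (row, col).

    Cells that fall outside the map are filled with UNKNOWN (an assumption).
    """
    half = size // 2
    lem = []
    for r in range(row - half, row - half + size):
        line = []
        for c in range(col - half, col - half + size):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else UNKNOWN)
        lem.append(line)
    return lem


grid = [[FREE, OCCUPIED, FREE],
        [FREE, FREE, FREE],
        [UNKNOWN, FREE, OCCUPIED]]
lem = local_egocentric_map(grid, 0, 0, 3)  # agent in the top-left corner
```

Because the window size is fixed, the output dimension does not depend on how large the underlying map is.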
Global Exploration Map (GEM). We extract the maximum rectangular boundary of the explored region from the occupancy grid map, where all occupied and free grids are termed the explored state, represented by 255, and unknown grids are termed the unexplored state, represented by 0. In our DRL model, a Convolutional Neural Network (CNN) is used as an encoder to extract features from the map representation. A standard CNN can only process image inputs with fixed dimensions, while the size and shape of the environment are a priori unknown, and the dimension of the extracted region may keep changing during the exploration. Therefore, we use nearest-neighbor interpolation to resize the extracted region to the same dimension as the LEM, and mark the relative location of the agent on it as a fixed-size patch with the pixel value of 128. We refer to this scaled map as the global exploration map (GEM), which provides global perceptual information and ensures that the dimensions of the images fed into the CNN are constant. Finally, we stack the LEM and the GEM along the channel dimension, resulting in the final two-channel map representation.
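A minimal sketch of the GEM construction is given below, assuming an occupancy grid with the pixel values used for the LEM. The function name, the index-mapping convention used for nearest-neighbor resizing, and the single-pixel agent marker are assumptions made for illustration (the text gives the marker its own dimension, which is not reproduced here).

```python
UNKNOWN = 128  # pixel value of unknown cells in the occupancy grid


def global_exploration_map(grid, agent_rc, size):
    """Resize the explored bounding box of `grid` to size x size and mark the agent."""
    explored = [(r, c) for r, row in enumerate(grid)
                for c, v in enumerate(row) if v != UNKNOWN]
    if not explored:
        return [[0] * size for _ in range(size)]
    # Maximum rectangular boundary of the explored region.
    r0, r1 = min(r for r, _ in explored), max(r for r, _ in explored)
    c0, c1 = min(c for _, c in explored), max(c for _, c in explored)
    h, w = r1 - r0 + 1, c1 - c0 + 1
    # Nearest-neighbor resize: each output pixel copies the source cell it maps
    # onto; explored cells become 255 and unexplored cells become 0.
    gem = [[255 if grid[r0 + i * h // size][c0 + j * w // size] != UNKNOWN else 0
            for j in range(size)] for i in range(size)]
    # Mark the agent's relative location (one pixel here, an assumed simplification).
    ar = min(size - 1, max(0, (agent_rc[0] - r0) * size // h))
    ac = min(size - 1, max(0, (agent_rc[1] - c0) * size // w))
    gem[ar][ac] = 128
    return gem


grid = [[128, 128, 128, 128],
        [128, 0, 255, 128],
        [128, 0, 128, 128],
        [128, 128, 128, 128]]
gem = global_exploration_map(grid, agent_rc=(2, 1), size=4)
```

Whatever the size of the explored rectangle, the result always has the same dimension, which is what allows it to be stacked with the LEM as a second channel.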
Auxiliary Information. In addition to the map representation described above, the state space also includes a vector consisting of the lidar ranging results and the agent’s orientation. The former assists the agent’s obstacle avoidance, and the latter is necessary for the agent to perceive its own direction in the environment.
III-A3 Action Space
Due to the end-to-end training paradigm of our DRL model, the action space comprises three discrete control commands: straight forward, turn left, and turn right.
III-A4 Reward Function
The reward function is shown in equation (1):
[TABLE]
which contains the following three components:
Encouraging exploration. This component depends on the map exploration rate at timestep $t$. If the agent explores a new region, the reward is proportional to a term that yields larger rewards in the later stage of exploration. Meanwhile, to prevent excessively large rewards from harming the training of DRL models, the reward is clipped to a bounded range. Otherwise, the agent receives a minor penalty.
[TABLE]
Successful exploration. Once the completion condition is satisfied, the exploration task is considered accomplished: the agent receives a bonus, and the episode is terminated.
[TABLE]
Obstacle avoidance. If the agent collides with obstacles or walls, it receives a penalty, and the episode is terminated.
[TABLE]
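The three reward components can be illustrated with the sketch below. It is not the paper's equation (1): every numeric constant, the success threshold of 0.95, the choice of making the exploration reward proportional to the current exploration rate, and the way the components are combined are hypothetical placeholders chosen only to show the structure.

```python
def step_reward(rate, prev_rate, collided,
                scale=1.0, clip=(0.0, 1.0), idle_penalty=-0.01,
                success_rate=0.95, success_bonus=1.0, collision_penalty=-1.0):
    """Return (reward, done) for one step. All numeric defaults are hypothetical."""
    if collided:
        # Obstacle avoidance: penalty and episode termination.
        return collision_penalty, True
    if rate > prev_rate:
        # Encouraging exploration: assumed to grow with the current rate, then clipped.
        reward = min(clip[1], max(clip[0], scale * rate))
    else:
        # No new region explored: minor penalty.
        reward = idle_penalty
    if rate >= success_rate:
        # Successful exploration: bonus and episode termination.
        return reward + success_bonus, True
    return reward, False


r_new, done_new = step_reward(rate=0.5, prev_rate=0.4, collided=False)
r_hit, done_hit = step_reward(rate=0.3, prev_rate=0.3, collided=True)
```

Only the two terminal branches end the episode, which matches the description of the success and collision components.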
III-B Grid-based Autonomous Exploration Simulator
Popular robot simulation platforms, such as Gazebo, are capable of simulating realistic physical properties, but their extremely slow simulation speed prohibits the fast training and evaluation of DRL algorithms. A grid-based autonomous exploration simulator is proposed in [17], but it can only be used to train 2-stage DRL-based exploration models [8, 9, 10, 11]. Inspired by this, we design a more concise and lightweight grid-based autonomous exploration simulator customized for end-to-end training, where the robot is abstracted as a pixel in the grid world. We use the ray-tracing algorithm [32] to simulate the scanning and mapping process of a 2D lidar, and provide the ground-truth location and orientation of the agent directly, replacing the slow and computationally complex SLAM process in Gazebo. Besides, the agent’s movements in the grid world (moving straight forward by one grid, turning left, or turning right) can be considered to be completed instantaneously, replacing the time-consuming moving process in Gazebo. In our tests, the single-step simulation time of the proposed grid-based simulator is about 0.0025 s, which is much faster than Gazebo.
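To make the lidar simulation concrete, the sketch below marches rays through a grid world and copies the cells they cross into the agent's map. It is a simple fixed-step ray march, not necessarily the ray-tracing algorithm of [32], and the field of view, angular resolution, maximum range, and step length are assumed values.

```python
import math

FREE, UNKNOWN, OCCUPIED = 0, 128, 255


def scan(world, known, row, col, heading_deg,
         fov_deg=360, step_deg=10, max_range=8.0, step=0.5):
    """Cast rays from (row, col), reveal crossed cells in `known`, return ray ranges."""
    ranges = []
    for k in range(fov_deg // step_deg):
        angle = math.radians(heading_deg - fov_deg / 2 + k * step_deg)
        dist, hit = 0.0, max_range
        while dist <= max_range:
            # Rows grow downwards, so a positive angle moves the ray up the grid.
            r = math.floor(row - dist * math.sin(angle) + 0.5)
            c = math.floor(col + dist * math.cos(angle) + 0.5)
            if not (0 <= r < len(world) and 0 <= c < len(world[0])):
                hit = dist
                break
            known[r][c] = world[r][c]  # mapping: reveal the cell the ray passes
            if world[r][c] == OCCUPIED:
                hit = dist             # the ray stops at the first obstacle
                break
            dist += step
        ranges.append(hit)
    return ranges


W = OCCUPIED
world = [[W, W, W, W, W],
         [W, 0, 0, 0, W],
         [W, 0, 0, 0, W],
         [W, 0, 0, 0, W],
         [W, W, W, W, W]]
known = [[UNKNOWN] * 5 for _ in range(5)]
ranges = scan(world, known, row=2, col=2, heading_deg=0, step_deg=90)
```

Since the agent's pose is read directly from the simulator, a scan like this is the only perception step needed per action, which is one reason such a simulator can be much cheaper than a physics-based one.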
Furthermore, our grid-based simulator supports diverse customized maps. In this paper, we design a set of progressively more difficult training maps, as shown in Fig. 2(a), based on four typical map features adapted from [17], which establish the foundation for the following curriculum learning. In addition, we also build a set of test maps with different sizes and layouts to evaluate the generalization of the DRL models, as shown in Fig. 2(b).
III-C Cumulative Curriculum Reinforcement Learning
To speed up convergence and improve sample efficiency, we apply curriculum learning to the training of DRL models based on the simulation environments in Fig. 2. A prevalent problem when combining CL with DRL is catastrophic forgetting[15]. The reason is that the weights of the neural network optimized for the previous tasks have to be partially modified so as to meet the optimization objectives of the new tasks, which usually results in a deteriorated performance on the original tasks[12]. In order to mitigate this issue, we first propose a Cumulative Curriculum Reinforcement Learning (CCRL) training framework, as shown in Algorithm 1.
The main difference between our CCRL and general CL is the concept of “cumulative”: When the training process switches from the former stage to the next, instead of being directly transferred to the following more complex environment, the agent will interact with vectorized environments composed of both historical and new tasks with the aid of AsyncVectorEnv in OpenAI Gym[16]. Vectorized environments run multiple independent copies of the same environment in parallel, take a batch of actions as input, and return a batch of observations and rewards, which is particularly efficient when using neural networks to process batch data. The DRL model will be optimized based on the transitions collected from past and new environments, enabling the agent to learn additional skills on the new task without forgetting the knowledge acquired from the past.
An essential advantage of our proposed CCRL framework is that it can be easily integrated with mainstream DRL algorithms. In this paper, we combine the PPO algorithm with the CCRL framework, named CCPPO (Cumulative Curriculum PPO), as shown in Fig. 3(a). As a comparison, we also train PPO with general CL (Fig. 3(b)) and without a curriculum (Fig. 3(c)), named CPPO (Curriculum PPO) and PPO, respectively. It is worth noting that the number of vectorized environments per stage in CPPO is the same as in CCPPO for a fair comparison.
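The difference between the cumulative and the general curriculum can be sketched by listing which environments are in the training pool at each stage. The function and level names are hypothetical; the general-CL branch follows the statement that CPPO uses the same number of vectorized environments per stage as CCPPO.

```python
def stage_environments(levels, n_copies, cumulative):
    """Return, for each curriculum stage, the list of levels in the training pool."""
    stages, pool = [], []
    for i, level in enumerate(levels, start=1):
        if cumulative:
            # CCRL: add N copies of the new level to the existing pool.
            pool = pool + [level] * n_copies
        else:
            # General CL: same pool size, but only the current level.
            pool = [level] * (n_copies * i)
        stages.append(pool)
    return stages


levels = ["level-1", "level-2", "level-3"]
ccrl = stage_environments(levels, n_copies=2, cumulative=True)
general = stage_environments(levels, n_copies=2, cumulative=False)
```

In a Gym-based implementation, each entry of such a pool would correspond to one environment constructor in the list handed to AsyncVectorEnv, so the policy network itself never changes between stages.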
IV Experiment
In this section, we train CCPPO, CPPO and PPO algorithms in our grid-based simulator, taking full advantage of its rapidity for fast implementation, evaluation and comparison of DRL algorithms.
IV-A Basic Experimental Settings
To be fair, all DRL algorithms in the training process use the same hyperparameters, which can be found in our open-source code. Moreover, the following experimental settings are common to all algorithms and environments.
IV-A1 Improving generalization
To improve the generalization of DRL algorithms, we adopt the following three tricks:
- Before the beginning of each episode, the location and orientation of the agent will be randomly initialized.
- Before the beginning of each episode, four obstacles will be randomly placed in the environment.
- The data augmentation is implemented by rotating the map representation in the state space, as in [30].
IV-A2 State space settings
The dimensions of the map representation and agent’s location are set to and . The scanning angle range of the lidar is with a resolution of . The map representation and auxiliary information are both normalized into . The number of vectorized environments in each map is set to .
IV-A3 Criterion for switching to the next training stage
During curriculum learning, training switches to the next stage once the average map exploration rate over the last 10 evaluations on the current training map (for CCPPO, on the current highest-level map) exceeds 0.95.
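The switching criterion can be sketched with a sliding window over evaluation results. The window of 10 evaluations and the 0.95 threshold come from the text; the class name and the choice to wait until the window is full before switching are assumptions.

```python
from collections import deque


class StageSwitch:
    """Track recent evaluation results and decide when to advance the curriculum."""

    def __init__(self, window=10, threshold=0.95):
        self.rates = deque(maxlen=window)  # keeps only the last `window` evaluations
        self.threshold = threshold

    def update(self, exploration_rate):
        """Record one evaluation; return True when the next stage should start."""
        self.rates.append(exploration_rate)
        full = len(self.rates) == self.rates.maxlen  # assumed: require a full window
        return full and sum(self.rates) / len(self.rates) > self.threshold

    def reset(self):
        """Clear the history after switching to a new stage."""
        self.rates.clear()


switch = StageSwitch()
decisions = [switch.update(0.96) for _ in range(10)]
```

For CCPPO, the rate passed to `update` would be the one measured on the current highest-level map.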
IV-B Training Stage
IV-B1 CCPPO vs CPPO
The training curves for CCPPO and CPPO are shown in Fig. 4(a). When CPPO switches to the next, more difficult map, the map exploration rates on the previous levels gradually decrease, especially on more challenging levels (such as level-3 and level-4); this is the so-called catastrophic forgetting. In contrast, in CCPPO, the map exploration rates on all levels converge to nearly 1.0 simultaneously, effectively alleviating the problem of catastrophic forgetting.
In addition, on the last three levels, the initial map exploration rates (indicated by the horizontal dashed lines in the figure) of CCPPO are all higher than those of CPPO. This demonstrates that CCRL can “accumulate” knowledge learned from previous tasks and transfer it among different levels, so as to quickly adapt to more complex environments. General curriculum learning, by contrast, may overfit the current training environment and thus forget previous skills.
IV-B2 CCPPO vs PPO
In this experiment, since PPO does not use curriculum learning, the comparison focuses on sample efficiency, which is defined as the number of transitions sampled on each level when the exponential moving average (EMA) of the map exploration rate first reaches 0.95. The update rule of EMA is formulated as
[TABLE]
where , is the map exploration rate at the th evaluation and is the corresponding EMA. We conduct experiments under five different random seeds; the training curves for CCPPO and PPO are shown in Fig. 4(b), and the sample efficiencies are listed in TABLE I.
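The sample-efficiency measurement can be sketched as follows. The 0.95 threshold is from the text, while the smoothing coefficient, the initialization of the EMA with the first evaluation, the form of the update, and the assumption of a fixed number of transitions between evaluations are illustrative choices rather than the paper's settings.

```python
def sample_efficiency(rates, steps_per_eval, alpha=0.1, threshold=0.95):
    """Return the transitions sampled when the EMA first reaches `threshold`, else None.

    rates: map exploration rates from successive evaluations.
    alpha: assumed smoothing coefficient (weight of the newest evaluation).
    """
    ema = None
    for k, rate in enumerate(rates, start=1):
        # Assumed update: ema_k = (1 - alpha) * ema_{k-1} + alpha * rate_k.
        ema = rate if ema is None else (1 - alpha) * ema + alpha * rate
        if ema >= threshold:
            return k * steps_per_eval
    return None


steps = sample_efficiency([0.0, 1.0, 1.0, 1.0, 1.0, 1.0], steps_per_eval=100, alpha=0.5)
```

Smoothing with an EMA keeps a single lucky evaluation from being counted as convergence; a lower returned value means higher sample efficiency.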
The sample efficiencies of CCPPO on the last four levels are all clearly higher than those of PPO, i.e., CCPPO needs fewer transitions to reach the threshold. This suggests that CCRL facilitates the accumulation and transfer of knowledge among different tasks and thereby improves sample efficiency, compared with learning from scratch without a curriculum.
IV-C Generalization Experiments
To compare the zero-shot generalization of different algorithms, after training the same number of steps in the training maps, we directly transfer the final models of CCPPO, CPPO and PPO to a set of test maps with different sizes and layouts (as shown in Fig. 2(b)). There are two metrics for evaluating the performance of different algorithms:
- Map exploration rate: defined as the final map exploration rate at the end of an episode, indicating the exploration completeness of the algorithm.
- Exploration steps: defined as the number of steps when the map exploration rate first reaches 0.95, indicating the exploration efficiency of the algorithm.
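Both metrics can be read off the per-step map exploration rates of one episode, as in the sketch below. The function name and the choice of returning `None` when 0.95 is never reached are assumptions.

```python
def episode_metrics(rates, threshold=0.95):
    """Return (final map exploration rate, exploration steps) for one episode.

    rates: map exploration rate after each step of the episode.
    Exploration steps is None when the threshold is never reached (assumed convention).
    """
    final_rate = rates[-1]
    steps = next((t for t, r in enumerate(rates, start=1) if r >= threshold), None)
    return final_rate, steps


metrics = episode_metrics([0.2, 0.6, 0.96, 0.99])
```

Aggregating these pairs over all test episodes gives the kind of statistics summarized per test map.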
We run 20 episodes under each of five random seeds (100 episodes per test map in total), and the statistics of map exploration rates and exploration steps are shown in Fig. 5. CCPPO achieves higher map exploration rates and fewer exploration steps on all five test maps compared to CPPO and PPO. The reason is that CCRL focuses more on the accumulation of knowledge during the learning process, making it easier to generalize generic skills to unseen environments, rather than simply memorizing fixed sequences of actions. In addition, CPPO has the worst generalization and exploration efficiency due to the issues of catastrophic forgetting and overfitting in general curriculum learning. The mapping results of CCPPO in the training and test maps are shown in Fig. 6.
V Conclusions
In this paper, we train an end-to-end autonomous exploration and mapping model based on DRL and curriculum learning. We present an improved state representation that can adapt to environments of different sizes. Besides, we customize a concise grid-based autonomous exploration simulator specifically for end-to-end training and curriculum learning, facilitating fast implementation, verification and comparison of DRL algorithms. In addition, we propose a Cumulative Curriculum Reinforcement Learning (CCRL) training framework to moderate the catastrophic forgetting issue faced by general curriculum learning; as the experimental results show, it also improves the sample efficiency and generalization of DRL algorithms. In future research, we will focus on sim-to-real transfer and conduct experiments in the real world.
- [1] Z. Li, J. Xin, and N. Li, “End-to-end autonomous exploration for mobile robots in unknown environments through deep reinforcement learning,” in 2022 IEEE International Conference on Real-time Computing and Robotics (RCAR). IEEE, 2022, pp. 475–480.
- [2] L. C. Garaffa, M. Basso, A. A. Konzen, and E. P. de Freitas, “Reinforcement learning for mobile robotics exploration: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
- [3] D. I. Koutras, A. C. Kapoutsis, A. A. Amanatiadis, and E. B. Kosmatopoulos, “Marsexplorer: Exploration of unknown terrains via deep reinforcement learning and procedurally generated environments,” Electronics, vol. 10, no. 22, p. 2751, 2021.
- [4] B. Yamauchi, “A frontier-based approach for autonomous exploration,” in Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97. ‘Towards New Computational Principles for Robotics and Automation’. IEEE, 1997, pp. 146–151.
- [5] F. Bourgault, A. A. Makarenko, S. B. Williams, B. Grocholsky, and H. F. Durrant-Whyte, “Information based adaptive robotic exploration,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1. IEEE, 2002, pp. 540–545.
- [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
- [7] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
- [8] F. Niroui, K. Zhang, Z. Kashino, and G. Nejat, “Deep reinforcement learning robot for search and rescue applications: Exploration in unknown cluttered environments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 610–617, 2019.
