Sim-and-Real Reinforcement Learning for Manipulation: A Consensus-based Approach
Wenxing Liu, Hanlin Niu, Wei Pan, Guido Herrmann, Joaquin Carrasco

TL;DR
This paper introduces CSAR, a consensus-based deep reinforcement learning algorithm that enhances sim-and-real training for robot manipulation, achieving better efficiency and effectiveness in policy learning.
Contribution
The paper proposes a novel CSAR algorithm that trains agents in both simulation and real-world environments, revealing key phenomena about policy optimality and the benefits of multiple simulation agents.
Findings
Best simulation policy is not optimal for sim-and-real training.
Increasing the number of simulation agents improves training performance.
CSAR achieves comparable performance in sim-and-real manipulation tasks.
Sim-and-Real Reinforcement Learning for Manipulation:
A Consensus-based Approach
Wenxing Liu1,3, Hanlin Niu1,3, Wei Pan2, Guido Herrmann3 and Joaquin Carrasco3
This work was supported by EPSRC project No. EP/S03286X/1, EPSRC RAIN project No. EP/R026084/1, EPSRC RNE project No. EP/P01366X/1 and UKAEA/EPSRC Fusion Grant 2022/2027 No. EP/W006839/1.
1 Wenxing Liu and Hanlin Niu are with Remote Applications in Challenging Environments (RACE), United Kingdom Atomic Energy Authority, Culham, UK. (Corresponding author: Hanlin Niu. [email protected]).
2 Wei Pan is with the Department of Computer Science, The University of Manchester, Manchester, UK.
3 Wenxing Liu, Hanlin Niu, Guido Herrmann and Joaquin Carrasco are with the Department of Electrical & Electronic Engineering, The University of Manchester, Manchester, UK.
Abstract
Sim-and-real training is a promising alternative to sim-to-real training for robot manipulation. However, current sim-and-real training is neither efficient, i.e., it converges slowly to the optimal policy, nor effective, i.e., it requires sizeable real-world robot data. Given limited time and hardware budgets, the performance of sim-and-real training is not satisfactory. In this paper, we propose a Consensus-based Sim-And-Real deep reinforcement learning algorithm (CSAR) for manipulator pick-and-place tasks, which shows comparable performance in both simulated and real worlds. In this algorithm, we train the agents in simulators and the real world to obtain the optimal policies for both worlds. We found two interesting phenomena: (1) The best policy in simulation is not the best for sim-and-real training. (2) The more simulation agents, the better the sim-and-real training. The experimental video is available at: https://youtu.be/mcHJtNIsTEQ.
I INTRODUCTION
As an essential component in robotic control, deep reinforcement learning (DRL) has been widely used in various applications [1, 2, 3]. The training process of DRL [4] builds the bridge between the environment state and the action, thereby maximizing the cumulative reward. Learning from the simulation is safer, cheaper and faster while learning from the real world is more dangerous, expensive and slower. If the simulation shows high fidelity, the training model in the simulation can be transferred directly to the real world. However, in many circumstances, the simulation cannot mimic the real world very well, which limits robot performance in the real world. To overcome this difficulty, we develop a sim-and-real training method to balance the relationship between the simulation and the real world. We use concepts from control engineering, i.e. consensus [5, 6], to accomplish sim-and-real training.
In this work, we propose a CSAR algorithm that combines consensus-based training with DRL in a sim-and-real environment, as shown in Fig. 1. We apply CSAR to a group of simulated agents together with a real agent, each learning to carry out a pick-and-place task with a suction robot device. Compared to conventional sim-to-real training methods, the challenges of CSAR DRL are 1) information exchange between simulated and real robots, for instance, generating communication in a mixed environment, 2) data-efficient collection for training in a sim-and-real environment, such as handling data from multiple robots simultaneously, and 3) data pre-labelling for suctioning in a sim-and-real environment, for example, using ArUco markers to locate suctioned objects. The main contributions of this paper can be summarised as follows:
1. A complete CSAR method is proposed for manipulators to learn pick-and-place tasks. By applying consensus-based training, the proposed method saves training time and reduces the number of required real robot training steps while maintaining a comparable suction success rate, which is cost-effective.
2. An end-to-end and lightweight neural network is proposed to train the suction policy, which uses raw 3D visual data directly without pre-labelling. The effectiveness and feasibility of the CSAR method are validated through simulation and real-world experiments.
3. We extend the consensus approach [7] from theory and simulations to a real-world pick-and-place problem and show the effectiveness of the proposed approach.
The structure of the paper proceeds as follows. Section II elucidates the related work. Section III details the CSAR algorithm. Experimental validation is given in Section IV to highlight the feasibility of our proposed algorithm. Section V summarizes this paper.
II RELATED WORK
Sim-to-real: When a DRL model is transferred from simulation to the real world, the adaptation problem becomes challenging as real-world environments contain unpredictable disturbances [8]. Fine-tuning has been widely used to bridge the gap between simulated and real environments [9, 10, 11]. However, fine-tuning usually takes a long time to perform parameter adaptation, which increases the experimental cost. Some recent works train only in simulation yet work well in the real world. For instance, using only simulation, a distance function between the current pose and the nearest optimal pose was trained in [12]. In [13], a grasp quality network was proposed to evaluate robust grasp configurations based on the antipodal grasp sampling method. The key idea of these two papers is to use depth data rather than RGB images since depth images contain less information. Nevertheless, it is challenging for a depth camera to measure thin, dark-coloured objects because of their physical properties in the real world. Under this condition, performance cannot be guaranteed. Our approach focuses on reducing the gap between simulation and the real world, which is more general and flexible.
Sim-and-real: Sim-and-real training is a recent topic which focuses on adjusting the simulation according to real samples [14, 15]. A novel domain adaptation approach for robot perception was developed in [16] to close the sim-and-real gap by finding common features of real and synthetic data. In [17], the agent’s parameters in the simulation were updated to match the behaviour in the real world. Compared with [17], our approach is capable of dealing with situations that are hard to simulate precisely. In [18], one agent was used to select a simulated or real environment with a given probability and to interact with it. Transitions from all environments were stored in a common replay buffer to update training parameters. In contrast to [18], our method creates a sim-and-real environment directly with consensus, which avoids any form of transition between environments and instead runs simulated and real agents in parallel. To the best of our knowledge, this has not been done before.
Reinforcement learning for manipulation: Reinforcement learning has been exploited to deal with robotic tasks [7, 19, 20]. In our work, we focus on improving training efficiency and saving real-world training costs of sim-and-real DRL for robotic manipulators. An efficient real-time hybrid path planning scheme was proposed in [21] to handle the uncertain dynamics of a robot manipulator by combining the probabilistic roadmap method with DRL. How to modulate the elementary movement of a robot arm through meta-parameters using reinforcement learning was proposed in [22]. In [23], a robotic manipulator was trained using DRL to solve the task of grasping an initially invisible object via a sequence of grasping and pushing actions. A high-precision peg-in-hole target task was selected in [24] for force-controlled robotic assembly with DRL. Specifically, the force and moment of the robotic manipulator end effector were chosen as the state. Nevertheless, the authors target solving specific low-level tasks such as motion planning in the work mentioned above. Our method pays attention to high-level tasks by treating each robot manipulator as an agent, which is more general and has a wider range of applications.
III METHODOLOGY
We extend the consensus-based approach in [7], which only focuses on simulations, to sim-and-real scenarios. The proposed effective and efficient CSAR method can increase sim-and-real training speed as well as save real-world training costs with consensus-based training.
III-A System Overview
Fig. 2 gives an overview of our proposed framework. The predefined workspace in the simulation is captured by a fixed simulated camera, which provides an ideal RGB-D image each time. The ideal RGB-D image is then orthographically projected in the direction of gravity to construct the colour heightmap and the depth heightmap, which are the inputs of our framework. Both heightmaps are fed into the Q-function neural network to predict the pixel-wise best suction position. Since these neural networks model the Q-function for pick-and-place success through suction gripping, we call them “suction networks”. The suction height can be found from the depth heightmap.
In the real world, the predefined workspace is captured by a fixed Azure Kinect camera. Compared with the ideal RGB-D image obtained from the simulated camera, the real-world RGB-D image contains more camera distortion [25]. Similarly, the real-world RGB-D image is orthographically projected in the direction of gravity to construct the colour heightmap and the depth heightmap, which are also fed into the suction network to predict the real-world pixel-wise best suction position. The suction height can also be acquired from the depth heightmap.
After performing predictions in both environments, consensus-based training is applied to the training parameters of each simulated or real agent. The suction process of each agent is carried out in parallel, which saves training time.
III-B DRL Setup
III-B1 Action Space
As stated in Section III-A, the action space is a Cartesian motion command that consists of the pixel-wise best suction position, defined separately for the simulated environment and for the real world. In both cases, the suction height can be acquired from the corresponding depth heightmap.
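The action selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `best_suction_action`, the toy 3×3 maps, and the assumption that the suction height is read from the depth heightmap at the selected pixel are all introduced here for illustration.

```python
def best_suction_action(q_map, depth_heightmap):
    """Return (row, col, height) for the pixel with the largest Q value.

    q_map and depth_heightmap are 2D lists of the same size (hypothetical
    stand-ins for the suction network output and the depth heightmap).
    """
    best_value = float("-inf")
    best_pixel = (0, 0)
    for r, row in enumerate(q_map):
        for c, q_value in enumerate(row):
            if q_value > best_value:
                best_value = q_value
                best_pixel = (r, c)
    r, c = best_pixel
    # Assumption: the suction height is the depth heightmap value at that pixel.
    return r, c, depth_heightmap[r][c]


# Toy example values (not taken from the paper).
q_map = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.2],
    [0.0, 0.4, 0.1],
]
depth_heightmap = [
    [0.00, 0.00, 0.00],
    [0.00, 0.05, 0.05],
    [0.00, 0.05, 0.00],
]
action = best_suction_action(q_map, depth_heightmap)
```

The same routine would apply to a simulated or a real agent, since both feed heightmaps of the same form into their suction network.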
III-B2 State Space
As shown in Fig. 2, the state space consists of the colour heightmap and the depth heightmap of the captured RGB-D image. In the simulated environment, both heightmaps are acquired by the fixed simulated camera. In the real environment, they are obtained from the fixed Azure Kinect camera.
III-B3 Reward Space
The distance in the simulated environment can be computed by
[TABLE]
where and denote positions of the centre of the expected suctioned object of the agent, respectively.
We assign suction reward if the target is successfully suctioned, otherwise . Thus, the DRL reward for each agent in the simulation can be defined as
[TABLE]
where stands for the reward of the agent in the simulated environment, represents distance threshold of the agent, and are the positive reward when is within the corresponding range.
The DRL reward for each agent in the real environment is given by
[TABLE]
III-B4 Neural Network Structure
As shown in Fig. 2, the input of the suction net is passed through ResNet-18 [26] to extract concatenated features from the colour heightmap and the depth heightmap. These features are fed into a Batch Normalization layer [27] with 1024 input features, a ReLU layer [27], and a Convolution layer [27] with 1024 input channels and 1 output channel, and are then processed by a bilinear upsample layer [27] with a scale factor of 16. The output of the suction net has the same image size as the heightmap input and is a dense pixel-wise map of Q values. The pixel with the maximum Q value represents the best suction position.
Remark 1
It should be noted that the suction net can be substituted by any state-of-the-art neural network. Since we use a standard laptop for training, we purposely design a lightweight version of the suction net inspired by [28].
During each training iteration , the training objective is to minimize the temporal difference error [29]:
[TABLE]
where and represents all available actions, stands for the discount factor, represents the action-value function, is the reward, stands for the training parameters of the suction network at time , denotes the target training parameters.
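As a rough illustration of the quantities named above, the sketch below computes a temporal difference error in the standard deep Q-learning form, where the target is the reward plus the discounted maximum target-network Q value over the available actions. This standard form is an assumption, since the paper's own expression is not reproduced here; the function name and the toy numbers are hypothetical, and the discount factor of 0.5 is the value reported in Section IV-A4.

```python
def td_error(q_current, reward, next_q_values, gamma=0.5):
    """Temporal difference error in the standard deep Q-learning form.

    q_current     : Q value of the executed action under the current parameters
    reward        : reward received after executing the action
    next_q_values : Q values of all available next actions under the
                    target parameters
    gamma         : discount factor
    """
    target = reward + gamma * max(next_q_values)
    return target - q_current


# Toy numbers (not from the paper).
delta = td_error(q_current=0.4, reward=1.0, next_q_values=[0.2, 0.6, 0.1])
```

Here the target is 1.0 + 0.5 × 0.6 = 1.3, so the error is about 0.9.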
III-B5 Loss function
Inspired by [28], we use the Huber loss function [30] to train our proposed suction network in both simulated and real environments. The loss function at each iteration can be computed as follows:
[TABLE]
Gradients are only passed through the single pixel on which the action is executed during each iteration . All other pixels propagate with 0 loss [28].
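The single-pixel loss described above can be sketched as follows. The Huber loss is written in its textbook form; the threshold of 1.0, the function names, and the toy map size are assumptions made for illustration and are not stated in the paper.

```python
def huber_loss(delta, kappa=1.0):
    """Textbook Huber loss: quadratic near zero, linear for large errors.

    kappa is the switching threshold (1.0 is an assumed value).
    """
    if abs(delta) <= kappa:
        return 0.5 * delta * delta
    return kappa * (abs(delta) - 0.5 * kappa)


def pixelwise_loss_map(height, width, action_pixel, delta):
    """Loss map that is zero everywhere except at the executed pixel."""
    loss_map = [[0.0] * width for _ in range(height)]
    row, col = action_pixel
    loss_map[row][col] = huber_loss(delta)
    return loss_map


# Toy example: a 3x3 map where the action was executed at pixel (1, 2).
loss_map = pixelwise_loss_map(3, 3, action_pixel=(1, 2), delta=0.5)
total_loss = sum(sum(row) for row in loss_map)
```

Because every other entry is zero, only the executed pixel contributes to the gradient in a framework that backpropagates through such a map.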
III-C Consensus-based Training
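The Q-function of each simulated or real agent is trained through a consensus-based algorithm. We therefore first introduce the consensus network structure that facilitates this training process. The interaction topology of $N$ agents can be depicted by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ represents the vertex set and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ stands for the edge set. The edge $(i, j) \in \mathcal{E}$ if the $i$-th and $j$-th agents are connected with one another [31]. The adjacency matrix of $\mathcal{G}$ can be described as $A = [a_{ij}] \in \mathbb{R}^{N \times N}$, where $a_{ij} = 1$ if $(i, j) \in \mathcal{E}$, otherwise $a_{ij} = 0$. Hence, the Laplacian matrix of $\mathcal{G}$ is defined as $L = [l_{ij}] \in \mathbb{R}^{N \times N}$, where $l_{ii} = \sum_{j \neq i} a_{ij}$ and $l_{ij} = -a_{ij}$ for $i \neq j$ [32]. For an undirected topology, $L$ is positive semi-definite and $L \mathbf{1}_N = 0$, where $\mathbf{1}_N = [1, \dots, 1]^T$. If the graph has a spanning tree, the rank of $L$ is $N - 1$ [32].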
For an undirected graph , if represents the updated training parameter of after a single consensus step and stands for the row vector of the training parameter for agent in the graph, the consensus training step of each agent can be described as
[TABLE]
where , the element of the graph adjacency matrix, is engendered by the undirected graph and stands for the input of the agent .
By integrating (7) and (6), the consensus algorithm can be used to update an agent in the following scheme:
[TABLE]
where is the element of the Laplacian matrix and represents the consensus protocol of the agent. The training parameter update of all the agents with a single consensus step can be summarised as
[TABLE]
where stands for the consensus protocol for all agents, and denote the and identity matrix, represents the Laplacian matrix. By repetitively computing (9), this consensus algorithm makes all agents converge to their weighted average [33].
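The effect of repeated consensus steps can be illustrated with a small sketch. It uses the standard discrete-time consensus update, in which each agent subtracts a step size times the Laplacian-weighted sum of all agents' parameters; this form, the step size of 0.3, the line topology, and the initial values are all assumptions chosen for illustration and are not taken from the paper.

```python
def laplacian(adjacency):
    """Graph Laplacian (degree matrix minus adjacency matrix).

    Assumes a symmetric adjacency matrix with no self-loops.
    """
    n = len(adjacency)
    return [
        [sum(adjacency[i]) if i == j else -adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def consensus_step(params, adjacency, step_size):
    """One consensus step: every agent moves towards its neighbours."""
    lap = laplacian(adjacency)
    n, m = len(params), len(params[0])
    return [
        [
            params[i][k] - step_size * sum(lap[i][j] * params[j][k] for j in range(n))
            for k in range(m)
        ]
        for i in range(n)
    ]


# Hypothetical line topology 0 - 1 - 2 - 3 (not necessarily the one in Fig. 4).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
params = [[1.0], [2.0], [3.0], [6.0]]  # one scalar parameter per agent
for _ in range(200):
    params = consensus_step(params, adjacency, step_size=0.3)
```

For the symmetric, unit-weight graph assumed here, all four agents end up at the plain average of the initial values, 3.0. The step size has to be small relative to the node degrees for the iteration to converge.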
III-D Consensus-based Training with DRL
Given the consensus network structure in the previous sub-section, the training algorithm for the training parameters in (4) in the DRL is now introduced. As stated in [34], the process of updating for the agent is given as:
[TABLE]
where represents the learning rate.
By applying (8), the training process of the CSAR algorithm can be summarised as:
[TABLE]
where .
Substituting (11) into (12), we can get
[TABLE]
Let , for agents, the update of the training parameters in our suction network in the iteration can be illustrated as
[TABLE]
Algorithm 1 summarizes our CSAR algorithm.
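The interplay between the consensus term and the gradient step can be sketched on a toy problem. The update below, in which each agent subtracts a consensus gain times its disagreement with its neighbours and a learning rate times its own loss gradient, is one common way of combining the two and is an assumption here, not a transcription of Algorithm 1. The fully connected topology, the quadratic losses, the gains, and the targets are all hypothetical.

```python
def csar_like_step(thetas, grads, adjacency, consensus_gain, learning_rate):
    """One combined update: consensus with neighbours plus a gradient step."""
    n = len(thetas)
    updated = []
    for i in range(n):
        # Sum of weighted differences between agent i and its neighbours.
        disagreement = sum(
            adjacency[i][j] * (thetas[i] - thetas[j]) for j in range(n)
        )
        updated.append(
            thetas[i] - consensus_gain * disagreement - learning_rate * grads[i]
        )
    return updated


# Hypothetical setup: 3 "simulated" agents and 1 "real" agent, fully connected.
adjacency = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
# Each agent minimises 0.5 * (theta - target)^2; the last target differs,
# standing in for a gap between simulation and the real world.
targets = [1.0, 1.0, 1.0, 2.0]
thetas = [0.0, 0.0, 0.0, 0.0]
for _ in range(500):
    grads = [theta - target for theta, target in zip(thetas, targets)]
    thetas = csar_like_step(
        thetas, grads, adjacency, consensus_gain=0.2, learning_rate=0.1
    )

mean_theta = sum(thetas) / len(thetas)
spread = max(thetas) - min(thetas)
```

In this toy setting the agents settle close to one another, with a spread of about 0.11 compared with a target spread of 1.0, and their mean equals the mean of the individual targets. This is a compromise between the simulated and real optima rather than either one alone.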
IV EXPERIMENTS AND RESULTS
The feasibility of the CSAR algorithm is validated in this section. The system is implemented on a standard laptop with an Nvidia GeForce RTX 2070 Super GPU and an Intel Core i7 CPU (2.6 GHz) with 16 GB RAM. The experimental video is available at: https://youtu.be/mcHJtNIsTEQ.
IV-A Experiment Setup
IV-A1 Simulation
Our system in the simulated environment is trained in Coppeliasim [35] with Bullet Physics 2.78 for dynamics, as demonstrated in Fig. 1. The simulation setup for each agent consists of a UR5 robot arm with a suction gripper [36]. The suctioned objects in the simulated environment are cubes with a side length of cm. The motion planning task for each UR5 robot arm is accomplished by Coppeliasim [35] internal inverse kinematics. Simulated cameras are used to capture RGB-D images of each agent in a m2 workspace. The resolution of the simulated RGB-D images is .
IV-A2 Real World
The setup for each agent in the real environment is composed of a UR5 robot arm with a Robotiq EPick vacuum gripper. The suctioned objects are cubes with a side length of cm. To pick and place objects successfully with the suction gripper in the sim-and-real environment, the objects should have a flat surface and no overlap between objects placed in the workspace. We use a fixed Azure Kinect camera to acquire real-world RGB-D images with a resolution of . The location of the Azure Kinect camera is shown in Fig. 1, which can generate a top-down view in a m2 workspace.
IV-A3 Reward
Depending on the intrinsic and distortion of the Azure Kinect camera and the size of our suction gripper, we assign , , , and m in (2). These values can also be reconfigured for other robotic platforms.
IV-A4 Neural Network
The proposed framework is trained entirely under self-supervision through the interactions between the UR5 robot arms and the sim-and-real environment. The learning rate in (10) has a fixed value of 0.0001. The discount factor in (4) is set to 0.5. The total number of training steps is set to 270. Algorithm 1 uses an ε-greedy exploration strategy, with ε initialised at 0.5 and annealed to 0.1 over training. The simulated camera and the Azure Kinect camera capture RGB-D images to generate colour and depth heightmaps, which are fed into the suction nets to predict pixel-wise best suction positions.
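The exploration schedule can be sketched as follows. The start value of 0.5, the end value of 0.1, and the 270 training steps come from the text above; the linear shape of the annealing and the function names are assumptions, since the annealing schedule is not specified.

```python
import random


def epsilon_at(step, total_steps=270, start=0.5, end=0.1):
    """Exploration rate at a given step, annealed linearly (assumed shape)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def select_pixel(q_map, epsilon, rng):
    """Epsilon-greedy choice between a random pixel and the best-Q pixel."""
    height, width = len(q_map), len(q_map[0])
    if rng.random() < epsilon:
        return rng.randrange(height), rng.randrange(width)
    pixels = ((r, c) for r in range(height) for c in range(width))
    return max(pixels, key=lambda rc: q_map[rc[0]][rc[1]])


# Toy Q map (not from the paper); with epsilon = 0 the choice is purely greedy.
q_map = [
    [0.1, 0.7],
    [0.3, 0.2],
]
greedy_pixel = select_pixel(q_map, epsilon=0.0, rng=random.Random(0))
```

With this schedule the exploration rate at the midpoint of training (step 135) is about 0.3.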
IV-A5 Evaluation Metric
The suction performance of each agent is evaluated using its suction success rate, defined as the number of successful target suctions of the agent divided by the number of iterations of that agent.
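The metric amounts to a simple ratio, sketched below. The function name and the example outcomes are hypothetical, and returning 0.0 before any iteration has run is a convention chosen here.

```python
def suction_success_rate(outcomes):
    """Fraction of iterations in which the target was successfully suctioned.

    outcomes is a list of booleans, one per iteration of a single agent.
    """
    if not outcomes:
        return 0.0  # convention chosen here for an agent with no iterations
    successes = sum(1 for succeeded in outcomes if succeeded)
    return successes / len(outcomes)


# Toy record of four iterations (not experimental data).
rate = suction_success_rate([True, False, True, True])
```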
We explore various training strategies to discover the most suitable training conditions for robots:
Sim-and-Real: Only simulation samples are used to train and optimise the model initially. When the suction success rate in the simulation reaches 0.5, we switch to the CSAR method with 3 simulated robots and 1 real robot.
Sim-to-Real: Only simulation samples are used to train and optimise the model at the beginning. When the suction success rate in the simulation reaches 0.5, we switch to real-world training with 1 real robot.
IV-B Sim-and-Real is Better Than Sim-to-Real
Fig. 3 demonstrates the suction success rate of the real robot using two different training strategies. The interaction topology of Sim-and-Real is shown in Fig. 4 (c). When applying the Sim-and-Real strategy, the suction success rate of the real robot reaches 80% at around 140 training steps, which outperforms the Sim-to-Real strategy. Since our policy for each robot is greedy deterministic, a robot may execute the same action repetitively if there is no environment change when using the Sim-to-Real training strategy. However, by applying consensus-based training, the simulated agent can be used to introduce noise indirectly into the sim-and-real environment, which prevents robots from getting stuck in the same action. In summary, applying the Sim-and-Real strategy leads to a faster training speed, which saves real-world training costs.
IV-C Best Policy in Simulation is Not the Best for Sim-and-Real Training
A striking observation from our experiment is that the best policy obtained in simulation is not the best pre-trained model with which to start co-training between simulated and real robots, as shown in Fig. 5. When the suction success rate of the pre-trained simulation model is 0.5, the Sim-and-Real strategy achieves the best performance. When the suction success rate drops to 0.3, it takes longer for the real robot to solve the task. Surprisingly, when the suction success rate of the pre-trained simulation model is too high (0.7, 0.9), the performance deteriorates.
This is counterintuitive, since in Sim-to-Real training the best policy obtained in simulation is typically the one to be deployed. The observation suggests that a “mediocre” policy is the best starting point for co-training. When the success rate of the pre-trained simulation model is too high, the sim-and-real framework is initialised at a value close to the optimal simulation value, and it then takes longer to converge to the mixed optimum in a sim-and-real environment. As a result, applying the “mediocre” policy can reduce real robot training costs and save pre-training time in the simulation.
IV-D The More Agents in Simulation, the Better for Sim-and-Real Training
To justify the use of 3 simulated robots and 1 real robot during training, we vary the number of simulated robots when using the Sim-and-Real strategy. Fig. 6 shows the suction success rate for different numbers of simulated robots. The interaction topologies used in Fig. 6 are shown in Fig. 4. With 1 simulated robot and 1 real robot, it takes around 260 steps for the real robot to reach an 80% suction success rate. With 2 simulated robots and 1 real robot, the required number of training steps drops to around 240. With 3 simulated robots and 1 real robot, only around 140 steps are required to reach the same suction success rate. Adding more simulated robots to the proposed framework accelerates training and exhibits good robustness in the sim-and-real environment, thus decreasing the number of required real robot training steps while maintaining a comparable suction success rate.
IV-E Generalisation of Real-world Unseen Objects
The Sim-and-Real strategy is capable of generalising to novel objects (Fig. 7) with a suction success rate of . After training on cubes in both simulated and real environments, the CSAR training model can also be applied to pick and place novel objects such as cylinders and irregularly shaped objects with different heights, as shown in Fig. 8.
V CONCLUSION
In this work, we propose a CSAR approach which is able to improve sim-and-real training speed and reduce real-world training costs. By implementing the Sim-and-Real strategy, the suction success rate of the real robot attains 80% at around 140 training steps, which outperforms the Sim-to-Real strategy. Applying the “mediocre” policy can not only reduce the number of required real robot training steps but also save the pre-training time in the simulation. More simulated robots participating in the CSAR method increase the training speed, thereby reducing real-world training expenses. The Sim-and-Real strategy is also capable of generalising to novel objects. The CSAR method is a straightforward generalization and practical verification of the team’s recently developed theory of a consensus-based RL approach [7]. In the future, an optimisation of the CSAR approach will be exploited to tackle more complicated scenarios.
REFERENCES
[1] B. Beyret, A. Shafti, and A. A. Faisal, “Dot-to-dot: Explainable hierarchical reinforcement learning for robotic manipulation,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 5014–5019.
[2] Y. Han, I. H. Zhan, W. Zhao, J. Pan, Z. Zhang, Y. Wang, and Y.-J. Liu, “Deep reinforcement learning for robot collision avoidance with self-state-attention and sensor fusion,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 6886–6893, 2022.
[3] R. Han, S. Chen, S. Wang, Z. Zhang, R. Gao, Q. Hao, and J. Pan, “Reinforcement learned distributed multi-robot navigation with reciprocal velocity obstacle shaped rewards,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 5896–5903, 2022.
[4] R. S. Sutton, “Introduction: The challenge of reinforcement learning,” in Reinforcement Learning. Springer, 1992, pp. 1–3.
[5] K. Wu, J. Hu, B. Lennox, and F. Arvin, “SDP-based robust formation-containment coordination of swarm robotic systems with input saturation,” Journal of Intelligent & Robotic Systems, vol. 102, no. 1, pp. 1–16, 2021.
[6] K. Wu, J. Hu, Z. Ding, and F. Arvin, “Finite-time fault-tolerant formation control for distributed multi-vehicle networks with bearing measurements,” IEEE Transactions on Automation Science and Engineering, 2023.
[7] W. Liu, H. Niu, I. Jang, G. Herrmann, and J. Carrasco, “Distributed neural networks training for robotic manipulation with consensus algorithm,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2022.
[8] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 3803–3810.
