On Training Flexible Robots using Deep Reinforcement Learning
Zach Dwiel, Madhavun Candadai, Mariano Phielipp

TL;DR
This paper investigates the effectiveness of deep reinforcement learning in training flexible robots, demonstrating its ability to learn robust policies for complex tasks despite sensor sensitivities.
Contribution
It systematically evaluates DRL methods for flexible robot control, highlighting their potential and limitations in real-world adaptable robotics.
Findings
DRL can learn efficient policies for flexible robots.
Deep Deterministic Policy Gradients are sensitive to sensor choices.
Adding more sensors does not always simplify learning.
Zach Dwiel (1, *), Madhavun Candadai (2, *) and Mariano Phielipp (3)
* With equal contributions.
1 Zach Dwiel is with Intel AI Labs, Bloomington, IN, U.S.A.
2 Madhavun Candadai is with the Program in Cognitive Science, and School of Informatics, Computing and Engineering, at Indiana University, Bloomington, IN, U.S.A.
3 Mariano Phielipp is with Intel AI Labs, Phoenix, AZ, U.S.A.
Abstract
The use of robotics in controlled environments has flourished over the last several decades, and training robots to perform tasks using control strategies developed from dynamical models of their hardware has proven very effective. However, in many real-world settings, the uncertainties of the environment, the safety requirements and the generalized capabilities that are expected of robots make rigid industrial robots unsuitable. This has created great research interest in developing control strategies for flexible robot hardware, for which building dynamical models is challenging. In this paper, inspired by the success of deep reinforcement learning (DRL) in other areas, we systematically study the efficacy of policy search methods using DRL in training flexible robots. Our results indicate that DRL is successfully able to learn efficient and robust policies for complex tasks at various degrees of flexibility. We also note that DRL using Deep Deterministic Policy Gradients can be sensitive to the choice of sensors, and that adding more informative sensors does not necessarily make the task easier to learn.
I INTRODUCTION
Inspired by the robustness and adaptability of natural systems, roboticists are becoming increasingly aware of the benefits of building soft and flexible robots [1, 2]. While research in soft robots involving stretchable electronics is on one end of the spectrum, the other, perhaps more populated end, is dominated by research involving strict rigidity constraints on robot design. Fully soft robots could potentially mimic living systems and hence significantly outperform rigid robots, but pose many challenges in their fabrication and control [2]. Rigid robots, on the other hand, are easy to control because simple models of their dynamics can be built, but are heavy and expensive due to their strict compliance requirements and are constrained in their adaptability [3]. Somewhere along this spectrum lies a class of robots that could be built from rigid hardware but can nevertheless allow for flexibility in their joints and material. Research in this area involving the use of flexible actuators has been of interest for several decades now and has shown great promise [4, 5]. Reliable training methods for this class of robots would enable higher payload-to-weight ratios, higher speeds, lower cost, and safer operation due to less inertia [6].
With the aim of furthering this line of research, the primary questions that this paper addresses are the following:
1. Can policy search serve as a reliable methodology for end-to-end training of flexible robots?
2. Does learning with policy search enable generalization to levels of flexibility not seen during training?
3. How sensitive are policy search methods to choice of sensors?
Deep artificial neural networks in conjunction with reinforcement learning, namely deep reinforcement learning (DRL), have been shown to be successful in a wide variety of problems [7]. As a learning methodology, a major advantage of the "deep" feature in DRL is that the task-relevant features are also learnt and do not have to be hand-designed. Furthermore, DRL can learn complex tasks merely with reward signals, as opposed to supervised learning where true labels are required. The presence of a reward function makes it easy to generate training feedback signals, removing the need for huge training datasets. In this paper we demonstrate that policy search using deep artificial neural networks on flexible hardware results in performance that is at least as good as (and sometimes better than) performance on rigid hardware.
While learning and performing under flexibility is a desirable property in training algorithms, another important property is robustness to changes in flexibility. This is particularly the case when policies learned under specific levels of flexibility are transferred over to other levels of flexibility. An algorithm that is robust to different levels of flexibility places lower demands on the simulator to match the physical robot precisely or, if training on real robots, lower demands on the precision and consistency between robots. It also allows for sim2real transfer from one simulation to several robots that could vary in their flexibility, thus reducing constraints on robot hardware design to meet strict rigidity requirements. Furthermore, being robust to different levels of flexibility allows a policy to adapt to changes in robot dynamics due to age and environmental factors. In this paper we study the robustness of policies learned using policy search and demonstrate that they are in fact robust across a wide range of flexibility levels.
Another aspect of designing flexible robots we study in this paper is the choice of sensors. Building models of robot dynamics often involves the assumption of a particular set of sensors, which then become part of the system of equations that define the model. With end-to-end training and policy search, there is no constraint on what sensors can be included, and adding or removing sensors adds minimal overhead to the robot design in comparison to model-based approaches. This can be an advantage because it is easy to try different sensor combinations, but it can also be a disadvantage because sensor choice now becomes a hyper-parameter and the sensitivity of training to sensor choice needs to be understood. In this paper, we demonstrate that policy search is sensitive to sensor choice and that adding sensors to provide more information to the learning algorithm is not necessarily helpful.
Being able to train flexible hardware using policy search provides several advantages to robot designers - it lowers the cost of robot hardware by dropping the rigidity constraint; complex hand-designed models of robot dynamics no longer need to be built, as policy search with DRL is a general purpose learning approach that enables end-to-end training; and it allows driving the same hardware at higher speeds and higher payloads. Our demonstration of robustness of learned policy to flexibility implies resiliency across robots of different levels of flexibility as well as resiliency across physical changes over the course of the robot’s lifetime. This removes the need for periodically re-calibrating the model dynamics, with the model potentially getting increasingly complex with wear and tear of parts. Ultimately, this approach takes us one step closer towards robots that are adaptive and robust like natural systems.
The rest of the paper is organized as follows: the next section discusses related work in this area, followed by a section outlining the tasks and the specific policy search method used, and then by sections detailing the experiments that were run, the results of those experiments, a discussion of the results and finally a conclusion.
II RELATED WORK
Robot controller design is dominated by building precise mathematical models of the robot's dynamics. It is not always practical to build a general model of a robot’s dynamics that is invariant to the various real-world factors ranging from noise to changes in the environment, motor backlash, motor torque output, or the focus of this paper, link flexibility. In such cases, reinforcement learning and policy search algorithms that can learn from a robot’s experience have been shown to be successful [8, 9] for tasks such as object manipulation [10, 11, 12], locomotion [13, 14, 15, 16] and flight [17]. However, most of this work involves using a model-free component to approximate features of the robot or the world that cannot be modeled while still using model-based controllers for other parts of the system [12, 18].
In work where flexibility is taken into consideration, learning is still based either on building a more complex model [6, 19, 20], on an approximate model [21], or on plugging a learned-model component into a model-based controller. Recently, end-to-end model-free methods using deep reinforcement learning have been demonstrated successfully on rigid real robots [22, 23, 16]. The authors of [16] have shown that learning directly in hardware is possible with policy search, and they rightly point out that even in simple tasks, factors such as joint slackness make it very difficult to train. While they and [24] have shown that this is possible using policy search, this paper systematically studies how policy search methods perform with flexible hardware.
III METHODS
III-A Robot design
In order to systematically study the effect of flexibility on learning, we set up a single-link robot arm in MuJoCo [25] in OpenAI Gym-like environments, with a “pseudo-joint” in the middle of the arm that the learner could not sense directly. The pseudo-joint was modeled with a spring joint so as to mimic a material bending under stress below its proportionality limit, which is the stress limit under which the material's stress-strain curve follows Hooke’s Law. We were able to scan through different levels of flexibility by adjusting the time-constant on the middle joint. The learning algorithm could only directly control the base joint and not the spring joint. The choice of sensors plays a crucial role in the amount of information that is available to the learner about the hardware flexibility. Intuitively, first-order measures such as angle sensors would provide less information when compared to second- or third-order sensors such as accelerometers. Also, sensors placed on the tip of the arm and feet of the robots will be more sensitive to the flexing of the link than sensors on the joints. In order to study the ability of the policy search method to learn and potentially take advantage of flexibility, we tried three different sensor configurations: observations that included inertial measurement units (IMUs) along with angle sensors, observations that only had angle sensors, and observations that only had the IMUs. Besides these sensors, task-specific inputs were provided depending on the nature of the task.
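The three sensor configurations can be pictured as three ways of assembling the observation vector. The following is a minimal sketch, not the authors' environment code: the configuration names, the function `build_observation`, and the array layouts are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical labels for the three sensor configurations described in the text.
SENSOR_CONFIGS = ("angles", "imu", "angles+imu")


def build_observation(config, joint_angles, imu_readings, task_inputs):
    """Concatenate the sensor readings selected by `config` with the task inputs.

    All argument names and shapes are assumptions made for this sketch.
    """
    if config not in SENSOR_CONFIGS:
        raise ValueError(f"unknown sensor configuration: {config!r}")
    parts = []
    if config in ("angles", "angles+imu"):
        parts.append(np.asarray(joint_angles, dtype=float).ravel())
    if config in ("imu", "angles+imu"):
        parts.append(np.asarray(imu_readings, dtype=float).ravel())
    # Task-specific inputs (e.g. a goal vector) are always included.
    parts.append(np.asarray(task_inputs, dtype=float).ravel())
    return np.concatenate(parts)
```

Under this layout the learner never sees the pseudo-joint angle directly; any information about flexing has to come through the IMU readings or the task inputs.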
III-B Task design
This section outlines the different tasks that the flexible robot arm was optimized to perform. We designed four tasks with increasing levels of difficulty based on the three canonical abilities that are typically expected of robot arms - navigating to a specific location, acting on objects in the environment and manipulating objects in the environment based on a goal. The ability of the policy search method to learn these canonical tasks under different levels of flexibility (including no flexibility) was tested for each task with different sensor configurations.
III-B1 Reacher
The first task we tested on involved moving the tip of a flexible arm to a specific goal position by controlling the base joint torque (Fig. 1A). Besides the sensors specified above, the vector between the end of the robot arm and the goal position was added to the observations, as in the Gym environment this task was derived from. Note that, due to its flexibility, the robot arm could be in many different configurations while observing the same goal vector. The robot arm was started at random positions within a range of radians from a reference position. The task was set up to incentivize reaching the goal as soon as possible, by providing a reward of -1 for every time step the arm was not at the goal and a reward of 0 when it was within a Euclidean threshold distance of the goal. For reference, the length of the arm is . Besides the distance-based reward, there was an energy penalty in each time step that was estimated based on the motor action as . A training episode lasted until the goal was reached or for 200 time steps, whichever was earlier.
III-B2 Reacher Stay
This task was identical to “Reacher” except that in this case, the robot arm was expected to stay at the goal position upon reaching it (Fig. 1B). This was achieved by not finishing a training episode upon reaching the goal and instead providing a reward of 0 for staying at the goal position, and again a reward of -1 for each step it was not at the goal position. An episode in this case lasted for a fixed duration of 200 time steps, during which reward could be maximized by reaching the goal position as quickly as possible and staying there. Note that in flexible robots, efficiently performing this task requires that vibrations are damped so as to arrive at a stop in the goal position and stay there.
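The reward structure of the two reaching tasks can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the goal threshold and the energy penalty formula are not given in this text, so `goal_threshold` and `energy_coeff` are left as caller-supplied parameters, and the quadratic form of the penalty is an assumption.

```python
import numpy as np


def reacher_step_reward(tip_pos, goal_pos, action, goal_threshold, energy_coeff):
    """Return (reward, at_goal) for a single time step.

    The reward is 0 when the tip is within `goal_threshold` of the goal and -1
    otherwise, minus an energy penalty on the motor action.
    """
    offset = np.asarray(tip_pos, dtype=float) - np.asarray(goal_pos, dtype=float)
    at_goal = bool(np.linalg.norm(offset) <= goal_threshold)
    distance_reward = 0.0 if at_goal else -1.0
    # ASSUMPTION: quadratic penalty on the action; the exact formula is not in the text.
    energy_penalty = energy_coeff * float(np.sum(np.square(action)))
    return distance_reward - energy_penalty, at_goal


def episode_done(at_goal, step, stay_variant, max_steps=200):
    """Reacher ends on reaching the goal; Reacher Stay always runs for max_steps."""
    if step >= max_steps:
        return True
    return at_goal and not stay_variant
```

The only difference between the two tasks in this sketch is the termination rule, which is what forces the Reacher Stay policy to damp vibrations instead of merely passing through the goal.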
III-B3 Thrower
Flexibility in a robot arm could either deter or aid its ability to interact with an object in the environment. In this task, the goal for the robot arm was to “throw” a puck as far as possible (Fig. 1C). The puck is placed at a fixed position in the robot’s environment and the arm starts at random positions similar to the previous tasks. Also similar to previous tasks, the observation includes the vector between the fingertip of the robot and the object position. Based on its flexibility, the robot should learn to push the object appropriately so as to generate maximum acceleration of the puck. In this task, the only reward the robot receives during an episode is the energy penalty, but at the end of 200 time steps, a reward proportional to the distance between the initial and final position of the puck is provided.
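The sparse reward of the Thrower task can be sketched in the same spirit. The coefficients `energy_coeff` and `distance_coeff` and the quadratic energy term are assumptions for illustration only; the text states just that the per-step reward is an energy penalty and that the final reward is proportional to the puck's displacement.

```python
import numpy as np


def thrower_reward(action, step, puck_start, puck_now,
                   energy_coeff, distance_coeff, max_steps=200):
    """Energy penalty on every step, plus a displacement bonus on the final step."""
    # ASSUMPTION: quadratic penalty on the action; the exact formula is not in the text.
    reward = -energy_coeff * float(np.sum(np.square(action)))
    if step == max_steps:
        displacement = np.linalg.norm(
            np.asarray(puck_now, dtype=float) - np.asarray(puck_start, dtype=float)
        )
        reward += distance_coeff * float(displacement)
    return reward
```

Because the displacement bonus arrives only at the last step, the learner receives no intermediate signal about how well it struck the puck.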
III-B4 Ant
While the tasks discussed so far involved a single-link robot with a single flexible component, in order to test more complex tasks we modified the Ant environment in OpenAI Gym. This task involved training a four-legged ant to walk, with rewards being provided at each time step proportional to the distance walked, the energy spent, contact with the floor and for maintaining stable dynamics (Fig. 1D). The Ant was made flexible by the addition of two flexible ball joints to each leg - one between the hip and the knee, and the other between the knee and the ankle. Also unlike previous tasks, these pseudo-joints, being ball joints, have more degrees of freedom in which to flex, and since all four legs have these joints, learning to walk involves the coordination of multiple flexible parts, thus making this task significantly more difficult to learn.
III-C Policy Search Method: DDPG
Deep Deterministic Policy Gradients (DDPG) is a stochastic reward-based policy search method that has been shown to stably learn policies for continuous observation and action spaces [24]. The architecture involves an actor-critic neural network pair, where the actor takes in observations to provide actions that can be performed, and the critic provides an estimate of the expected discounted long-term reward given the policy. The critic is trained to better estimate the expected long-term reward based on the actual rewards received from the environment. The actor is trained using the training signal provided by the critic to adjust its parameters in the direction that maximizes expected long-term reward. The training dynamics between these two networks are stabilized by having copies of these networks that are updated at a slower time-scale, and whose targets essentially convert the training of the main actor and critic to that of a supervised learning problem. In [24] it has been demonstrated that DDPG can efficiently learn competitive policies across a wide range of tasks. Hyper-parameters such as the network architecture (2 hidden layers with 400 and 300 neurons respectively, identical for the actor and critic networks), minibatch size of 64, replay buffer size of , learning rate ( and for actor and critic respectively), were the same as in the original DDPG paper and did not require any fine tuning for each task. Intel’s RL-Coach [26] implementation of DDPG was employed in conjunction with customized MuJoCo environments to perform the experiments outlined in the next section.
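Two pieces of this description lend themselves to a short sketch: the critic's regression target, computed from the slowly updated target networks, and the slow update of those target networks themselves. The code below is a NumPy illustration and not RL-Coach's API; the function names are hypothetical, and the default values of `gamma` and `tau` are assumptions taken from commonly used DDPG settings rather than from this text.

```python
import numpy as np


def td_targets(rewards, dones, next_q_target, gamma=0.99):
    """Critic targets y = r + gamma * (1 - done) * Q'(s', mu'(s')).

    `next_q_target` holds the target critic's value for the target actor's
    action in the next state; it is treated as a fixed label, which is what
    turns the critic update into a supervised regression problem.
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    next_q_target = np.asarray(next_q_target, dtype=float)
    return rewards + gamma * (1.0 - dones) * next_q_target


def soft_update(target_params, online_params, tau=0.001):
    """Move each target parameter a fraction `tau` toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

With a small `tau`, the target networks change slowly, so the labels produced by `td_targets` stay nearly stationary between consecutive minibatch updates.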
III-D Information theoretic analysis
In the analysis of the impact of different sensor choices on learning performance, we measure the mutual information between the observations and the end-effector position while running a random policy, and similarly between the observations and rewards while running a trained policy. This was carried out using the Python infotheory package [27], where data distributions were estimated using equal-interval binning followed by average shifted histograms [28] in order to smooth the boundary effects of arbitrary binning. Two different binning resolutions, 25 and 50, were tried, giving similar results. In both cases, shifted histograms with 3 shifts were used to estimate the final data distribution. Results reported in the paper are from binning with 50 bins along each dimension.
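To make the binning-based estimate concrete, here is a minimal sketch of a mutual information estimator for two scalar variables using equal-interval bins. It does not use the infotheory package and omits the average shifted histogram smoothing, so it is a simplified stand-in rather than the analysis pipeline used in the paper; the function name is hypothetical.

```python
import numpy as np


def binned_mutual_information(x, y, bins=50):
    """Estimate I(X; Y) in bits from paired 1-D samples using equal-interval bins."""
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = joint_counts / joint_counts.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    product = px @ py                       # independent reference distribution
    nonzero = joint > 0                     # empty cells contribute nothing
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / product[nonzero])))
```

Plug-in estimates of this kind are biased upward when there are many bins relative to the number of samples, which is one reason to compare results across binning resolutions as described above.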
IV EXPERIMENTS
The methods section outlined the robot and task design, the policy search method and the analysis techniques. This section outlines the specific experiments that were conducted to systematically study how flexibility affects learning performance of policy search.
IV-A Can policy search serve as a methodology for end-to-end training of flexible robots?
The first experiment involved studying whether policy search methods, as an end-to-end learning methodology, can learn to perform tasks when hardware components are flexible. In order to study this, we optimized deep neural network policies using DDPG for the tasks mentioned in the previous section while systematically increasing the level of flexibility. For each task, for each level of flexibility, 20 independent runs with different random seeds were executed to study the variance in performance.
IV-B Are policy search methods robust to other levels of flexibility?
The second experiment involved studying policies trained on one level of flexibility and tested on all other levels of flexibility. The focus of this experiment is to study the relative change in performance when a policy is tested on levels of flexibility other than the level it was trained on. An algorithm can be considered robust if it performs comparably well with perturbations during testing in relation to its performance during training. To this end, of the 9 different levels of flexibility that were available, we evaluated policies optimized for one level of flexibility on the remaining 8 levels. This was repeated for each of the 20 runs and for each level of flexibility and task.
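The evaluation protocol amounts to filling in a train-level by test-level matrix of rewards. The sketch below assumes a caller-supplied `evaluate(policy, level, episodes)` function returning a mean reward; that function, the default of 100 episodes, and both function names here are hypothetical placeholders rather than part of the authors' code.

```python
import numpy as np


def cross_flexibility_matrix(policies, flexibility_levels, evaluate, episodes=100):
    """Entry [i, j] is the mean reward of the policy trained at level i, tested at level j."""
    n = len(flexibility_levels)
    rewards = np.zeros((n, n))
    for i, policy in enumerate(policies):
        for j, level in enumerate(flexibility_levels):
            rewards[i, j] = evaluate(policy, level, episodes)
    return rewards


def relative_change(rewards):
    """Subtract each policy's reward at its own training level (the diagonal)."""
    return rewards - np.diag(rewards)[:, None]
```

A robust algorithm in the sense used here would give off-diagonal entries of `relative_change` that stay close to zero.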
IV-C Are policy search methods sensitive to choice of sensors?
Next, we studied the effect that the choice of sensors had on training flexible robots using policy search by including IMU sensors in the robot in addition to joint angle sensors. The intuition is that, in flexible robots, the higher-order IMU sensor provides more information about the flexible dynamics than first-order joint angle sensors. Training was carried out on the same tasks with IMUs added at each end-effector, in addition to the joint angle sensors. Similar to experiment 1, the maximum mean evaluation reward from 100 evaluation trials across training time was considered the representative solution, and this was repeated for 20 runs for each sensor choice and for each level of flexibility. Performance-flexibility curves were compared across the different sensor choices.
V RESULTS
V-A Policy search is resilient to flexibility of robot hardware
As is common with DDPG, the policy that performed the best during the 100 evaluation episodes was selected as the representative solution for a particular run. Comparing the means and standard deviations of the best policies over 20 independent runs showed that DDPG was able to reliably learn efficient policies across all levels of flexibility tested using joint angle sensors alone as observations (Fig. 1). While Reacher and Thrower showed a gradual, albeit small, decline in performance with increasing levels of flexibility, the Reacher Stay and Ant tasks performed consistently well across all levels of flexibility. This demonstrates that end-to-end learning with DDPG can not only learn to perform efficiently with flexible hardware in all tasks, but can also learn to take advantage of flexibility in robot hardware in order to perform even better.
V-B Policies learnt were robust in adapting to different levels of flexibility
The second experiment that was performed involved studying the robustness of policies learned using DDPG across different levels of flexibility. The average evaluation reward over 100 episodes was estimated for each level of flexibility that the robot was not trained on, and compared to the performance on the level of flexibility it was trained on. Policies from the end of training in all 4 tasks show that policies learned using DDPG were able to adapt robustly and earn a total reward similar to the reward received on the task they were trained on (Fig. 2). In other words, the performance is largely invariant to the flexibility of the robot, thereby making DDPG a robust approach to training flexible robots.
V-C Policy search was sensitive to sensor choice
Policy search performs at least as well, if not better, with joint angles alone when compared to performance with the addition of IMU sensors. Contrary to intuition, adding the additional sensor did not result in better performance. In the Reacher and Reacher Stay tasks, the addition of IMUs resulted in performance that was just as good, but in the Thrower and Ant tasks, it resulted in worse performance. DDPG performs with no difference between sensor choices in the Reacher task (Fig. 3A), suffers some drop in performance on Reacher Stay with IMUs only (Fig. 3B), drops in performance with IMUs only and with both sensors in Thrower (Fig. 3C), and finally drops significantly with flexibility for Ant with IMUs only and also with both sensors (Fig. 3D). Joint angles alone is the only sensor choice for which DDPG performs consistently well across all levels of flexibility. These results demonstrate that policy search is sensitive to the choice of sensors. This is explored further next.
V-D More information with more sensors is not indicative of better performance
While the first two results provide general support for the use of policy search in training flexible robots, the third result above, which shows DDPG to be quite susceptible to sensor choice, is counter-intuitive. In order to better understand where intuitions about training with more sensors may be wrong, we analyzed whether the addition of IMU sensors did in fact provide more information about the task by measuring, in a random policy, the amount of mutual information between (a) sensor observations and position of the end-effector and (b) sensor observations and reward. Note that, when the robot arm is flexible, joint angle sensors at the base would not accurately determine the position of the end-effector. Performing these analyses on the three robot settings revealed that adding IMU sensors does increase information about end-effector position, thereby providing the agent with more information about the flexible dynamics of the robot in its observation (top row in Fig. 4). Furthermore, the additional information was shown to be not just about robot dynamics but also relevant to the task, because information about reward in the observations also consistently went up in all tasks with the addition of IMU sensors (bottom row in Fig. 4).
Interestingly, in Reacher, angle sensors provided more information than IMUs about the position of the end-effector (Fig. 4A), but together they provided significantly more information about the reward than either one of them alone (Fig. 4D), which suggests the combination should perform really well at extreme levels of flexibility. With Thrower, most likely due to the greater impact of flexibility when the arm comes into contact with an object in the environment, the IMU gives significantly more information about object position (Fig. 4B). While striking the object, the base angle of the arm could be in a variety of positions that would have never occurred in the Reacher tasks. This results in the angle sensor having very low information about the position of the end-effector. However, joint angle sensors show more information about the reward than the IMUs (Fig. 4E). This suggests that although the addition of a sensor provides more information about the dynamics of the arm, that information is not quite relevant to the task, although together the sensors do contribute to the reward to some degree. In Ant, the addition of IMUs did not add much to the magnitude of the information about the end-effector or reward (Fig. 4C and F), while the IMU by itself provided a lot of information, which suggests that the two sensors provide redundant information in different formats that the network might have to learn to parse.
Intuitive understanding about the role of additional sensors is composed of two parts - the first being, addition of more sensors should provide more information to learn from, and the second being, having access to more information should make learning easier. DDPG’s performance suggests that, in line with the first part of our intuition, addition of sensors does increase the amount of task-relevant information to the learner, but it is the second part that does not appear to hold. More information does not necessarily make it easier for policy learning. This could be further explored in future studies involving controlled experiments of systematic manipulation of information in different tasks to better understand the relationship between information and learning in deep networks using reinforcement learning.
VI DISCUSSION
In summary, through the experiments with Deep Deterministic Policy Gradients as a policy search method for training flexible robots, we have demonstrated the following: (i) policy search methods are capable of end-to-end learning to perform efficiently across a wide range of flexibility in hardware; (ii) policies learned using DDPG are robust in their ability to adapt to levels of flexibility different from the one they were trained on; and (iii) while policy search methods learn and perform well with the correct choice of sensors, they can be susceptible to the choice of sensor. Thus, in the training of flexible robots using policy search, appropriate sensor choice is a more crucial parameter than the flexibility itself.
The ability of learned policies to be robust across different levels of hardware flexibility offers several advantages as noted previously. Besides simply dropping the constraint on rigidity requirements of the hardware, this property of learned policies also allows transfer from rigid simulators to flexible robots or from flexible robots to other differently flexible robots. Furthermore, they also allow transfer of policies between robots that were manufactured with variance in properties from one robot to another. Thus, taking this approach decreases constraints and increases generalizability in all aspects of robot design.
The work in this paper has taken a systematic approach to studying the properties of policy search methods for training flexible robots. While we have tested on a relatively new algorithm, DDPG, there are certainly other even newer algorithms such as TRPO, PPO and SAC that could be tested. Perhaps they are less susceptible to sensor choices. Future work in our group involves studying more algorithms in a greater number of tasks, learning directly in real flexible robots as opposed to simulation, and further understanding the relationship between input information to a learner and its ability to learn.
This paper demonstrates, as a proof-of-concept, the efficacy and sensitivities of using policy search for end-to-end training of flexible robots. Artificial neural networks, being universal function approximators, and reinforcement learning, being a very practical training approach for robots, form a potent combination that simplifies robot design by relaxing one of the most widely imposed constraints on robots - rigidity. This would make robots cheaper, easier to design and maintain, and more robust in the face of changes to robot dynamics that would otherwise require a complete rebuilding of the model from scratch. While there are still several aspects of this approach that are yet to be explored, such as the relationship between information content in the observations and the ability to learn, this paper demonstrates that policy search holds a lot of promise for robot design that could get closer to that of natural systems.
REFERENCES
- [1] S. Kim, C. Laschi, and B. Trimmer, “Soft robotics: a bioinspired evolution in robotics,” Trends in Biotechnology, vol. 31, no. 5, pp. 287–294, 2013.
- [2] D. Rus and M. T. Tolley, “Design, fabrication and control of soft robots,” Nature, vol. 521, no. 7553, p. 467, 2015.
- [3] S. K. Dwivedy and P. Eberhard, “Dynamic analysis of flexible manipulators, a literature review,” Mechanism and Machine Theory, vol. 41, no. 7, pp. 749–777, 2006.
- [4] M. Benosman and G. Le Vey, “Control of flexible manipulators: A survey,” Robotica, vol. 22, no. 5, pp. 533–545, 2004.
- [5] M. O. Tokhi and A. K. Azad, Flexible robot manipulators: modelling, simulation and control. IET, 2008, vol. 68.
- [6] C. T. Kiang, A. Spowage, and C. K. Yoong, “Review of control and sensor system of flexible manipulator,” Journal of Intelligent & Robotic Systems, vol. 77, no. 1, pp. 187–213, 2015.
- [7] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [8] M. P. Deisenroth, G. Neumann, J. Peters, et al., “A survey on policy search for robotics,” Foundations and Trends® in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013.
