Viewpoint Optimization for Autonomous Strawberry Harvesting with Deep   Reinforcement Learning

Jonathon Sather; Xiaozheng Jane Zhang

arXiv:1903.02074·cs.RO·May 3, 2019

Viewpoint Optimization for Autonomous Strawberry Harvesting with Deep Reinforcement Learning

Jonathon Sather, Xiaozheng Jane Zhang

PDF

Open Access 1 Repo

TL;DR

This paper investigates using deep reinforcement learning to optimize camera viewpoints for autonomous strawberry harvesting, demonstrating significant improvements in decision-making efficiency in a simulated environment.

Contribution

It introduces a novel deep reinforcement learning algorithm for viewpoint optimization in autonomous harvesting, addressing perception bottlenecks and showing promising simulation results.

Findings

01

Agent achieves 8.7x higher returns than random actions

02

Agent explores 8.8% faster than baseline visual servoing

03

Agent successfully fixates on favorable viewpoints without explicit temporal information

Abstract

Autonomous harvesting may provide a viable solution to mounting labor pressures in the United States's strawberry industry. However, due to bottlenecks in machine perception and economic viability, a profitable and commercially adopted strawberry harvesting system remains elusive. In this research, we explore the feasibility of using deep reinforcement learning to overcome these bottlenecks and develop a practical algorithm to address the sub-objective of viewpoint optimization, or the development of a control policy to direct a camera to favorable vantage points for autonomous harvesting. We evaluate the algorithm's performance in a custom, open-source simulated environment and observe encouraging results. Our trained agent yields 8.7 times higher returns than random actions and 8.8 percent faster exploration than our best baseline policy, which uses visual servoing. Visual…

Equations8

π^{*} = π arg max E_{τ \sim p_{π} (τ)} [R]

π^{*} = π arg max E_{τ \sim p_{π} (τ)} [R]

y_{i}

y_{i}

θ

ϕ

R(s_{t})=\left\{\begin{array}[]{ll}R_{invalid}&(\theta_{t},\phi_{t})\notin\mathcal{J}\\ R_{detect}&P_{max,t}\geq P_{thresh}\\ R_{exist}&otherwise\end{array}\right.

R(s_{t})=\left\{\begin{array}[]{ll}R_{invalid}&(\theta_{t},\phi_{t})\notin\mathcal{J}\\ R_{detect}&P_{max,t}\geq P_{thresh}\\ R_{exist}&otherwise\end{array}\right.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

jsather/harvester-sim
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSmart Agriculture and AI · Plant Virus Research Studies · Greenhouse Technology and Climate Control

Full text

Viewpoint Optimization for Autonomous Strawberry Harvesting

with Deep Reinforcement learning

Jonathon Sather and Xiaozheng Jane Zhang

California Polytechnic State University

San Luis Obispo, CA

{jsather, jzhang}@calpoly.edu

Abstract

Autonomous harvesting may provide a viable solution to mounting labor pressures in the United States’s strawberry industry. However, due to bottlenecks in machine perception and economic viability, a profitable and commercially adopted strawberry harvesting system remains elusive. In this research, we explore the feasibility of using deep reinforcement learning to overcome these bottlenecks and develop a practical algorithm to address the sub-objective of viewpoint optimization, or the development of a control policy to direct a camera to favorable vantage points for autonomous harvesting. We evaluate the algorithm’s performance in a custom, open-source simulated environment and observe encouraging results. Our trained agent yields 8.7 times higher returns than random actions and 8.8 percent faster exploration than our best baseline policy, which uses visual servoing. Visual investigation shows the agent is able to fixate on favorable viewpoints, despite having no explicit means to propagate information through time. Overall, we conclude that deep reinforcement learning is a promising area of research to advance the state of the art in autonomous strawberry harvesting.

1 Introduction

Driving down Highway 101 through Santa Maria, California in the springtime, it is hard to ignore the abundance of strawberry fields on both sides of the highway. Each field is populated by dozens of seasonal laborers inching their way down the rows and stooping to manually collect ripe berries – a practice that has been largely unchanged for the past 700 years [10]. Recently, the heavy reliance on human labor has become problematic for the California strawberry industry, as an aging and ever-shrinking workforce continues to drive up costs [21].

A promising idea to address these mounting labor pressures is to use autonomous harvesting to fill the roles of human pickers. In such a system, robots navigate the strawberry field and remove ripe strawberries, eliminating the need for manual labor. Although many researchers are actively working to improve autonomous harvesting technology, a profitable and commercially adopted strawberry harvesting system remains elusive [6]. Strawberry plants present several difficulties for autonomous harvesting systems, including high levels of occlusion, lighting variation, and easily-bruised fruit. Deep reinforcement learning may provide the machinery for a harvesting agent to develop advanced control policies that capture the complex relationships required to reason about the unstructured harvesting environment. However, deep reinforcement learning systems are historically fragile and have seen limited success in real-world settings [4]. Therefore, it is not obvious whether such an approach can be successfully adapted to the autonomous harvesting domain.

We narrow the scope from the full harvesting process to the task of viewpoint optimization, or the development of a control policy to direct a camera to favorable vantage points for autonomous harvesting (Figure 1). Viewpoint optimization is a seldom-addressed paradigm that could be prepended to an autonomous harvesting pipeline to improve speed and accuracy on the remaining harvesting steps, e.g. object detection, pose determination, path planning, and berry removal [8]. More importantly, viewpoint optimization encapsulates many of the perception bottlenecks present in the full harvesting task, while significantly reducing the overall task complexity. For these reasons, we believe that a successful application of deep reinforcement learning to the viewpoint optimization objective will provide considerable insight on the framework’s potential utility in the full harvesting process.

1.1 General Approach

We develop a novel application of Deep Deterministic Policy Gradients (DDPG) [17] that leverages a pretrained object detector to provide reward feedback and facilitate autonomous training. To collect training images for the object detector, we propose a labor efficient data collection procedure, which makes the entire process require minimal human interaction in a real-world setting. We then create a multi-purpose simulated harvesting environment111https://github.com/jsather/harvester-sim using ROS [24] and Gazebo [16] which we use to train and test all components of our viewpoint optimization algorithm. Our results show that deep reinforcement learning is a promising area of research to advance the state of the art in autonomous strawberry harvesting.

2 Related Work

To the best of our knowledge, this work is the first research exploring viewpoint optimization via reinforcement learning for autonomous harvesting applications (any crop). A related idea is visual servoing [7], which is a class of control algorithms acting on image features that has been applied in the autonomous harvesting domain. Unlike our method, visual servoing typically involves hand-specifying features, whereas our algorithm learns a control policy with a data-driven approach. In [18], Mehta and Burks implement visual servoing by means of two cameras: one in the hand of a citrus harvesting robot and one stationary camera with a wide field of view. The feedback from the cameras is then used to create a perspective image and guide the robotic manipulator towards an artificial citrus fruit. One of the main limitations of this approach is that it requires the target fruit to be visible by the fixed camera, which cannot be guaranteed in unstructured environments.

The main algorithms used in this research are Deep Deterministic Policy Gradients (DDPG) [17] and You Only Look Once, Version 2 (YOLOv2) [26]. DDPG is an off-policy, actor-critic deep reinforcement learning algorithm that is used as the underlying machinery for the viewpoint optimization problem. YOLOv2 is a single-shot object detection algorithm using convolutional neural networks. It is capable of outputting labeled bounding boxes at above real-time speeds using consumer-level hardware. In this research, we train YOLOv2 to detect strawberries and use its output as a feedback mechanism during the reinforcement learning process.

3 Preliminaries

Reinforcement learning problems are often modeled as a Markov Decision Process (MDP), which describes a discrete-time, stochastic environment with a decision-making agent [5]. MDPs are defined by the 5-tuple $\{S,A,P,R,\gamma\}$ , where:

$s_{t}\in S$ is the set of all states, each $s_{t}$ containing all relevant information about the environment at time $t$ . 2. 2.

$a_{t}\in A$ is the set of all possible actions in the environment at time $t$ . 3. 3.

$P(s_{t+1}|s_{t},a_{t})$ is the state-transition probability for state $s_{t+1}$ given state $s_{t}$ and action $a_{t}$ . 4. 4.

$R(r_{t}|s_{t},a_{t})$ is the reward probability for $r_{t}$ given state $s_{t}$ and action $a_{t}$ . 5. 5.

$\gamma\in[0,1)$ is the discount factor, which is used to geometrically decay the value of future rewards and often aids in algorithm convergence [28].

The resulting process is characterized by a cyclic interplay between an agent and its environment. At timestep $t$ , an agent observes state $s_{t}$ and subsequently takes action $a_{t}$ according to its policy $\pi(a_{t}|s_{t})$ . The agent then arrives at state $s_{t+1}$ through the environment’s dynamics $P(s_{t+1}|s_{t},a_{t})$ and receives scalar reward $r_{t}$ according to $R(r_{t}|s_{t},a_{t})$ . This process repeats, now from $s_{t+1}$ , until a termination criterion is reached.

A roll-out of states and actions from initial state $s_{0}$ until termination is called a trajectory, denoted $\tau=(s_{0},a_{0},s_{1},a_{1},\ldots{},s_{T-1},a_{T-1})$ . We denote the discounted return for a given trajectory as $\mathcal{R}=\sum^{T-1}_{t=0}\gamma^{t}r_{t}$ . For our problem, the goal of reinforcement learning is discover an optimal policy $\pi^{*}$ that maximizes expected return under trajectory distribution $p_{\pi}(\tau)$ . Formally:

[TABLE]

In most practical settings, it is infeasible to explicitly represent the policy or expected return at each $s\in S$ and $a\in A$ , so it is common to use function approximators to characterize these quantities at regions of interest. DDPG uses neural networks to approximate policy $\pi_{\phi}$ and state-action value function $Q_{\theta}(s_{t},a_{t})\approx\mathbb{E}_{\tau\sim p_{\pi}(\tau)}[R|s_{t},a_{t}]$ . During each training iteration, network weights are jointly updated to maximize the reinforcement learning objective using DDPG, an implementation of Deterministic Policy Gradients (DPG) [27] adapted to work deep neural networks.

Specifically, $Q_{\theta}$ is updated using a variant of fitted Q-iteration with deterministic policy $\pi_{\phi}$ used to approximate $\operatorname*{arg\,max}_{a}Q_{\theta}(s_{t},a)$ . $\pi_{\phi}$ is updated using the deterministic policy gradient to maximize $Q_{\theta}(s_{t},\pi_{\phi}(a_{t}|s_{t}))$ . Lillicrap et al. leverage techniques inspired by Deep Q Networks (DQN) [20], such as experience replay and target networks $Q_{\theta^{\prime}},\pi_{\phi^{\prime}}$ , to stabilize training. This gives way to the following off-policy updates:

[TABLE]

where subscript $i$ is used to reference elements of the current training batch. Further details can be found in [17].

3.1 A Note on Partial Observability

It is worth noting that while the the forgoing discussion assumes the reinforcement learning problem could be framed as a Markov Decision Process, this is not always the case. In many real-world problems, such as our formulation of viewpoint optimization, each observation taken in the environment may not contain all relevant information to maximize expected return. In this case, the environment is said to be partially observed, which often warrants a generalization of the MDP framework [14]. Despite this, we use algorithms designed for fully observed MDPs in a partially observed setting. Although we lose some theoretical support, this simplifies our implementation while still facilitating meaningful insights for reinforcement learning’s promise in autonomous harvesting.

4 Method

4.1 System Setup

We define our environment to consist of a high degree-of-freedom articulated robot with an RGB camera mounted to its end effector, positioned over an outdoor strawberry plant. To avoid plant collisions, we constrain the end-effector to move on a hemisphere of a fixed radius above the strawberry plant as shown in in Figure 2. This allows us to define its position by two rotation angles $j=(\theta,\phi)$ . The set of all reachable positions combined with all possible camera images defines our state space, $S=\{\mathcal{J},\mathcal{O}\}$ . We define our action space as incremental positions on this hemisphere $a=(\Delta\theta,\Delta\phi)$ and use off-the-shelf trajectory planning software and proportional-integral-derivative (PID) joint controllers to move between states.

At each state, we use confidence values from a pretrained strawberry detector and return positive reward if ripe strawberry confidence is above a given threshold. We also penalize actions that lead to unreachable positions and impose a light “existence penalty” to encourage exploration when no ripe strawberries are visible. Intuitively, this reward scheme encourages the agent to move to viewpoints that yield high probability of ripe strawberry detection, which we assume correspond to favorable viewpoints for the harvesting process. Note that even if this assumption is not fully met, we can still gain insight from the agent’s performance, as the task nevertheless requires the agent to overcome common perception challenges in a stochastic harvesting environment. Formally, the reward scheme is specified as follows:

[TABLE]

where $P_{max,t}$ and $P_{thresh}$ correspond to the maximum ripe strawberry confidence output by the detector at timestep $t$ and the minimum confidence threshold, respectively. In our experiments, we use $R_{detect}=1.0$ , $R_{invalid}=-1.0$ , $R_{exist}=-0.1$ , and $P_{thresh}=0.6$ .

One of the primary benefits of our reward specification is that it does not require a human-in-the-loop to provide external feedback. This makes it convenient to train the harvester for extended intervals in environments where policy execution is expensive, e.g. the real world. A practical limitation of this approach is that we are relying on a strawberry detector to produce “correct” annotations without reference to ground-truth values, which may allow the agent to exploit false positive rewards. To mitigate this limitation, we ensure the strawberry detector predicts robust and high-precision bounding boxes before integrating it into the reinforcement learning algorithm.

4.2 Detector

While our reward scheme eliminates the need for human feedback during reinforcement learning, it introduces the new burden of collecting and annotating images for pretraining the strawberry detector. To combat this burden, we develop a labor-efficient data collection procedure that leverages the fact that multiple training images can be annotated per plant given knowledge of its berries’ locations in 3D space. This procure is outlined in Algorithm 1. Note that we collect data and train our detector for both ripe and unripe strawberries, despite only using ripe detections in our reward scheme. This is done to increase the detector’s ability to discriminate between the two classes and reduce the frequency of false positive detections.

In a real-world harvesting context, the ground truth 3D strawberry labels in Algorithm 1 can be obtained by kinesthetically guiding the robot’s end effector around each strawberry to log it’s location and approximate diameter. In the simulated environment, we replace this step by directly using the ground truth poses and sizes of each strawberry. To simplify the mapping from 3D pose to bounding box, we approximate each strawberry as a perfect sphere.

During the annotation step, we account for occlusions by rejecting bounding boxes with insufficient pixels within an empirically-determined hue range. We find this occlusion heuristic results in satisfactory annotations on our simulated data without requiring manual labelling. We note that a more sophisticated occlusion removal strategy may be needed in a real-world setting.

We use Algorithm 1 to collect over 7000 images in the simulated environment, designating 80% of dataset for training/validation and the remaining 20% for testing. We then train YOLOv2 for 50,000 epochs using the Darknet framework [25] with default hyperparameters as specified in [26]. The trained detector is able to perform real-time detections ( $>30$ frames per second) using a Tesla K80 GPU.

4.3 Reinforcement Learning

We frame the viewpoint optimization problem as an episodic MDP, where each episode the agent is spawned at fixed location $j_{0}=(\theta_{0},\phi_{0})$ relative to an arbitrary strawberry plant. In a real-world harvesting environment, this could be implemented by mounting the manipulator to an unmanned ground vehicle and navigating from plant to plant between episodes. The entire process can the be executed without human intervention, provided the robot is equipped with appropriate vision software for autonomous navigation and safe operation. In the simulated environment, this process is mimicked using a generative strawberry plant model to initialize a new plant configuration at the beginning of each episode.

We use DDPG to train the harvesting agent for viewpoint optimization, as its off-policy nature allows for sample efficient training updates. Additionally, DDPG has seen success using convolutional neural networks to process raw pixel inputs [17]. Our implementation is very similar to [17], except we perform policy execution and training updates in parallel to maximize sample efficiency (inspired by [29]).

For the both the actor $\pi_{\phi}$ and critic $Q_{\theta}$ , we use five $3\times 3\times 32$ convolutional layers with stride 2 to process the $800\times 800\times 3$ raw pixel input. In the actor network, this is followed by two 200-neuron fully connected layers with tanh activations. The critic network has a similar structure, except the first fully connected layer is concatenated with the action input, and the final (scalar) output is a linear activation. The networks use the same uniform initialization scheme and weight regularization as in [17]. Training is performed in batch sizes of 16 and updates are performed with Adam optimizer [15] using learning rates of $1\times 10^{-4}$ and $1\times 10^{-3}$ for the actor and critic, respectively, and default values $\beta_{1}=0.9$ , $\beta_{2}=0.999$ , $\hat{\epsilon}=1\times 10^{-8}$ from the paper. The target networks are updated using Polyak Averaging [23] with mixing parameter $\tau=1\times 10^{-3}$ . We implement the network architectures using TensorFlow [3] and train for 10,000 episodes in the simulated environment. On a physical system, this equates to approximately one week of training.

4.4 Simulated Environment

We create a simulated environment using Robot Operating System (ROS) [24] and Gazebo [16] with the goal of being sufficiently realistic so that results in the simulated environment are indicative of performance in real-world settings. The virtual world mimics an open strawberry field and consists of a section of a bed with dirt surroundings and a randomly-generated strawberry model. The strawberry model incorporates real-world perception challenges such as heavy occlusion, complex textures, shadows, and stochasticity. Within the world, we place a “floating arm” harvester centered about the strawberry plant. The arm is modelled after the JACO by Kinova Robotics [2].

Plant models are generated using using Gazebo’s native SDF file format with embedded Ruby [1]. The algorithm creates plants based on a simple structural model in which various parameters, such as berry pose, mesh, and ripeness, are sampled to create unique plant configurations on the fly. All of the meshes are custom made, with the exception of the strawberry flesh, which uses down-sampled meshes from the UC Davis Strawberry Database [9].

5 Results

5.1 Detector Evaluation

We evaluate the learned strawberry detector on 2000 held-out images to generate the precision-recall plot in Figure 3. The plot shows the relationship between precision and recall as the YOLOv2 minimum confidence is varied from 0.01 to 1.0 for 9 intersection-over-union (IOU) thresholds. Looking at the “traditional” IOU metric of 0.5, we see that there is a gradual linear decrease in precision with a large increase in recall as the threshold is lowered from 0.75 to 0.5. Lowering the threshold further results in a sharp decrease in precision and asymptotic recall to 0.5.

In reinforcement learning, it is important to have a strong reward signal so that the agent can understand the consequences of its actions. As such, we err on the side of high precision/low recall and select $R_{thresh}=0.6$ as our reward threshold, corresponding to a precision of 0.9 and a recall of less than 0.2. While such a low recall may initially raise some red flags, it is important to note that its implications largely depend on the nature of the strawberries that are ignored. For this, we turn to a visual analysis.

Looking at the behavior of the detector frame-by-frame, we note its performance is relatively consistent across the camera images. It appears to prefer strawberries that are closer to the camera and larger, with more red flesh of the strawberry showing corresponding to higher confidence values. The strawberries that are missed by the detector are typically smaller or heavily occluded. Therefore, as a feedback mechanism for viewpoint optimization, the biases exhibited by the detector are likely beneficial. An example annotation is shown in Figure 4.

5.2 Policy Performance

To assess the performance of the learned policy, we compare several performance metrics versus five baseline policies and a “hybrid” policy. These policies are listed below in order of increasing complexity.

Random policy, $\pi_{1}$ : At each timestep, the agent takes a uniformly sampled action in $\Delta\theta$ and $\Delta\phi$ . 2. 2.

Random policy with boundary awareness, $\pi_{2}$ : At each timestep, the agent takes a uniformly sampled action in $\Delta\theta$ and $\Delta\phi$ , taking opposite actions to stay in bounds as needed. 3. 3.

Downward heuristic with boundary awareness, $\pi_{3}$ : Given threshold $\phi^{*}\in(0,\frac{\pi}{2})$ . At each timestep, the policy selects a downward action along the hemisphere until it reaches threshold $\phi^{*}$ . Once it crosses the threshold, the policy takes random actions in accordance with $\pi_{2}$ . 4. 4.

Frozen detector with downward heuristic and boundary awareness, $\pi_{4}$ : At each timestep, the policy first runs the pretrained strawberry detector on observation $o_{t}$ , and obtains coordinates $(x_{t},y_{t})$ of the most confident ripe detection in the image frame. If a ripe strawberry is detected, the policy outputs zero action. Otherwise, the policy moves in accordance with $\pi_{3}$ . 5. 5.

Proportional detector with downward heuristic and boundary awareness, $\pi_{5}$ : At each timestep, the policy first runs the pretrained strawberry detector on observation $o_{t}$ , and obtains coordinates $(x_{t},y_{t})$ of the most confident ripe detection in the image frame. If a ripe strawberry is detected, the policy outputs action proportional to the direction of its bounding box in the image plane. Otherwise, the policy moves in accordance with $\pi_{3}$ . 6. 6.

Hybrid policy: Given threshold $\phi^{*}\in(0,\frac{\pi}{2})$ . At each timestep, the policy selects a downward action along the hemisphere until it reaches threshold $\phi^{*}$ . Once it crosses the threshold, the agent follows the learned policy (DDPG), taking opposite actions to stay in-bounds as needed.

Note that for the detector-based policies we use a bounding box threshold of 0.5 to represent a positive detection, instead of 0.6 used to derive the reward values. This prevents the baseline detectors from having “insider knowledge” of the reward scheme and makes the resulting comparison less biased.

We run each of the policies for 100 episodes on previously unseen strawberry plants, recording the reward at each timestep and number of steps per episode. Using this information, we calculate the mean return for each episode and the number of timesteps until first reward, excluding trials without a reward. These data are summarized in Figures 5 and 6.

We observe the trained agent performs 8.8 times better with respect to the reinforcement learning objective than random actions, while it performs on-par with the most proficient baseline. Adding hard-coded heuristics increases return by 2.5 times, making the hybrid policy by far the best performing of those tested. Looking at mean timesteps until reward, we see an opposite trend when introducing the hybrid policy. In this instance, the trained agent performs better than all baselines by over one timestep, while the hybrid policy only averages nearly six more timesteps than the trained agent. These findings highlight that simple hard-coded rules may drastically improve a data-driven policy, but care must be taken to ensure modifications do not have unintended consequences.

5.3 Fixation Analysis

After determining a high-reward vantage point for detecting ripe strawberries, it is optimal for an agent to remain fixated at that location for the remainder of the episode. Fixation behaviors can also be detrimental if they occur on a vantage point with low return. In this section, we seek to characterize the learned agent’s fixation tendencies in an attempt to better understand the inner workings of its policy.

To detect instances of fixation, we run the DDPG policy on over 400 plants and track positions and rewards for each episode. We plot the corresponding trajectories on a hemisphere superimposed above the plant models, denoting detection states with a blue star. Examples of these visualizations are shown in Figure 7. We then manually inspect the plots and note trajectories where the agent exhibited fixation behaviors.

Of the 213 trajectories with more than 50 steps, 82 fixate on high-return regions, 64 fixate on low-return regions, and 67 do not appear to fixate. From this, it appears the agent learns the desired fixation behavior, but the policy lacks robustness to extend to all states. We suspect some of the low-return clusters are a pitfall of partial observability: Conflicting interpretations of plant geometry at adjacent viewpoints could result in opposite actions for exploration, even if a ripe strawberry is not in frame.

Using saved plant models from experimental trials, we move the agent to known fixation locations and manually inspect the camera images. On regions with high return, we observe the agent tends to favor viewpoints with ripe strawberries in close proximity with minimal occlusions (Figure 1). Such viewpoints are not only advantageous for detection but likely benefit the remaining steps of the harvesting process, providing better angles for pose prediction and reducing obstacles to simplify planning and execution of harvesting trajectories.

On the other regions, the image contents are more varied, but we notice that several of the viewpoints display ripe, or nearly-ripe, strawberries in frame that are not picked up by the pretrained detector. An example of this phenomenon is shown in Figure 8. In these instances, it is possible that the DDPG policy over-estimates the returns from these viewpoints and thus gravitates towards them. As a whole, it appears that the agent learns to exploit the strengths of the pretrained strawberry detector and generally navigates towards regions where it has a high probability of detection. This is particularly impressive considering the limitations of partial observability.

6 Conclusion

6.1 Contributions

In this research, we formulated a novel application of reinforcement learning to solve the viewpoint optimization problem for autonomous strawberry harvesting. We showed that feedback from a pretrained strawberry detector could be used as an autonomous reward scheme, eliminating the need for a human in the loop during the training process. In doing so, we developed a labor efficient data collection procedure for capturing and annotating strawberry images to train the strawberry detector. To train and test our algorithms, we created a realistic simulated environment incorporating many harvesting challenges found in real-world contexts, such as apparent randomness, frequent occlusions, complex textures, and lighting variation.

Within the simulated environment, we saw the agent performed favorably with respect to the reinforcement learning objective and time-to-detection metrics. Additionally, we noted that the nature of the agent’s fixated viewpoints not only aids ripe strawberry detection, but also provides desirable angles for pose determination, trajectory planning, and trajectory execution. Therefore, we are optimistic that the viewpoint optimization algorithm would have practical merit in real-world settings.

6.2 Future Works.

The research presented in this paper indicates that reinforcement learning is a promising method to improve the autonomous strawberry harvesting pipeline and serves as a branching-off point for many future developments. In this section, we briefly outline three recommended directions for future work.

Parallelization.

Our training procedure consisted of two independent processes: a worker collecting training data and an asynchronous update procedure using samples from the experience replay. In future work, this could be extended to include multiple workers for the data collection process with minimal modifications to the underlying algorithm. In [11], parallelization was used to significantly speed up deep reinforcement learning for a real-world manipulation task. Such a modification is a natural extension to autonomous harvesting since most existing systems already use multiple workers during the harvesting process.

Partial Observability.

In Section 4, we defined the agent’s “state” to be its raw camera image and its end-effector position. This presents problems in a heavily occluded environments due to partial observability at each timestep, which may lead to sub-optimal actions. If the agent had some mechanism to propagate relevant state information through time, then it would be able to overcome this limitation. We propose several improvements for future work. First, the state itself could be modified to include additional information from previous state(s). In the simplest case, this could consist of appending the previous position to the current state vector, while a more extreme case is frame stacking. These strategies are limited by their finite horizon and linear memory complexity. A more sophisticated (and likely more successful) approach would be to provide a framework for the agent to learn which information to propagate through time. Possible frameworks for this include recurrent neural networks [19][13][12] or an external memory unit [22][30].

Physical Implementation

It is imperative to go beyond the simulated environment and evaluate our algorithms on a physical system. While the simulated environment is convenient and enables a rapid development cycle, without physical testing we can only speculate on reinforcement learning’s utility in real-world harvesting applications. By comparing results between the two domains, we also will be able to improve the simulated environment and better understand its limitations for future research.

Bibliography30

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Erb - ruby templating. http://ruby-doc.org/stdlib-2.6.1/libdoc/erb/rdoc/ERB.html .
2[2] Kinova website. https://www.kinovarobotics.com/en/products/assistive-technologies/kinova-jaco-assistive-robotic-arm .
3[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasud
4[4] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine , 34(6):26–38, Nov 2017.
5[5] Richard Bellman. A markovian decision process. Indiana Univ. Math. J. , 6:679–684, 1957.
6[6] Dan Charles. Robots are trying to pick strawberries. so far, they’re not very good at it. https://www.npr.org , Mar 2018. Online; Accessed October 2018.
7[7] François Chaumette and Seth Hutchinson. Visual servo control. i. basic approaches. IEEE Robotics & Automation Magazine , 13(4):82–90, 2006.
8[8] S.G. Defterli, Yeyin Shi, Yunjun Xu, and Reza Ehsani. Review of robotic technology for strawberry production. 32:301–318, 01 2016.