Zermelo's problem: Optimal point-to-point navigation in 2D turbulent flows using Reinforcement Learning

Luca Biferale; Fabio Bonaccorso; Michele Buzzicotti; Patricio Clark Di Leoni; Kristian Gustavsson

arXiv:1907.08591·nlin.CD·January 8, 2020

TL;DR

This paper demonstrates that Reinforcement Learning can find quasi-optimal solutions to Zermelo's problem in 2D turbulent flows, providing navigation strategies that are more robust to perturbed initial conditions and noise than analytical Optimal Navigation solutions.

Contribution

The study introduces an Actor-Critic RL approach for Zermelo's problem in turbulent flows, showing its robustness and superiority over analytical solutions in practical scenarios.

Findings

01. RL solutions are more robust to initial condition changes and noise.

02. RL outperforms the trivial steer-toward-the-target policy in time-dependent flows, while analytical ON strategies are compared only for the frozen flow.

03. RL effectively exploits flow properties for navigation with low steering speed.

Abstract

To find the path that minimizes the time to navigate between two given points in a fluid flow is known as Zermelo's problem. Here, we investigate it by using a Reinforcement Learning (RL) approach for the case of a vessel which has a slip velocity with fixed intensity, Vs , but variable direction and navigating in a 2D turbulent sea. We show that an Actor-Critic RL algorithm is able to find quasi-optimal solutions for both time-independent and chaotically evolving flow configurations. For the frozen case, we also compared the results with strategies obtained analytically from continuous Optimal Navigation (ON) protocols. We show that for our application, ON solutions are unstable for the typical duration of the navigation process, and are therefore not useful in practice. On the other hand, RL solutions are much more robust with respect to small changes in the initial conditions and to…

Equations

$$\begin{cases}\dot{\bm{X}}_{t}={\bm{u}}({\bm{X}}_{t},t)+{\bm{U}}^{ctrl}({\bm{X}}_{t})\\ {\bm{U}}^{ctrl}({\bm{X}}_{t})=V_{\rm s}{\bm{n}}({\bm{X}}_{t})\end{cases}$$

$${\bm{U}}^{ctrl}({\bm{X}}_{t})=V_{\rm s}{\bm{n}}({\bm{X}}_{t})$$

$${\bm{n}}({\bm{X}}_{t})=(\cos[\theta_{t}],\sin[\theta_{t}]),$$

$$\tilde{V}_{\rm s}=V_{\rm s}/u_{\max}.$$

$$\dot{\theta}_{t}=A_{21}\sin^{2}\theta_{t}-A_{12}\cos^{2}\theta_{t}+(A_{11}-A_{22})\cos\theta_{t}\sin\theta_{t},$$

$$r_{\rm tot}=\sum_{t}r_{t},$$

$$r_{t}=-\Delta t+\frac{|{\bm{x}}_{B}-{\bm{X}}_{t-\Delta t}|}{V_{\rm s}}-\frac{|{\bm{x}}_{B}-{\bm{X}}_{t}|}{V_{\rm s}},$$

$$r_{\rm tot}=-T_{A\to B}+\frac{|{\bm{x}}_{B}-{\bm{X}}_{t_{0}}|}{V_{\rm s}}-\frac{d_{B}}{V_{\rm s}}.$$

$$T^{\rm free}_{A\to B}=\frac{|{\bm{x}}_{B}-{\bm{x}}_{A}|}{V_{\rm s}}.$$

$$\pi(a_{j}|s_{i};\bm{q})=\frac{\exp h(s_{i},a_{j},\bm{q})}{\sum_{k=1}^{N_{a}}\exp h(s_{i},a_{k},\bm{q})},$$

$$\hat{v}(s_{i},\bm{w})=\sum_{i^{\prime}=1}^{N_{s}}w_{i^{\prime}}\,y_{i^{\prime}}(s_{i}),$$

$$\hat{r}_{t+\Delta t}=r_{t+\Delta t}+\hat{v}(s_{t+\Delta t},\bm{w}).$$

$$\begin{cases}\bm{q}_{t+\Delta t}=\bm{q}_{t}+\alpha_{t}\beta_{t}\nabla_{\bm{q}}\ln\pi(a_{t}|s_{t},\bm{q}_{t})\\ \bm{w}_{t+\Delta t}=\bm{w}_{t}+\alpha_{t}^{\prime}\beta_{t}\nabla_{\bm{w}}\hat{v}(s_{t},\bm{w}_{t}),\end{cases}$$

$$\beta_{t}=[\hat{r}_{t+\Delta t}-\hat{v}(s_{t},\bm{w}_{t})].$$

$$\begin{cases}\bm{q}_{t+\Delta t}=\bm{q}_{t}+\alpha_{t}\beta_{t}\Big[\bm{z}(s_{t},a_{t})-\sum_{j=1}^{N_{a}}\pi(a_{j}|s_{t},\bm{q}_{t})\,\bm{z}(s_{t},a_{j})\Big]\\ \bm{w}_{t+\Delta t}=\bm{w}_{t}+\alpha_{t}^{\prime}\beta_{t}\,\bm{y}(s_{t}).\end{cases}$$

$$\Delta_{OW}=(A_{11}-A_{22})^{2}+(A_{21}+A_{12})^{2}-(A_{21}-A_{12})^{2}.$$

$$g(s_{i})=\max_{a_{j}}\pi^{*}(a_{j},s_{i})$$

$$\partial_{t}u_{i}+u_{j}\partial_{j}u_{i}=-\partial_{i}p+\nu\Delta u_{i}+f_{i}$$

$$H(\bm{X}_{t},\bm{p}_{t},\theta_{t})=[\bm{u}(\bm{X}_{t})+V_{\rm s}\bm{n}(\theta_{t})]\cdot\bm{p}_{t}-1,$$

$$\dot{\bm{X}}_{t}=\frac{\partial H}{\partial\bm{p}_{t}}=\bm{u}(\bm{X}_{t})+V_{\rm s}\bm{n}(\theta_{t})$$

$$\dot{\bm{p}}_{t}=-\frac{\partial H}{\partial\bm{X}_{t}}=-\mathbb{A}^{T}(\bm{X}_{t})\,\bm{p}_{t}$$

$$0=\frac{\partial H}{\partial\theta_{t}}=V_{\rm s}\,\frac{\partial\bm{n}(\theta_{t})}{\partial\theta_{t}}\cdot\bm{p}_{t}$$

$$\int_{0}^{T_{A\to B}}{\rm d}t\,L=\int_{0}^{T_{A\to B}}{\rm d}t\,[\dot{\bm{X}}_{t}\cdot\bm{p}_{t}-H]=\int_{0}^{T_{A\to B}}{\rm d}t=T_{A\to B}.$$

Full text

Zermelo’s problem: Optimal point-to-point navigation in 2D turbulent flows using Reinforcement Learning (article submitted to AIP Publishing Chaos, Focus Issue: When Machine Learning Meets Complex Systems: Networks, Chaos and Nonlinear Dynamics (2019))

L. Biferale

Dept. Physics and INFN University of Rome Tor Vergata, via della Ricerca Scientifica 1, 00133 Rome, Italy.

F. Bonaccorso

Dept. Physics and INFN University of Rome Tor Vergata, via della Ricerca Scientifica 1, 00133 Rome, Italy.

Center for Life Nano Science@La Sapienza, Istituto Italiano di Tecnologia, 00161 Roma, Italy

M. Buzzicotti

Dept. Physics and INFN University of Rome Tor Vergata, via della Ricerca Scientifica 1, 00133 Rome, Italy.

P. Clark Di Leoni

Dept. Physics and INFN University of Rome Tor Vergata, via della Ricerca Scientifica 1, 00133 Rome, Italy.

Dept. of Mechanical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA.

K. Gustavsson

Dept. of Physics, University of Gothenburg, Gothenburg, 41296, Sweden.

Abstract

To find the path that minimizes the time to navigate between two given points in a fluid flow is known as Zermelo’s problem. Here, we investigate it by using a Reinforcement Learning (RL) approach for the case of a vessel which has a slip velocity with fixed intensity, $V_{\rm s}$ , but variable direction and navigating in a 2D turbulent sea. We show that an Actor-Critic RL algorithm is able to find quasi-optimal solutions for both time-independent and chaotically evolving flow configurations. For the frozen case, we also compared the results with strategies obtained analytically from continuous Optimal Navigation (ON) protocols. We show that for our application, ON solutions are unstable for the typical duration of the navigation process, and are therefore not useful in practice. On the other hand, RL solutions are much more robust with respect to small changes in the initial conditions and to external noise, even when $V_{\rm s}$ is much smaller than the maximum flow velocity. Furthermore, we show how the RL approach is able to take advantage of the flow properties in order to reach the target, especially when the steering speed is small.

**Zermelo’s point-to-point optimal navigation problem in the presence of a complex flow is key for a variety of geophysical and applied instances. In this work, we apply Reinforcement Learning to solve Zermelo’s problem in a multi-scale 2d turbulent snapshot for both frozen-in-time velocity configurations and fully time-dependent flows. We show that our approach is able to find the quasi-optimal path to navigate between two distant points with high efficiency, even when compared with policies obtained from optimal control theory. Furthermore, we connect the learned policy with the topological flow structures that must be harnessed by the vessel to navigate fast. Our result can be seen as a first step towards more complicated applications to surface oceanographic problems and/or 3d chaotic and turbulent flows.**

I Introduction

Path planning for small autonomous marine vehicles Petres et al. (2007); Witt and Dunbabin (2008) such as wave and current gliders Kraus (2012); Smith et al. (2011), active drifters Lumpkin and Pazos (2007); Niiler (2001), buoyant underwater explorers, and small swimming drones is key for many geo-physical Lermusiaux et al. (2017) and engineering Bechinger et al. (2016); Kurzthaler et al. (2018); Popescu, Tasinkevych, and Dietrich (2011); Baraban et al. (2012) applications. In nature, these vessels are affected by environmental disturbances like wind, waves and ocean currents, often in competition, and often characterized by unpredictable (chaotic) evolutions. This is problematic when one wants to send probes to specific locations, for example when trying to optimize data-assimilation for environmental applications Lermusiaux et al. (2017); Carrassi et al. (2018); Lakshmivarahan and Lewis (2013); Clark Di Leoni, Mazzino, and Biferale (2018, 2019). Most of the time, a dense set of fixed platforms or manned vessels is not an economically viable solution. As a result, scientists rely on networks of moving sensors, e.g. near-surface current drifters Centurioni (2018) or buoyant explorers Roemmich et al. (2009). In both cases, the platforms move with the surface flow (or with a depth current) and are either fully passive Centurioni (2018) or inflatable/deflatable with some predetermined scheduled protocol Roemmich et al. (2009). The main drawback is that they might be distributed in a non-optimal way, as they might accumulate in uninteresting regions, or disperse away from key points. Besides this applied motivation, the problem of (time) optimal point-to-point navigation in a flow, known as Zermelo’s problem Zermelo (1931), is interesting per se in the framework of Optimal Control Theory Bryson and Ho (1975); Ben-Asher (2010); Liebchen and Löwen (2019); Hays et al. (2014).

In this paper, we tackle Zermelo’s problem for a specific but important application, the case of a two-dimensional fully turbulent flow Xia et al. (2011); Boffetta and Ecke (2012); Alexakis and Biferale (2018) with an entangled distribution of complex spatial features, such as recirculating eddies or shear-regions, and with multi-scale spectral properties (see Fig. 1 for a graphical summary of the problem). In such conditions, even for time-independent flow configurations, trivial or naive navigation policies can be extremely inefficient and ineffective. To overcome this, we have implemented an approach based on Reinforcement Learning (RL) Sutton and Barto (2018) in order to find a robust quasi-optimal policy that accomplishes the task, and we applied it to time-independent flow configurations and to the case when the underlying flow evolves with its own chaotic and turbulent dynamics given by the Navier-Stokes equations. Furthermore, we compare RL with an approach based on Optimal Navigation (ON) theory Rugh and Rugh (1996); Pontryagin (2018). To the best of our knowledge, only simple advecting flows have been studied so far for both ON Techy (2011); Yoo and Kim (2016); Liebchen and Löwen (2019) and RL Yoo and Kim (2016).

Promising results have been obtained when applying RL algorithms to similar problems, such as the training of smart inertial particles or swimming particles navigating intense vortex regions Colabrese et al. (2018), Taylor Green flows Colabrese et al. (2017) and ABC flows Gustavsson et al. (2017). RL has also been successfully implemented to reproduce schooling of fish Gazzola et al. (2016); Verma, Novati, and Koumoutsakos (2018), soaring of birds in turbulent environments Reddy et al. (2016, 2018) and in many other applications Muinos-Landin, Ghazi-Zahedi, and Cichos (2018); Novati, Mahadevan, and Koumoutsakos (2018); Tsang et al. (2018). Similarly, in recent years, artificial intelligence techniques have been establishing themselves as new data driven models for fluid mechanics in general Pathak et al. (2017); King et al. (2018); Vlachas et al. (2018); Lu, Hunt, and Ott (2018); Mohan et al. (2019); Brunton, Noack, and Koumoutsakos (2019).

In this paper, we show that for the case of vessels that have a slip velocity with fixed intensity but variable direction, RL can find a set of quasi-optimal paths to efficiently navigate the flow. Moreover, RL, unlike ON, can provide a set of highly stable solutions, which are insensitive to small disturbances in the initial condition and successful even when the slip velocity is much smaller than the guiding flow. We also show how the RL protocol is able to take advantage of different features of the underlying flow in order to achieve its task, indicating that the information it learns is non-trivial.

The paper is organized as follows. In Sec. II we present the general set-up of the problem, write the equations of motion of the vessels, and give details on the underlying flow and tasks. In Sec. III we first present an overview of the RL algorithm used in this paper and then show the results obtained using it, while in Sec. IV we do the same for the ON case. In Sec. V we compare the results obtained from the two approaches. Finally, we give our conclusions in Sec. VI.

II Problem set-up

For our analysis we use one static snapshot from a numerical realization of 2D turbulence, and try to learn the optimal path connecting two different sets of starting and ending points; we call these problems P1 and P2, respectively. In Fig. 1 we show a sketch of the set-up (see the caption of the figure for further details on the turbulent realization used and how it was generated). Our goal is to find (if they exist) trajectories that join the region close to $\bm{x}_{A_{1}}$ with a target close to $\bm{x}_{B_{1}}$ (problem P1) and $\bm{x}_{A_{2}}$ with $\bm{x}_{B_{2}}$ (problem P2) in the shortest possible time assuming that the vessels obey the following equations of motion:

$$\dot{\bm{X}}_{t}={\bm{u}}({\bm{X}}_{t},t)+{\bm{U}}^{ctrl}({\bm{X}}_{t}) \tag{1}$$

where ${\bm{u}}({\bm{X}}_{t},t)$ is the velocity of the underlying 2D advecting flow, and

$${\bm{U}}^{ctrl}({\bm{X}}_{t})=V_{\rm s}{\bm{n}}({\bm{X}}_{t})$$

is the control slip velocity of the vessel with fixed intensity $V_{\rm s}$ and varying steering direction ${\bm{n}}$ :

$${\bm{n}}({\bm{X}}_{t})=(\cos[\theta_{t}],\sin[\theta_{t}]),$$

where the angle is evaluated along the trajectory, $\theta_{t}=\theta({\bm{X}}_{t})$ . We introduce a dimensionless slip velocity by dividing $V_{\rm s}$ with the maximum velocity $u_{\max}$ of the underlying flow:

$$\tilde{V}_{\rm s}=V_{\rm s}/u_{\max}.$$

In this framework, Zermelo’s problem reduces to optimizing the space-time dependency of $\theta$ in order to reach the target Zermelo (1931). For the case when the flow is time independent or its evolution is so slow that it can be considered approximately frozen, a general solution, given by optimal control theory, can be found in Refs. Techy (2011); Mannarini et al. (2016). In particular, assuming that the angle $\theta$ is controlled continuously in time, one can prove that, if there exists an optimal trajectory that joins $\bm{x}_{A}$ with $\bm{x}_{B}$ with a given initial angle $\theta_{t_{0}}$ , the optimal steering angle must satisfy the following time-evolution:

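$$\dot{\theta}_{t}=A_{21}\sin^{2}\theta_{t}-A_{12}\cos^{2}\theta_{t}+(A_{11}-A_{22})\cos\theta_{t}\sin\theta_{t}, \tag{3}$$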

where $A_{ij}=\partial_{j}u_{i}(\bm{X}_{t})$ is evaluated along the agent trajectory $\bm{X}_{t}$ obtained from Eq. (1). The set of equations (1) together with (3) forms a three-dimensional dynamical system, which may result in chaotic dynamics even though the fluid velocity is 2D and time-independent (tracer particles cannot exhibit chaotic dynamics in such a flow). Due to the sensitivity to small errors in chaotic systems, the ON approach might become useless for many practical applications. Moreover, even in the presence of a global non-positive maximal Lyapunov exponent, where the long time evolution of a generic trajectory is attracted toward fixed points or periodic orbits for almost all initial conditions, the finite time Lyapunov exponents (FTLE) Ott (2002); Vulpiani, Cecconi, and Cencini (2009) can be positive for particular initial conditions and for a time longer than the typical navigation time. In this case, a navigation protocol based on (3) would be unstable for all practical purposes. This is most likely the reason why previous works on ON have dealt mainly with simple advecting flow configurations Techy (2011); Yoo and Kim (2016); Liebchen and Löwen (2019).
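To make the three-dimensional system $(\bm{X}_{t},\theta_{t})$ concrete, the following is a minimal sketch that integrates the equation of motion together with the optimal steering law using a fourth-order Runge-Kutta step. It does not use the turbulent snapshot of the paper: the steady cellular flow $\bm{u}=(\sin x\cos y,-\cos x\sin y)$, the function names, the time step and the initial conditions are all illustrative assumptions.

```python
import numpy as np

def flow(x, y):
    """Velocity u and gradient A_ij = d_j u_i of an ASSUMED steady cellular flow."""
    u = np.array([np.sin(x) * np.cos(y), -np.cos(x) * np.sin(y)])
    A = np.array([[np.cos(x) * np.cos(y), -np.sin(x) * np.sin(y)],
                  [np.sin(x) * np.sin(y), -np.cos(x) * np.cos(y)]])
    return u, A

def rhs(state, vs):
    """Right-hand side for (x, y, theta): vessel motion plus optimal steering."""
    x, y, theta = state
    u, A = flow(x, y)
    c, s = np.cos(theta), np.sin(theta)
    dtheta = A[1, 0] * s**2 - A[0, 1] * c**2 + (A[0, 0] - A[1, 1]) * c * s
    return np.array([u[0] + vs * c, u[1] + vs * s, dtheta])

def integrate(state0, vs, dt=1e-2, n_steps=2000):
    """Classical RK4 integration; returns an array of shape (n_steps + 1, 3)."""
    state = np.array(state0, dtype=float)
    traj = [state.copy()]
    for _ in range(n_steps):
        k1 = rhs(state, vs)
        k2 = rhs(state + 0.5 * dt * k1, vs)
        k3 = rhs(state + 0.5 * dt * k2, vs)
        k4 = rhs(state + dt * k3, vs)
        state = state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj.append(state.copy())
    return np.array(traj)

if __name__ == "__main__":
    # Two trajectories whose initial steering angles differ by 1e-6 (hypothetical values).
    a = integrate([0.3, 0.2, 0.5], vs=0.2)
    b = integrate([0.3, 0.2, 0.5 + 1e-6], vs=0.2)
    separation = np.linalg.norm(a[:, :2] - b[:, :2], axis=1)
    print("final separation:", separation[-1])
```

Monitoring how `separation` grows for different initial angles is one simple way to probe the finite-time sensitivity discussed above; whether it grows appreciably depends on the flow and on the initial condition chosen.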

III The Reinforcement Learning approach

III.1 Methods

RL applications Sutton and Barto (2018) are based on the idea that an optimal solution for certain complex problems can be obtained by learning from continuous interactions of an agent with its environment. The agent interacts with the environment by sampling its states $s$ , performing actions $a$ and collecting rewards $r$ . In the approach used here, actions are chosen randomly with a probability that is given by the policy function, $\pi(a|s)$ , given the current state $s$ of the surrounding environment. The goal is to find the optimal policy $\pi^{*}(a|s)$ that maximizes the total reward,

$$r_{\rm tot}=\sum_{t}r_{t}, \tag{4}$$

accumulated along one episode, i.e. one trial. To accomplish this goal, RL works in an iterative fashion. Different attempts, or episodes, are performed and the policy is updated to improve the total reward. The initial policy of each episode coincides with the final policy of the previous episode. During the training phase optimality is approached as the total reward for each episode converges (as a function of the number of episodes) to a fixed value (up to stochastic fluctuations).

In our case the vessel acts as the agent and the two-dimensional flow as the environment. As shown in Fig. 1, we define the states by covering the flow domain with square tiles $s_{i}$ with $i=1,2,\dots,N_{s}$ of size $\delta\times\delta$ . Here $N_{s}=900$ and $\delta\sim L/10$ , where $L$ is the large-scale periodicity of the flow. In other words, we suppose that the agent is able to identify its absolute position in the flow within a given approximation determined by the tile size $\delta$ . Furthermore, to be realistic, we allow the agent to sample states and change action only at given time intervals $\Delta t\sim T_{v}/20$ , where $T_{v}$ is the characteristic flow time, $T_{v}=L/u_{\max}$ . The possible actions (steering directions) correspond to the eight angles shown in Fig. 1, namely $\theta_{j}=(j-1)\pi/4$ with $j=1,\dots,8$ . Each episode is defined as one attempt to reach the target, where we make sure that the sum in Eq. (4) is always finite by imposing a maximum time $T_{\max}$ (chosen to be of the order of $10$ times the typical navigation time) after which we terminate the episode. To identify a time-optimal trajectory we use a potential based reward shaping Andrew, Harada, and Russelt (1999) at each time $t$ during the learning process:

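$$r_{t}=-\Delta t+\frac{|{\bm{x}}_{B}-{\bm{X}}_{t-\Delta t}|}{V_{\rm s}}-\frac{|{\bm{x}}_{B}-{\bm{X}}_{t}|}{V_{\rm s}}, \tag{5}$$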

where ${\bm{x}}_{B}$ is the center of the final target region. The first term on the RHS of Eq. (5) accumulates to a large penalty if the agent takes long to reach the end point. The second and third terms give the relative improvement in the distance-from-target potential during the training episode, which is known to preserve the optimal policy and help the algorithm to converge faster Andrew, Harada, and Russelt (1999). An episode is finalized when the trajectory reaches the circle of radius $d_{B}\sim 0.9\,\delta$ around the target; $d_{B}$ is roughly $10$ times smaller than the distance between the target and the starting position, see Fig. 1. If an agent does not reach the target within time $T_{\max}$ , or gets as far as $3L$ from the target, the episode is ended and a new episode begins. In the latter case the agent receives an extra negative reward equal to $-2T_{\max}$ , in order to strongly penalize these failures. By summation of Eq. (5) over the entire duration of the episode, the total reward (4) becomes

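$$r_{\rm tot}=-T_{A\to B}+\frac{|{\bm{x}}_{B}-{\bm{X}}_{t_{0}}|}{V_{\rm s}}-\frac{d_{B}}{V_{\rm s}}. \tag{6}$$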

Eq. (6) is approximately equal to the difference between the time to reach the target without a flow and the actual time taken by the trajectory: $r_{\rm tot}\approx T^{\rm free}_{A\to B}-T_{A\to B}$ , where the free-flight time is defined as

$$T^{\rm free}_{A\to B}=\frac{|{\bm{x}}_{B}-{\bm{x}}_{A}|}{V_{\rm s}}.$$
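The telescoping of the shaped reward can be checked numerically. The sketch below is an illustration under stated assumptions: the path is a random walk rather than a vessel trajectory, and the function names and parameter values are hypothetical. It sums the per-step reward and compares it with the closed form, in which the final distance to the target plays the role of $d_{B}$.

```python
import numpy as np

def shaped_rewards(positions, x_b, vs, dt):
    """Per-step reward: -dt plus the decrease of the distance-to-target potential."""
    d = np.linalg.norm(np.asarray(positions) - np.asarray(x_b), axis=1)
    return -dt + (d[:-1] - d[1:]) / vs

def check_telescoping(n_points=50, vs=0.2, dt=0.05, seed=0):
    """Return (summed reward, closed form) for an ARBITRARY random path."""
    rng = np.random.default_rng(seed)
    positions = np.cumsum(rng.normal(size=(n_points, 2)), axis=0)
    x_b = np.array([3.0, -1.0])  # hypothetical target position
    total = shaped_rewards(positions, x_b, vs, dt).sum()
    elapsed = dt * (n_points - 1)
    d_start = np.linalg.norm(positions[0] - x_b)
    d_end = np.linalg.norm(positions[-1] - x_b)
    closed_form = -elapsed + d_start / vs - d_end / vs
    return total, closed_form

if __name__ == "__main__":
    total, closed_form = check_telescoping()
    print(total, closed_form)
```

Because the intermediate distances cancel in pairs, the two numbers agree up to round-off for any path, which is why the shaping terms change only the bookkeeping of the reward and not the ranking of trajectories by arrival time.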

In order to converge to policies that are robust against small perturbations of the initial condition, which is an important property in the presence of chaos, each episode is started with a uniformly random position within a given radius from the starting point $|{\bm{x}}_{A}-{\bm{X}}_{t_{0}}|<d_{A}\sim 0.4\,\delta$ . Following from our action-state space discretization, with $i=1,\dots,N_{s}$ states and $j=1,\dots,N_{a}$ actions, a natural choice for the policy parametrization is the softmax distribution defined as:

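$$\pi(a_{j}|s_{i};\bm{q})=\frac{\exp h(s_{i},a_{j},\bm{q})}{\sum_{k=1}^{N_{a}}\exp h(s_{i},a_{k},\bm{q})}, \tag{8}$$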

where $h(s_{i},a_{j},\bm{q})=\sum_{i^{\prime}j^{\prime}}q_{j^{\prime},i^{\prime}}\,\,z_{i^{\prime},j^{\prime}}(s_{i},a_{j})$ is a linear combination of the entries of an $N_{a}\times N_{s}$ matrix, $\bm{q}$ , of free parameters. Here we adopt the simplest choice of the feature matrix $z_{i^{\prime},j^{\prime}}(s_{i},a_{j})$ : a perfect non-overlapping tiling of the action-state space, $z_{i^{\prime},j^{\prime}}=\delta_{i,i^{\prime}}\delta_{j,j^{\prime}}$ . Unless the matrix of coefficients converges to a singular distribution for each state $s_{i}$ , the softmax expression (8) leads to a stochastic dynamics even for the optimal policy.
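With the tiling features, $h(s_{i},a_{j},\bm{q})$ reduces to the single table entry $q_{j,i}$, so the policy for a state is a softmax over one column of $\bm{q}$. A minimal sketch of this tabular policy is given below; the array layout, the function names and the max-subtraction used for numerical stability are assumptions made for illustration.

```python
import numpy as np

N_S, N_A = 900, 8  # number of tiles and of steering directions quoted in the text

def policy(q, i):
    """Softmax probabilities pi(a_j | s_i; q) for a table q of shape (N_A, N_S)."""
    h = q[:, i]
    e = np.exp(h - h.max())  # subtracting the max leaves the softmax unchanged
    return e / e.sum()

def sample_action(q, i, rng):
    """Draw an action index j in {0, ..., N_A - 1} according to the policy."""
    return int(rng.choice(q.shape[0], p=policy(q, i)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = np.zeros((N_A, N_S))       # all-zero table: uniform policy in every state
    print(policy(q, 0))
    q[3, 5] = 50.0                 # a strongly preferred action in state 5
    print(sample_action(q, 5, rng))
```

An all-zero table gives equal probability to the eight directions, while a column dominated by one entry approaches the deterministic (singular) limit mentioned above.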

During the training phase of the RL protocol, one needs to estimate the expected total future reward (4). In this paper, we follow the one-step actor-critic method Sutton and Barto (2018) based on a gradient ascent in the policy parametrization. The critic approach circumvents the need to generate a large number of trial episodes by introducing an estimate of the state-value function, $\hat{v}(s_{i},\bm{w})$ :

$$\hat{v}(s_{i},\bm{w})=\sum_{i^{\prime}=1}^{N_{s}}w_{i^{\prime}}\,y_{i^{\prime}}(s_{i}),$$

where $y_{i^{\prime}}(s_{i})=\delta_{i^{\prime},i}$ and $w_{i}$ are a set of free parameters (similar to $q_{ji}$ ). The expression $\hat{v}(s_{i},\bm{w})$ is used to estimate the future expected reward, $\hat{r}_{t}$ , in the gradient ascent algorithm:

$$\hat{r}_{t+\Delta t}=r_{t+\Delta t}+\hat{v}(s_{t+\Delta t},\bm{w}).$$

Finally, the parameterizations of the policy and the state-value functions are updated every time the state-space is sampled as:

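$$\begin{cases}\bm{q}_{t+\Delta t}=\bm{q}_{t}+\alpha_{t}\beta_{t}\nabla_{\bm{q}}\ln\pi(a_{t}|s_{t},\bm{q}_{t})\\ \bm{w}_{t+\Delta t}=\bm{w}_{t}+\alpha_{t}^{\prime}\beta_{t}\nabla_{\bm{w}}\hat{v}(s_{t},\bm{w}_{t}),\end{cases} \tag{11}$$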

where $s_{t},a_{t}$ are the $(i,j)$ state-action pairs that are explored at time $t$ during the episode, while $\beta_{t}$ is the future expected reward minus the state-value function, now used as baseline

$$\beta_{t}=[\hat{r}_{t+\Delta t}-\hat{v}(s_{t},\bm{w}_{t})].$$

The main appeal of the one-step actor-critic algorithm is that replacing the total reward with the one-step return plus the learned state-value function leads to an evolution of the gradient ascent that is fully local in time. The learning rates $\alpha_{t}$ , $\alpha_{t}^{\prime}$ , follow the Adam algorithm Kingma and Ba (2014) to improve the convergence performance over standard stochastic gradient descent. Both gradients in Eqs. (11) can be computed explicitly and the one-step actor-critic algorithm becomes

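$$\begin{cases}\bm{q}_{t+\Delta t}=\bm{q}_{t}+\alpha_{t}\beta_{t}\Big[\bm{z}(s_{t},a_{t})-\sum_{j=1}^{N_{a}}\pi(a_{j}|s_{t},\bm{q}_{t})\,\bm{z}(s_{t},a_{j})\Big]\\ \bm{w}_{t+\Delta t}=\bm{w}_{t}+\alpha_{t}^{\prime}\beta_{t}\,\bm{y}(s_{t}).\end{cases}$$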
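For the tabular features used here, the explicit update touches only the column of $\bm{q}$ and the entry of $\bm{w}$ belonging to the visited state. The following is a minimal sketch of one such update. It departs from the text in two labeled ways: it uses constant learning rates instead of Adam, and it assumes the value of a terminal state is zero. The function name and argument names are hypothetical.

```python
import numpy as np

def actor_critic_step(q, w, s, a, r, s_next, done, alpha=0.1, alpha_w=0.1):
    """One tabular one-step actor-critic update, performed in place.

    q: (N_A, N_S) policy table, w: (N_S,) state-value table.
    ASSUMPTIONS: constant learning rates (the text uses Adam) and a
    terminal-state value of zero.
    """
    v_next = 0.0 if done else w[s_next]
    beta = r + v_next - w[s]          # one-step return minus the baseline

    h = q[:, s]
    pi = np.exp(h - h.max())
    pi /= pi.sum()

    grad = -pi                        # minus the policy-weighted sum over actions
    grad[a] += 1.0                    # plus the feature of the action taken
    q[:, s] += alpha * beta * grad    # actor update
    w[s] += alpha_w * beta            # critic update
    return beta

if __name__ == "__main__":
    q = np.zeros((8, 4))
    w = np.zeros(4)
    beta = actor_critic_step(q, w, s=1, a=2, r=1.0, s_next=3, done=False)
    print(beta, q[:, 1], w)
```

A positive $\beta_{t}$ raises the preference for the action just taken and lowers the others by amounts that sum to zero, so the update redistributes probability within the visited state only.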

III.2 Results (time-independent flows)

Let us first examine the case when the flow field is time-independent. In Fig. 2 we show the evolution of the total reward as a function of the episode number for problem P1 using $\tilde{V}_{\rm s}=0.5$ . As one can see, the system reaches a stationary state and stable maximum reward, indicating that the RL protocol has converged to a certain policy. In the inset of Fig. 2, we show the trajectories of the vessel following the policies extracted from three different stages during the learning process. This illustrates that the policy evolves slowly toward one that generates stable and short paths joining $A_{1}$ and $B_{1}$ . This is the first result supporting the RL approach.

In Fig. 3 we show examples of trajectories generated with the final policies for both problems P1 and P2 and for different values of $V_{\rm s}$ . For large slip velocities, $\tilde{V}_{\rm s}=0.5$ or $\tilde{V}_{\rm s}=1.0$ , the optimal paths are very close (but not identical) to a straight line connecting the start and end points. For the case with small slip velocity, $\tilde{V}_{\rm s}=0.2$ , the vessel must make use of the underlying flow to reach the target. This is particularly clear for problem P2 where it navigates through very intense flow regions, as can be seen by looking at the correlations between the trajectories in red and the underlying flow intensity in the right panel of Fig. 3. To further illustrate this, we superpose in Fig. 4 the example trajectories of Fig. 3 with the Okubo-Weiss Okubo (1970); Weiss (1991) parameter of the flow (defined as the discriminant of the eigenvalues of the fluid-gradient matrix $\mathbb{A}$ )

$$\Delta_{OW}=(A_{11}-A_{22})^{2}+(A_{21}+A_{12})^{2}-(A_{21}-A_{12})^{2}.$$

The sign of this parameter determines if the flow is straining (positive) or rotating (negative) and the magnitude determines the degree of strain or rotation. When the slip velocity is small, $\tilde{V}_{\rm s}=0.2$ (red curves), the vessel tends to get attracted to the vicinity of the rotating regions where it exploits the coherent head wind to reach the target quickly.
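As an illustration of how the Okubo-Weiss parameter separates straining from rotating regions, the sketch below evaluates it on a periodic grid with centered finite differences. The test flow $\bm{u}=(\sin x\cos y,-\cos x\sin y)$, the grid size and the function name are assumptions for this example and are unrelated to the turbulent field of the paper, which would more naturally be differentiated spectrally.

```python
import numpy as np

def okubo_weiss(ux, uy, dx):
    """Okubo-Weiss parameter on a periodic grid (axis 0 is x, axis 1 is y).

    Gradients A_ij = d_j u_i use centered differences with periodic wrap-around.
    """
    def ddx(f):
        return (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / (2.0 * dx)

    def ddy(f):
        return (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2.0 * dx)

    a11, a12 = ddx(ux), ddy(ux)
    a21, a22 = ddx(uy), ddy(uy)
    return (a11 - a22) ** 2 + (a21 + a12) ** 2 - (a21 - a12) ** 2

if __name__ == "__main__":
    n = 64
    grid = np.arange(n) * 2.0 * np.pi / n
    x, y = np.meshgrid(grid, grid, indexing="ij")
    ux, uy = np.sin(x) * np.cos(y), -np.cos(x) * np.sin(y)  # ASSUMED test flow
    ow = okubo_weiss(ux, uy, grid[1] - grid[0])
    print("vortex centre:", ow[n // 4, n // 4], "stagnation point:", ow[0, 0])
```

In this test flow the parameter is negative at the vortex centre $(\pi/2,\pi/2)$ and positive at the hyperbolic stagnation point $(0,0)$, matching the sign convention stated above.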

One of the main results of this paper is connected to the high robustness of the RL solution, especially if compared with the ON (see later in Sec. V). Here we want to show that this property is connected to the fact that the RL optimal policy is the result of a systematic sampling of all regions inside the flow, with information not restricted to the few states that are visited by the shortest trajectories. In the left column of Fig. 5, we show a color coded map for the density of visited states for target P1 (similar results are obtained for P2). As one can see, while there is obviously a high density close to the optimal trajectory, the system has also explored large regions around it, allowing it to also store non-trivial information about neighbouring regions of the optimal trajectory. Similarly, in the right column of Fig. 5 we plot the degree of greediness, defined as the probability of the optimal action for each state:

$$g(s_{i})=\max_{a_{j}}\pi^{*}(a_{j},s_{i})$$

in order to have a direct assessment of the randomness in the policy. Close to the optimal trajectory the policy is almost fully deterministic, becoming more and more random as the distance to the optimal trajectory increases.

Finally, we compared the optimal policies found with RL to a trivial policy (TP), where the angle selected at each $\Delta t$ is given by the action that points most directly towards the target among the $8$ different possibilities. In Fig. 6 we show the trajectories optimized with RL and the ones following the TP at different $\tilde{V}_{\rm s}$ for problems P1 and P2, together with the probabilities of arrival times, $T_{A\to B}$ . The TP is able to perform well only when the navigating slip velocity is large, see $\tilde{V}_{\rm s}=1$ for P1. Conversely, for the more interesting case when $\tilde{V}_{\rm s}$ is small, the TP produces many failed attempts (as illustrated by the bars to the right end of the histograms): the arrival times tend to be much longer or infinite because the agents get trapped in recirculating regions from where it is difficult or impossible to exit. The results of the TP are even more bleak in P2, where TP is only successful when $\tilde{V}_{\rm s}=1$ and for the other cases the vessels always get trapped in the flow. In order to quantify the local differences between RL and TP along the optimal trajectories, we show in Fig. 7 the greedy solutions (solutions using the action with the highest probability) selected by the RL in the whole domain for all studied cases together with the probability of the angle mismatch between the greedy RL and TP, $Pr(\theta_{\rm RL}-\theta_{\rm TP})$ , (shown in the small boxes). As one can see, there is always a certain mismatch between the RL and the TP policies, confirming the difficulty of guessing a priori the quasi-optimal solutions discovered by the RL.
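The trivial policy is simple enough to state in a few lines. The sketch below is one possible reading of it, assuming that "points most directly towards the target" means the largest cosine between the steering direction and the bearing to the target; the function name and the zero-based action index are illustrative choices.

```python
import numpy as np

# The eight steering directions theta_j = (j - 1) * pi / 4, stored with a zero-based index.
ANGLES = np.arange(8) * np.pi / 4

def trivial_action(x, x_b):
    """Index of the steering angle best aligned with the bearing from x to x_b."""
    d = np.asarray(x_b, dtype=float) - np.asarray(x, dtype=float)
    bearing = np.arctan2(d[1], d[0])
    # cos(theta_j - bearing) is the alignment between direction j and the target bearing
    return int(np.argmax(np.cos(ANGLES - bearing)))

if __name__ == "__main__":
    print(trivial_action([0.0, 0.0], [1.0, 0.1]))
    print(trivial_action([0.0, 0.0], [-1.0, -1.0]))
```

The rule uses only the vessel position and the target position, with no knowledge of the flow, which is why it cannot anticipate recirculating regions.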

III.3 Results (time-dependent flows)

In this section we present some results for the more realistic, and more difficult, problem when we relax the assumption that the flow is slow or frozen and we consider Zermelo’s problem for a time-dependent two-dimensional turbulent velocity field. The time evolution is obtained by solving the incompressible Navier-Stokes equations:

$$\partial_{t}u_{i}+u_{j}\partial_{j}u_{i}=-\partial_{i}p+\nu\Delta u_{i}+f_{i} \tag{14}$$

on a periodic square with size $L=2\pi$ with $N=512^{2}$ collocation points, using a pseudo-spectral fully de-aliased code. In (14), $p$ is the pressure, $\nu=1\cdot 10^{-3}$ is the viscosity, and the forcing mechanism is Gaussian and delta-correlated in time, with support in Fourier space in the window $|{\bf k}|\in[5:6]$ . Stationary statistics are achieved by adding an energy sink at large scale to stop the inverse energy cascade. The averaged spectrum is shown in the inset (a) of Fig. 8.

In Fig. 8 we show the learning curve for Zermelo’s problem connecting the two locations $\bm{x}_{A_{3}}\to\bm{x}_{B_{3}}$ (see Fig. 10 for a graphical summary of the set-up) with propelling velocity $\tilde{V}_{s}=0.2$ . In the inset (b) of Fig. 8 we show the evolution of the kinetic energy, $0.5|{\bf u}(\bm{x},t)|^{2}$ , in the initial and final points for a time duration comparable to that of a typical navigation.

In Fig. 9 we show the PDF of arrival times, similar to Fig. 6, but for the time-dependent turbulent flow. We find that the RL approach clearly outperforms the trivial policy (TP), just as in the time-independent case.

The RL and TP policies are illustrated using six snapshots at different times in Fig. 10. The 6 snapshots show the time evolution of the flow and the growth of two sets of trajectories following learned or trivial policies. A video showing the full evolution can be found in the supplementary material.

Furthermore, in Fig. 11 we show the evolution of the flow kinetic energy along a representative vessel trained with RL compared with the equivalent case following a TP trajectory. As one can see, the RL policy is able to take advantage of the good flow structures to accelerate toward the target and to avoid those that would bring it away, unlike the TP, which blindly falls inside trapping or accelerating vortical structures.

In conclusion, RL is able to converge on quasi-optimal navigation policies even in flows with spatio-temporal chaos, such as the multi-scale 2d turbulent flows analyzed here. Moreover, the discovered policies are far from the trivial option of navigating by eye and always steering in the target’s direction.

IV The Optimal Navigation approach

We explore optimal navigation only for the time-independent flow, because then the problem can be solved analytically Bryson and Ho (1975). In this section we implement the approach proposed in Ref. Bryson and Ho (1975) for the time-independent case of Fig. 1.

IV.1 Methods

Eq. (3) gives the evolution of the steering angle that minimizes the time it takes to navigate from $\bm{x}_{A}$ to $\bm{x}_{B}$ provided that the system starts at the optimal initial angle, as was first derived by Zermelo Zermelo (1931). Following Bryson and Ho (1975), this control strategy can be obtained by mapping the problem onto a classical mechanics problem with a Hamiltonian

[TABLE]

where $\bm{p}_{t}=(p^{x}_{t},p^{y}_{t})$ are the generalized momenta of $\bm{X}_{t}$ . The corresponding Hamiltonian dynamics become

[TABLE]

where $\mathbb{A}$ is the fluid gradient matrix with components $A_{ij}=\partial_{j}u_{i}$ evaluated along $\bm{X}_{t}$. By construction, using the principle of least action, solutions of these dynamics from $\bm{x}_{A}$ at $t=0$ to $\bm{x}_{B}$ at $t=T_{A\to B}$ are extremal points of the action evaluated along a trajectory (1):

[TABLE]

Thus, the trajectory with the optimal time to navigate to the target satisfies the dynamics (15–17). Eq. (17) gives $\bm{p}\propto\bm{n}$ and it follows that the time-optimal control is obtained by solving the joint equations (15) and (16) using $\bm{n}=\bm{p}/|\bm{p}|$ . The corresponding angular dynamics, $\dot{\theta}_{t}={\rm d}\,({\rm atan}(p^{y}_{t}/p^{x}_{t}))/{\rm d}t$ , is identical to Eq. (3). We remark that the dynamics (16) tend to orient the vessels transverse to the maximal stretching directions of the underlying flow. As a consequence, vessels avoid high strain regions and accumulate in vortex regions.
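The step from the momentum equation to the angular dynamics can be made explicit. The following is only a sketch: it assumes a steady flow and a Hamiltonian of the standard minimum-time form of Bryson and Ho, $H=\bm{p}\cdot(\bm{u}+V_{\rm s}\bm{n})+1$, whose sign and additive-constant conventions may differ from those used above, and it assumes that Eq. (3) has the classical Zermelo form.

```latex
% Assumed minimum-time Hamiltonian and the resulting canonical equations,
% with n = (cos(theta), sin(theta)) and A_{ij} = \partial_j u_i.
\begin{align*}
H &= \bm{p}_t\cdot\left[\bm{u}(\bm{X}_t)+V_{\rm s}\,\bm{n}(\theta_t)\right]+1, \\
\dot{\bm{X}}_t &= \frac{\partial H}{\partial \bm{p}_t}=\bm{u}+V_{\rm s}\,\bm{n}, \\
\dot{\bm{p}}_t &= -\frac{\partial H}{\partial \bm{X}_t}=-\mathbb{A}^{\rm T}\bm{p}_t, \\
0 &= \frac{\partial H}{\partial \theta_t}
   = V_{\rm s}\,\bm{p}_t\cdot(-\sin\theta_t,\cos\theta_t)
   \;\Rightarrow\; \bm{p}_t\propto\bm{n}.
\end{align*}
% Angular dynamics from tan(theta) = p^y / p^x:
\begin{align*}
\dot{\theta}_t
 &= \frac{p^{x}_t\,\dot{p}^{y}_t-p^{y}_t\,\dot{p}^{x}_t}{|\bm{p}_t|^{2}}
  = \frac{-p^{x}_t\,(A_{12}p^{x}_t+A_{22}p^{y}_t)
          +p^{y}_t\,(A_{11}p^{x}_t+A_{21}p^{y}_t)}{|\bm{p}_t|^{2}} \\
 &= A_{21}\sin^{2}\theta_t-A_{12}\cos^{2}\theta_t
    +(A_{11}-A_{22})\cos\theta_t\sin\theta_t .
\end{align*}
```

The result is quadratic in $(\cos\theta_t,\sin\theta_t)$, so it is unchanged if $\bm{n}$ is taken anti-parallel rather than parallel to $\bm{p}_t$; only the proportionality $\bm{p}_t\propto\bm{n}$ matters for the angular dynamics.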

Solutions of Zermelo’s problem deliver the quickest trajectory joining the starting point to the target point for each initial condition. The challenge lies in finding the initial direction that hits the target point in the shortest time. In the set-up described in Section II, the target is not a single point but instead an area. We view this as a continuous family of Zermelo’s problems, where each target point corresponds to a point in the target area with an optimal initial angle and corresponding optimal time. Thus, the optimal path corresponds to the solution of Zermelo’s problem that has the quickest trajectory.

As we shall see, it is not straightforward to find the optimal trajectory by refining the angle $\theta_{t_{0}}$ for the initial condition. The complication is that Zermelo’s dynamics are often unstable in a non-linear flow. This implies that even initial conditions very close to each other may end up at different locations in the flow. As a consequence it is hard to find the optimal strategy: local refinement of $\theta_{t_{0}}$ tends to result in local minima rather than the global one.

IV.2 Results

To study this problem, we integrate Eqs. (3) and (1) numerically using a fourth-order Runge-Kutta scheme with a small time step, $\delta t=10^{-4}$, for 100 time units. We have explicitly checked that reducing the time step to $10^{-5}$ does not change vessel trajectories significantly on the time scales considered in this work. We consider the same targets and values of $V_{\rm s}$ as for the RL. In Fig. 12a we show the results for ON running 100 trajectories with uniformly gridded initial values of $\theta$ between $0$ and $2\pi$ and $\tilde{V}_{\rm s}=0.2$ (for target P2). As one can see, the trajectories wander around the flow, all missing the target. Fig. 12b shows a repetition of the experiment for a larger number of trajectories (10000); only the few trajectories that reach the target are shown (and terminated at the target). In order to empirically understand the stability of the protocol, we identified the optimal initial angle $\theta^{\rm ON}$ as that of the trajectory in Fig. 12b that reaches the target in the shortest time (red) and ran another 10000 trajectories with initial angles in an interval of length $\pi/2500$ around $\theta^{\rm ON}$ (this is the interval of so far unexplored initial conditions around $\theta^{\rm ON}$). Fig. 12c shows the evolution of $\theta_{t}$ for those trajectories that reach the target. We observe that there is a wide variability in the time to reach the target and that the trajectories are well mixed in the long run: the order of $\theta_{t}$ does not reflect the initial ordering $\theta_{t_{0}}$ at the time scale when the target is reached. Fig. 12d summarizes which initial angles reach the target and which fail. We observe that successes depend intermittently on the initial angle: continuous bands of successful initial conditions are interspersed with regions of failures. We also observe that within this sample there exist trajectories that reach the target quicker than the trajectory starting at $\theta^{\rm ON}$. However, due to the intermittent nature of successes, it is not possible to continuously change $\theta_{t_{0}}$, starting at $\theta^{\rm ON}$, to find the best sampled trajectory without passing through regions of failure. This highlights the difficulty of refining the initial condition to find the globally optimal trajectory when the ON protocol is applied to the non-linear system considered here.
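The shooting procedure described above can be sketched in a few lines. This is a minimal illustration, not the code used for Fig. 12: the steady cellular flow in `velocity` and `gradient`, the start and target positions, the capture radius and the coarse time step are all assumptions chosen to keep the example small, and the angular equation is written in the classical Zermelo form assumed for Eq. (3). The helper names `shoot` and `success_bands` are hypothetical.

```python
import numpy as np

V_S = 0.2   # slip speed in units of the toy flow's peak velocity (illustrative)
K = 1.0     # wavenumber of the toy cellular flow (assumption)

def velocity(x, y):
    """Toy steady flow from the stream function psi = sin(Kx) sin(Ky) / K."""
    return np.sin(K * x) * np.cos(K * y), -np.cos(K * x) * np.sin(K * y)

def gradient(x, y):
    """Gradient components A_ij = d_j u_i of the toy flow."""
    cc = K * np.cos(K * x) * np.cos(K * y)
    ss = K * np.sin(K * x) * np.sin(K * y)
    return cc, -ss, ss, -cc          # A11, A12, A21, A22

def rhs(state):
    """Right-hand side for (x, y, theta): advection plus slip, and Zermelo's angle law."""
    x, y, th = state
    u, v = velocity(x, y)
    a11, a12, a21, a22 = gradient(x, y)
    c, s = np.cos(th), np.sin(th)
    dth = a21 * s**2 - a12 * c**2 + (a11 - a22) * c * s
    return np.array([u + V_S * c, v + V_S * s, dth])

def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(state)
    k2 = rhs(state + 0.5 * dt * k1)
    k3 = rhs(state + 0.5 * dt * k2)
    k4 = rhs(state + dt * k3)
    return state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def shoot(theta0, start, target, radius, dt=1e-2, t_max=20.0):
    """Integrate all initial angles at once; return first arrival times (NaN = miss)."""
    n = theta0.size
    state = np.array([np.full(n, start[0]), np.full(n, start[1]),
                      theta0.astype(float)])
    arrival = np.full(n, np.nan)
    for i in range(1, int(round(t_max / dt)) + 1):
        state = rk4_step(state, dt)
        dist = np.hypot(state[0] - target[0], state[1] - target[1])
        # Record only the first time each trajectory enters the target disk;
        # trajectories keep being integrated afterwards but are not re-recorded.
        arrival[(dist < radius) & np.isnan(arrival)] = i * dt
    return arrival

def success_bands(arrival):
    """Return (first, last) index pairs of contiguous runs of successful angles."""
    ok = ~np.isnan(arrival)
    padded = np.concatenate(([False], ok, [False]))
    change = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(a), int(b) - 1) for a, b in zip(change[::2], change[1::2])]

# Uniform grid of initial angles in [0, 2*pi), as in the coarse scan.
theta0 = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
arrival = shoot(theta0, start=(0.5, 0.5), target=(2.5, 2.0), radius=0.2)
bands = success_bands(arrival)
best = int(np.nanargmin(arrival)) if bands else None  # index of quickest hit, if any
```

Refining the scan then amounts to calling `shoot` again on a finer grid inside the gap around `theta0[best]`. When `bands` contains several disjoint runs, no continuous variation of the initial angle connects them without crossing misses, which is the obstruction to local refinement discussed above.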

As shown in Fig. 12, ON is able to produce trajectories that join the starting and ending points. We will compare their times with the ones coming from the RL methods in the next section. Here we instead make a more quantitative analysis of the stability of the ON solutions. In order to do this, we evaluate the FTLE along a phase-space trajectory $(x,y,\theta)$ governed by the system of Eqs. (1) and (3). The FTLE are the local stretching rates, $\lambda_{i}(t)$ for $i=1,2,3$ , of a small phase-space separation $\bm{w}\equiv(x-x^{\prime},y-y^{\prime},\theta-\theta^{\prime})$ . By polar decomposition, we can write $\bm{w}(t)=\mathbb{V}(t)\mathbb{R}(t)\bm{w}(0)$ , where $\mathbb{R}(t)$ and $\mathbb{V}(t)$ are rotation and positive definite stretching matrices. The FTLE $\lambda_{i}(t)$ are defined from the eigenvalues $\exp[t\lambda_{1}(t)]$ , $\exp[t\lambda_{2}(t)]$ and $\exp[t\lambda_{3}(t)]$ of $\mathbb{V}$ . If the maximal FTLE is positive, the ON trajectory is unstable and small deviations from the trajectory are exponentially amplified with time. In Fig. 13, we show results of the FTLE for the quickest trajectories reaching the target using ON, i.e. trajectories starting at $\theta^{\rm ON}$ as defined in Fig. 12b. We find that the maximal FTLE is positive for the time it takes to reach the target for all velocities $\tilde{V}_{\rm s}$ considered. On the other hand, for times large enough, many trajectories approach fixed-point attractors, leading to a smooth decay towards negative FTLE. This effect becomes more evident the smaller the navigating velocity $V_{\rm s}$ is. For $V_{\rm s}\rightarrow 0$ the system of equations (1) decouples from Eq. (3) and the dynamics cease to be sensitive to small perturbations.
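The FTLE definition above translates into a short numerical recipe: the eigenvalues of $\mathbb{V}$ are the singular values of the tangent map $\mathbb{V}\mathbb{R}$, so $\lambda_{i}(t)$ is the logarithm of the $i$-th singular value divided by $t$. The sketch below estimates the tangent map by central finite differences of a user-supplied flow map. The text does not state how the FTLE were computed in practice, so the finite-difference approach and the names `jacobian_fd` and `ftle` are assumptions made for illustration.

```python
import numpy as np

def jacobian_fd(flow_map, z0, eps=1e-6):
    """Central finite-difference estimate of d z(t) / d z(0) at z0."""
    z0 = np.asarray(z0, dtype=float)
    n = z0.size
    jac = np.empty((n, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        jac[:, j] = (flow_map(z0 + dz) - flow_map(z0 - dz)) / (2.0 * eps)
    return jac

def ftle(flow_map, z0, t, eps=1e-6):
    """FTLE from the singular values of the tangent map, sorted from largest to smallest."""
    sigma = np.linalg.svd(jacobian_fd(flow_map, z0, eps), compute_uv=False)
    return np.log(sigma) / t

# Sanity check on a linear map w(t) = V R w(0) with known stretching rates.
rates = np.array([0.5, 0.0, -0.3])
t = 4.0
c, s = np.cos(0.7), np.sin(0.7)
rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
tangent = np.diag(np.exp(rates * t)) @ rotation
lam = ftle(lambda z: tangent @ z, [0.1, -0.2, 0.3], t)  # recovers `rates` up to rounding
```

To apply this to the ON dynamics, `flow_map` would integrate $(x,y,\theta)$ over a time $t$ with Eqs. (1) and (3). For strongly unstable trajectories and long times a finite-difference estimate becomes unreliable, because the initial offset `eps` is amplified beyond the linear regime; integrating the linearized equations with periodic re-orthonormalization is the more robust alternative.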

In conclusion, Fig. 13 shows that the system is unstable to small perturbations on the time scales needed to reach the target. This explains the observed behaviour of the trajectories in Fig. 12 and why the ON approach is intractable for practical applications.

V Comparison between RL and ON

In this section we make a side-by-side comparison of the RL and ON approaches. To better highlight the stability of RL compared to the ON solution, we need to specify how the two sets of simulations are initialized. The RL initial conditions are chosen in a circle of radius $d_{A_{i}}$ centered at $\bm{x}_{A_{i}}$ with a small spread in the initial angles (typically the probability of the non-greedy actions, which in the initial state is of the order of $0.01$). In contrast, we initialize the ON simulations at the exact spatial starting point, $\bm{x}_{A_{i}}$, with uniformly distributed initial angles in an interval of length $0.02$ around $\theta^{\rm ON}$ as defined in Fig. 12b. The ON approach could not be initialized starting in a box of side length $d_{A_{i}}=0.2$ because its unstable dynamics prevent it from working if the range of initial conditions is too wide.

We find that the minimum time taken by the best trajectory to reach the target is of the same order for the two methods. The main difference between RL and ON lies instead in their robustness. We illustrate this by plotting the spatial density of trajectories in the left column of Fig. 14 for the optimal policies of ON and RL with three values of $\tilde{V}_{\rm s}$ and initialized as described above. We observe that the RL trajectories (blue colour area) follow a coherent region in space, while the ON trajectories (red colour area) fill space almost uniformly, especially for large values of $\tilde{V}_{\rm s}$ . Moreover, for small navigation velocities, many trajectories in the ON system approach regular attractors, as visible by the high-concentration regions. Similar results are found for the trajectories following problem P1 (not shown).

The right column of Fig. 14 shows a comparison between the probability of arrival times for the trajectories illustrated in the left column. This provides a quantitative estimate of the greater robustness of RL compared to ON: even though the ON best time is comparable to or, sometimes, even slightly smaller than the RL minimum time, the ON probability has a much wider tail towards large arrival times and is always characterized by a much larger number of failures. All these results highlight a strong instability of the ON approach.
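A comparison of this kind reduces to a few summary statistics per method. The sketch below assumes that each trajectory is summarized by its arrival time, with NaN marking a trajectory that never reaches the target. The function name `arrival_statistics` and the sample numbers are invented for illustration and are not data from this work, and the histogram normalization over successful trajectories only is an assumption that may differ from the one used in Fig. 14.

```python
import numpy as np

def arrival_statistics(arrival, bins=10):
    """Summarize arrival times; NaN entries mark trajectories that never arrive."""
    arrival = np.asarray(arrival, dtype=float)
    reached = ~np.isnan(arrival)
    times = arrival[reached]
    stats = {
        "failure_fraction": 1.0 - reached.mean(),
        "best_time": times.min() if times.size else np.nan,
        "median_time": np.median(times) if times.size else np.nan,
    }
    if times.size:
        # density=True normalizes the histogram over successful trajectories only.
        stats["pdf"], stats["edges"] = np.histogram(times, bins=bins, density=True)
    return stats

# Made-up samples: one with a short best time but many failures and a long tail,
# one with a slightly longer best time but tightly clustered arrivals.
times_a = np.array([4.0, np.nan, 9.5, np.nan, 15.0, np.nan])
times_b = np.array([4.5, 4.8, 5.0, 4.6, np.nan, 5.2])
stats_a = arrival_statistics(times_a)
stats_b = arrival_statistics(times_b)
```

Reporting the best time together with the failure fraction and the median makes the distinction drawn above explicit: two methods can share nearly the same minimum arrival time while differing widely in how often, and how late, a typical trajectory arrives.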

VI Conclusions

We have presented a first systematic investigation of Zermelo’s time-optimal navigation problem in a realistic 2D turbulent flow. We have developed an RL approach based on an actor-critic algorithm, which is able to find quasi-optimal discretized policies with strong robustness against changes in the initial condition. In particular, we have considered constant navigation speeds, $V_{\rm s}$, not exceeding the maximum flow velocity $u_{\max}$, down to values of $V_{\rm s}\sim 0.2\,u_{\max}$, and for all cases we successfully identified quick navigation paths and close-to-optimal policies that differ strongly from the trivial choice of navigating towards the target at all times. We have also made a few attempts with additive noise in the equations for the vessel evolution and found that RL is able to reach a solution even in this case (results will be reported elsewhere). Furthermore, we investigated the relation between the optimal paths and the underlying topological properties of the flow, identifying the role played by coherent structures in guiding the vessel towards the target. Finally, for the time-independent flow we have compared RL with the Optimal Navigation approach, showing that the latter exhibits a strong sensitivity to the initial conditions and is thus inadequate for real-world applications. Many potential applications can be envisaged.

It is important to stress that the RL approach implemented here requires information about the flow evolution for the duration of the optimal trajectory. In realistic problems of travel planning, this information must be provided by a model of the evolution of the flow. Furthermore, each flow evolution and each pair of starting/arrival points requires individual optimization; it is highly unlikely that a solution to one optimization problem can be robustly applied to other situations. In this work we implemented RL using an actor-critic structure because of its flexibility to be extended to Deep-RL approaches Mnih et al. (2015); Zhu et al. (2017); Arulkumaran et al. (2017); Novati, Mahadevan, and Koumoutsakos (2018). Other RL algorithms, such as Q-learning Sutton and Barto (2018), could also be implemented. Our approach is based on repeated trials for an important reason: the objective of Zermelo’s problem is to find the trajectory that connects two points in the minimal time, not just any trajectory. For this reason we must be able to run different iterations and find the minimum (or at least a local minimum). Nonetheless, if we change our objective to just connecting two points, our RL approach can be extended to a fully online one where only one agent swims around the flow continuously until it reaches the target. We plan to tackle that problem in future research. For future work, it is also key to probe the efficiency of the different approaches considered here for 3D geometries, where already the simple uncontrolled tracer dynamics, $V_{s}=0$, can be chaotic even in time-independent flows Dombre et al. (1986); Bohr et al. (2005), opening new challenges for the optimization problem. Moreover, similar optimal navigation problems can be reformulated for inertial particles Toschi and Bodenschatz (2009); Gibert, Xu, and Bodenschatz (2012); Bec et al. (2007); Mordant et al. (2001), where the control is moved to the acceleration, with important potential applications to buoyant geophysical probes Roemmich et al. (2009). Work in these directions is in progress and will be reported elsewhere.

VII Supplementary material

Please see the supplementary material for a movie showing the side-by-side evolution of two sets of trajectories following the optimal RL policy or the trivial policy in a time-dependent 2D turbulent flow: https://www.fisica.uniroma2.it/~biferale/MOVIES/ZermeloMovie.arXiv1907.08591.mp4

VIII Acknowledgments

We acknowledge A. Celani for useful comments. L.B., M.B. and P.C.d.L. acknowledge funding from the European Union Programme (FP7/2007-2013) grant No.339032. K.G. acknowledges funding from the Knut and Alice Wallenberg Foundation, Dnar. KAW 2014.0048.

Bibliography


1Petres et al. (2007) C. Petres, Y. Pailhas, P. Patron, Y. Petillot, J. Evans, and D. Lane, “Path planning for autonomous underwater vehicles,” IEEE Transactions on Robotics 23 , 331–341 (2007).
2Witt and Dunbabin (2008) J. Witt and M. Dunbabin, “Go with the flow: Optimal auv path planning in coastal environments,” in Australian Conference on Robotics and Automation , Vol. 2008 (2008).
3Kraus (2012) N. D. Kraus, Wave glider dynamic modeling, parameter identification and simulation, Ph.D. thesis, University of Hawaii at Manoa, Honolulu (2012).
4Smith et al. (2011) R. N. Smith, J. Das, G. Hine, W. Anderson, and G. S. Sukhatme, “Predicting wave glider speed from environmental measurements,” in OCEANS’11 MTS/IEEE KONA (IEEE, 2011) pp. 1–8.
5Lumpkin and Pazos (2007) R. Lumpkin and M. Pazos, “Measuring surface currents with surface velocity program drifters: the instrument, its data, and some recent results,” Lagrangian analysis and prediction of coastal and ocean dynamics , 39–67 (2007).
6Niiler (2001) P. Niiler, “The world ocean surface circulation,” in International Geophysics, Vol. 77 (Elsevier, 2001) pp. 193–204.
7Lermusiaux et al. (2017) P. F. Lermusiaux, D. Subramani, J. Lin, C. Kulkarni, A. Gupta, A. Dutt, T. Lolla, P. Haley, W. Ali, C. Mirabito, et al. , “A future for intelligent autonomous ocean observing systems,” Journal of Marine Research 75 , 765–813 (2017).
8Bechinger et al. (2016) C. Bechinger, R. Di Leonardo, H. Löwen, C. Reichhardt, G. Volpe, and G. Volpe, “Active particles in complex and crowded environments,” Reviews of Modern Physics 88 , 045006 (2016).