Generalizable control for quantum parameter estimation through reinforcement learning

Han Xu; Junning Li; Liqiang Liu; Yu Wang; Haidong Yuan; Xin Wang

arXiv:1904.11298·quant-ph·April 29, 2021

TL;DR

This paper demonstrates that reinforcement learning can efficiently identify quantum control strategies that enhance parameter estimation precision and generalize across parameter values, with performance comparable to, and in some cases better than, the conventional GRAPE method.

Contribution

It introduces reinforcement learning as a novel, efficient, and highly generalizable approach for quantum control in parameter estimation tasks.

Findings

01

Reinforcement learning improves quantum parameter estimation precision.

02

Neural networks trained on one parameter value generalize across a broad range.

03

RL-based controls are comparable to GRAPE in most cases and outperform it in some (e.g. larger time steps).

Tables

Table 1. Supplementary Table S-I: Hyper-parameters for A3C and for A3C with the PPO strategy (“A3C+PPO”)

Hyper-parameter (A3C)	Value	Hyper-parameter (“A3C+PPO”)	Value
RMSProp learning rate	$10^{-5}$	Adam learning rate	$2\times 10^{-4}$
Reward decay factor ($\alpha$)	0.99	Reward decay factor ($\alpha$)	0.9
Entropy weight ($\eta$)	$10^{-4}$	Entropy weight ($\eta$)	$10^{-3}$
Batch size	$N$, $T/\Delta T$	Batch size	$N$, $T/\Delta T$
$C$, in reward function	10	$C$, in reward function	10
$\eta$, in reward function	1.001	$\eta$, in reward function	1.001
Maximum gradient norm	40	Maximum gradient norm	40
Maximum amplitudes ($\|u_{k}\|_{\max}$)	4	Maximum amplitudes ($\|u_{k}\|_{\max}$)	4
		PPO clipping $\epsilon$	0.12
		Num. PPO steps, $N_{\max}^{\rm ppo}$	10

Equations

\hat{H}(t)=\hat{H}_{0}(\omega)+\sum_{k=1}^{p}u_{k}(t)\hat{H}_{k},

\partial_{t}\hat{\rho}(t)=-i[\hat{H}(t),\hat{\rho}(t)]+{\it\Gamma}[\hat{\rho}(t)],

F(t)={\rm Tr}[\hat{\rho}(t)\hat{L}_{s}^{2}(t)],

\partial_{t}\hat{\rho}(t)=-i[\hat{H}(t),\hat{\rho}(t)]+\frac{\gamma}{2}[\hat{\sigma}_{\mathbf{n}}\hat{\rho}(t)\hat{\sigma}_{\mathbf{n}}-\hat{\rho}(t)],

\hat{H}(t)=\frac{1}{2}\omega_{0}\hat{\sigma}_{3}+\mathbf{u}(t)\cdot{\bm{\sigma}},

\langle F(T)/T\rangle=\frac{1}{2\Delta\omega}\int_{1-\Delta\omega}^{1+\Delta\omega}F(T)/T\,{\rm d}\omega .

\partial_{t}\hat{\rho}(t)=-i[\hat{H}(t),\hat{\rho}(t)]+\gamma_{+}\left[\hat{\sigma}_{+}\hat{\rho}(t)\hat{\sigma}_{-}-\frac{1}{2}\{\hat{\sigma}_{-}\hat{\sigma}_{+},\hat{\rho}(t)\}\right]+\gamma_{-}\left[\hat{\sigma}_{-}\hat{\rho}(t)\hat{\sigma}_{+}-\frac{1}{2}\{\hat{\sigma}_{+}\hat{\sigma}_{-},\hat{\rho}(t)\}\right],

u^{(j)}(t)=A^{(j)}\exp\{-[(t-t^{(j)})/\sigma^{{\rm g},(j)}]^{2}\},

R_{j}=\sum_{k=1}^{\infty}\alpha^{k-1}r_{j+k},

Q^{\pi}(s,a)=E[R_{j}\mid s_{j}=s,a_{j}=a],

V^{\pi}(s)=E[R_{j}\mid s_{j}=s],

Q^{*}(s,a;\theta_{v}^{*})

V^{*}(s;\theta_{v}^{*})

L_{Q}=[R_{j}^{n}+\alpha^{n}\max_{a^{\prime}}Q^{\pi}(s_{j+n},a^{\prime})-Q^{\pi}(s_{j},a)]^{2},

L_{V}=[R_{j}^{n}+\alpha^{n}V^{\pi}(s_{j+n})-V^{\pi}(s_{j})]^{2},

L=-\sum_{j}\log(\pi_{\theta}(a_{j}\mid s_{j}))A_{j},

A_{j}=R_{j}-b(s_{j}),

s_{j}=({\rm Re}(\hat{\rho}_{00}),{\rm Im}(\hat{\rho}_{00}),{\rm Re}(\hat{\rho}_{10}),{\rm Im}(\hat{\rho}_{10}),{\rm Re}(\hat{\rho}_{01}),{\rm Im}(\hat{\rho}_{01}),{\rm Re}(\hat{\rho}_{11}),{\rm Im}(\hat{\rho}_{11})) .

r_{j+1}=\left\{\begin{array}{ll}\frac{F(j+1)-\eta F_{0}(j+1)}{F_{0}(j+1)},&j+1<N,\\ \frac{F(j+1)-\eta F_{0}(j+1)}{F_{0}(j+1)}\times C,&j+1=N,\end{array}\right.

{\rm d}\theta\leftarrow{\rm d}\theta+\partial\min(\nu_{j}(\theta)A_{j},{\rm clip}(\nu_{j}(\theta),1-\epsilon,1+\epsilon)A_{j})/\partial\theta

{\rm d}\theta_{v}\leftarrow{\rm d}\theta_{v}+\partial A_{j}^{2}/\partial\theta_{v}

A_{j}=R_{j}^{n}+\alpha^{n}V^{\pi}(s_{j+n})-V^{\pi}(s_{j}) .

\nu_{j}(\theta)=\frac{\pi_{\theta}(a_{j}\mid s_{j})}{\pi_{\theta_{\rm old}}(a_{j}\mid s_{j})},

Code & Models

Repositories

MilCOS/Quantum_Parameter_Estimation_with_RL
pytorch

Full text

Generalizable control for quantum parameter estimation through reinforcement learning

Han Xu

Department of Physics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR, China, and City University of Hong Kong Shenzhen Research Institute, Shenzhen, Guangdong 518057, China

School of Physics and Technology, Wuhan University, Wuhan 430072, China

Junning Li

Department of Physics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR, China, and City University of Hong Kong Shenzhen Research Institute, Shenzhen, Guangdong 518057, China

Liqiang Liu

Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China

Yu Wang

School of Physics and Technology, Wuhan University, Wuhan 430072, China

Haidong Yuan

Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China

Xin Wang

Department of Physics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR, China, and City University of Hong Kong Shenzhen Research Institute, Shenzhen, Guangdong 518057, China

Abstract

Measurement and estimation of parameters are essential for science and engineering, where one of the main quests is to find systematic schemes that can achieve high precision. While conventional schemes for quantum parameter estimation focus on the optimization of the probe states and measurements, it has been recently realized that control during the evolution can significantly improve the precision. The identification of optimal controls, however, is often computationally demanding, as typically the optimal controls depend on the value of the parameter which then needs to be re-calculated after the update of the estimation in each iteration. Here we show that reinforcement learning provides an efficient way to identify the controls that can be employed to improve the precision. We also demonstrate that reinforcement learning is highly generalizable, namely the neural network trained under one particular value of the parameter can work for different values within a broad range. These desired features make reinforcement learning an efficient alternative to conventional optimal quantum control methods.

I Introduction

Metrology, which studies high precision measurement and estimation, has been one of the main driving forces in science and technology. Recently, quantum metrology, which uses quantum mechanical effects to improve the precision, has gained increasing attention for its potential applications in imaging and spectroscopy kolobov1999spatial ; lugiato2002quantum ; morris2015imaging ; roga2016security ; tsang2016quantum ; Giovannetti2011 .

One of the main quests in quantum metrology is to identify the highest precision that can be achieved with given resources. Typically the desired parameter, $\omega$ , is encoded in a dynamics ${\it\Lambda}_{\omega}$ . After an initial probe state $\rho_{0}$ is prepared, the parameter is encoded in the output state as $\rho_{\omega}={\it\Lambda}_{\omega}(\rho_{0})$ . Proper measurements on the output state then reveal the value of the parameter. To achieve the highest precision, one needs to optimize the probe states, the controls during the dynamics and the measurements on the output states. Previous studies have mostly focused on the optimization of the probe states and measurements Giovannetti2011 . Control has only recently started to gain attention yuan2015optimal ; yuan2016sequential ; Pang2017a ; Pang2017b ; Liu2017 ; Liu2017a ; Yang2017 ; Naghiloo2017 ; Sekatski2017 ; Braun2017 ; degen2017 ; braun2018 . It has now been realized that properly designed controls can significantly improve the precision limits. The identification of optimal controls, however, is often highly complicated and time-consuming. This issue is particularly severe in quantum parameter estimation, as typically optimal controls depend on the value of the parameter, which can only be estimated from the measurement data. When more data are collected, the optimal controls also need to be updated, which is conventionally achieved by another run of the optimization algorithm. This creates a high demand for efficient algorithms to find the optimal controls in quantum parameter estimation.

Over the past few years, machine learning has demonstrated astonishing achievements in certain high-dimensional input-output problems, such as playing video games Mnih2015 and mastering the game of Go Silver2016 . Reinforcement Learning (RL) sutton2018 is one of the most basic yet powerful paradigms of machine learning. In RL, an agent interacts with an environment whose rules and goals are set by the problem at hand. By trial and error, the agent optimizes its strategy to achieve the goals, which is then translated to a solution to the problem. RL has been shown to provide improved solutions to many problems related to quantum information science, including quantum state transfer Zhang2018 , quantum error correction Fosel2018 , quantum communication Wallnofer2019 , quantum control Bukov2018 ; Niu2019 ; An2019 and experiment design Melnikov2018 .

Here we show that RL serves as an efficient alternative to identify controls that are helpful in quantum parameter estimation. A main advantage of RL is that it is highly generalizable, i.e., the agent trained through RL under one value of the parameter works for a broad range of the values. There is then no need for re-training after the update of the estimated value of the parameter from the accumulated measurement data, which makes the procedure less resource-consuming under certain situations.

II Results

We consider a generic control problem described by the Hamiltonian Khaneja2005 :

$$\hat{H}(t)=\hat{H}_{0}(\omega)+\sum_{k=1}^{p}u_{k}(t)\hat{H}_{k},\tag{1}$$

where $\hat{H}_{0}$ is the time-independent free evolution of the quantum state, $\omega$ the parameter to be estimated, $u_{k}(t)$ the $k$ th time-dependent control field, $p$ the dimensionality of the control field, and $\hat{H}_{k}$ couples the control field to the state.

The density operator of a quantum state (pure or mixed) evolves according to the master equation breuer2002theory ,

$$\partial_{t}\hat{\rho}(t)=-i[\hat{H}(t),\hat{\rho}(t)]+{\it\Gamma}[\hat{\rho}(t)],\tag{2}$$

where ${\it\Gamma}[\hat{\rho}(t)]$ indicates a noisy process, the form of which depends on the specific noise mechanism and is given below.

The key quantity in quantum parameter estimation is the quantum Fisher information (QFI) helstrom1976quantum ; Holevo ; Petz2010 ; Braunstein1994 , defined by

$$F(t)={\rm Tr}[\hat{\rho}(t)\hat{L}_{s}^{2}(t)],\tag{3}$$

where $\hat{L}_{s}(t)$ is the so-called symmetric logarithmic derivative that can be obtained by solving the equation $\partial_{\omega}\hat{\rho}(t)=\frac{1}{2}\left[\hat{\rho}(t)\hat{L}_{s}(t)+\hat{L}_{s}(t)\hat{\rho}(t)\right]$ helstrom1976quantum ; Holevo ; Braunstein1996 . According to the Cramér-Rao bound, the QFI provides a saturable lower bound on the estimation error as $\delta\hat{\omega}\geq\frac{1}{\sqrt{nF(t)}}$ , where $\delta\hat{\omega}=\sqrt{E[(\hat{\omega}-\omega)^{2}]}$ is the standard deviation of an unbiased estimator $\hat{\omega}$ , and $n$ is the number of times the procedure is repeated. Our goal is therefore to search for optimal control sequences $u_{k}(t)$ that maximize the QFI at time $t=T$ (typically the conclusion of the control), $F(T)$ , respecting all constraints possibly imposed in specific problems. Practically, we consider piecewise constant controls, so the total evolution time $T$ is discretized into $N$ steps of equal length $\Delta T$ labeled by $j$ , and we use $u_{k}^{(j)}$ to denote the strength of the control field $u_{k}$ on the $j$ th time step.

Such problems are frequently tackled with the Gradient Ascent Pulse Engineering (GRAPE) method Khaneja2005 , which searches for an optimal set of control fields by updating their values according to the gradient of a cost function encapsulating the goal of the optimal control. It has been found that GRAPE is successful in preparing optimal control pulse sequences that improve the precision limit of quantum parameter estimation in noisy processes Liu2017 ; Liu2017a . Many alternative algorithms can tackle this optimization problem, such as the stochastic gradient ascent (descent) method and the microbial genetic algorithm Harvey2011 , but the convergence to the optimal control fields becomes much slower when the dimensionality ( $p$ ) of the control field or the number of discretization steps ( $N$ ) increases. Other optimal quantum control algorithms, such as Krotov’s method Sklarz2002 ; Palao2003 ; Machnes2011 ; Reich2012 ; Goerz2015 and the CRAB algorithm Doria2011 , typically depend on the value of the parameter and thus need to be run repeatedly as the estimate is updated, which is highly time-consuming. More efficient algorithms are thus highly desired.
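As a minimal numerical sketch of the QFI and the Cramér-Rao bound defined above (not the authors' implementation), the QFI can be evaluated in the eigenbasis of $\hat{\rho}$ using the standard expression $F=2\sum_{i,j}|\langle i|\partial_{\omega}\hat{\rho}|j\rangle|^{2}/(\lambda_{i}+\lambda_{j})$ , where the sum runs over pairs with $\lambda_{i}+\lambda_{j}>0$ . The function names `qfi`, `free_state` and `cramer_rao_bound` are hypothetical, and the derivative $\partial_{\omega}\hat{\rho}$ is assumed to be supplied by the caller, for example from a finite difference.

```python
import numpy as np


def qfi(rho, drho, tol=1e-12):
    """QFI F = Tr[rho L_s^2], computed from rho and d(rho)/d(omega).

    Uses the eigenbasis formula, which solves the symmetric logarithmic
    derivative equation without forming L_s explicitly.
    """
    vals, vecs = np.linalg.eigh(rho)
    # Matrix elements of the derivative in the eigenbasis of rho.
    d = vecs.conj().T @ drho @ vecs
    total = 0.0
    for i in range(len(vals)):
        for j in range(len(vals)):
            s = vals[i] + vals[j]
            if s > tol:  # skip pairs outside the support of rho
                total += 2.0 * abs(d[i, j]) ** 2 / s
    return total


def free_state(omega, t):
    """Probe (|0> + |1>)/sqrt(2) after free evolution under omega*sigma_3/2."""
    psi = np.array([np.exp(-0.5j * omega * t), np.exp(0.5j * omega * t)])
    psi = psi / np.sqrt(2.0)
    return np.outer(psi, psi.conj())


def cramer_rao_bound(f, n):
    """Lower bound 1/sqrt(n F) on the standard deviation of the estimate."""
    return 1.0 / np.sqrt(n * f)


# Illustrative usage: central finite difference in omega around omega = 1.
t, eps = 2.0, 1e-6
drho = (free_state(1.0 + eps, t) - free_state(1.0 - eps, t)) / (2.0 * eps)
f_free = qfi(free_state(1.0, t), drho)
bound = cramer_rao_bound(f_free, n=100)
```

For this control-free, noise-free probe the QFI evaluates to $F(t)=t^{2}$ up to finite-difference error, which is a convenient sanity check before adding noise or controls.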

In this work, we employ RL to solve the problem and compare the results to GRAPE. Our implementation of GRAPE follows Ref. Liu2017 . Figure 1 shows schematics of the RL procedure and the Actor-Critic algorithm sutton2018 used in this work. In order to improve the efficiency of computation, we used a parallel version of the Actor-Critic algorithm called Asynchronous Advantage Actor-Critic (A3C) algorithm Mnih2016 . For more extensive reviews of RL, Actor-Critic algorithm and A3C, see Methods and the Supplementary Methods.

Next we apply the algorithm to two commonly considered noisy processes: dephasing and spontaneous emission, to demonstrate the effect of the algorithm.

II.1 Dephasing Dynamics

Under dephasing dynamics, the master equation, Eq. (2), takes the following form Liu2017 :

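$$\partial_{t}\hat{\rho}(t)=-i[\hat{H}(t),\hat{\rho}(t)]+\frac{\gamma}{2}[\hat{\sigma}_{\mathbf{n}}\hat{\rho}(t)\hat{\sigma}_{\mathbf{n}}-\hat{\rho}(t)],\tag{4}$$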

where

$$\hat{H}(t)=\frac{1}{2}\omega_{0}\hat{\sigma}_{3}+\mathbf{u}(t)\cdot{\bm{\sigma}},\tag{5}$$

the control field $\mathbf{u}(t)=(u_{1},u_{2},u_{3})$ is a magnetic field that couples to ${\bm{\sigma}}=(\hat{\sigma}_{1},\hat{\sigma}_{2},\hat{\sigma}_{3})$ , and $\gamma$ is the dephasing rate which is taken as 0.1 throughout the paper. We consider a dephasing along a general direction given by ${\mathbf{n}}=(\sin{\vartheta}\cos{\phi},\sin{\vartheta}\sin{\phi},\cos{\vartheta})$ , $\hat{\sigma}_{\mathbf{n}}={\mathbf{n}}\cdot{\bm{\sigma}}$ . The parameter to be estimated is $\omega_{0}$ in Eq. (5), the true value of which is assumed to be 1, and we take $\omega_{0}^{-1}=1$ as our time unit. We choose the probe state, i.e. the initial state of the evolution, as $(|0\rangle+|1\rangle)/\sqrt{2}$ in all subsequent calculations, where $|0\rangle,|1\rangle$ are the eigenstates of $\hat{\sigma}_{3}$ .
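The dephasing dynamics described here can be sketched numerically as follows. This is an illustrative sketch rather than the authors' code: it propagates the density matrix over one piecewise-constant control step with a fourth-order Runge-Kutta integrator (a choice made here for self-containment), and the names `hamiltonian`, `dephasing_rhs` and `evolve_step` are hypothetical. The defaults $\gamma=0.1$ , $\omega_{0}=1$ , $\vartheta={\mathrm{\pi}}/4$ and $\phi=0$ follow the values quoted in the text.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)


def hamiltonian(omega0, u):
    """H = omega0/2 * sigma_3 + u . sigma for a constant control vector u."""
    return 0.5 * omega0 * SZ + u[0] * SX + u[1] * SY + u[2] * SZ


def dephasing_rhs(rho, h, sigma_n, gamma):
    """Right-hand side of the dephasing master equation."""
    unitary = -1j * (h @ rho - rho @ h)
    noise = 0.5 * gamma * (sigma_n @ rho @ sigma_n - rho)
    return unitary + noise


def evolve_step(rho, u, dt, omega0=1.0, gamma=0.1,
                theta=np.pi / 4, phi=0.0, substeps=100):
    """Propagate rho over one control step of length dt (RK4, assumed)."""
    n = (np.sin(theta) * np.cos(phi),
         np.sin(theta) * np.sin(phi),
         np.cos(theta))
    sigma_n = n[0] * SX + n[1] * SY + n[2] * SZ
    h = hamiltonian(omega0, u)
    tau = dt / substeps
    for _ in range(substeps):
        k1 = dephasing_rhs(rho, h, sigma_n, gamma)
        k2 = dephasing_rhs(rho + 0.5 * tau * k1, h, sigma_n, gamma)
        k3 = dephasing_rhs(rho + 0.5 * tau * k2, h, sigma_n, gamma)
        k4 = dephasing_rhs(rho + tau * k3, h, sigma_n, gamma)
        rho = rho + (tau / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return rho


# Probe state (|0> + |1>)/sqrt(2) as a density matrix.
rho0 = 0.5 * np.ones((2, 2), dtype=complex)
# One step without control, dephasing along the z axis (theta = 0).
rho1 = evolve_step(rho0, (0.0, 0.0, 0.0), dt=1.0, theta=0.0)
```

With dephasing along $\hat{\sigma}_{3}$ and no control, the off-diagonal element decays as $|\hat{\rho}_{01}(t)|=\frac{1}{2}e^{-\gamma t}$ , which gives a quick check of the integrator.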

In Fig. 2 we present our numerical results on QFI under dephasing dynamics with $\vartheta={\mathrm{\pi}}/4$ , $\phi=0$ using square pulses. Figure 2a-c show the results for $\Delta T=0.1$ . Figure 2a shows the training process in terms of $F(T)/T$ as a function of the number of training epochs. The blue line shows results from the training using the A3C algorithm. The values of $F(T)/T$ corresponding to results from GRAPE and the case with no control are shown as the orange dotted line and grey dashed line, respectively. The red line shows results from “A3C+PPO”, an enhanced version of A3C which converges faster Schulman2017 . The details of this algorithm are explained in the Supplementary Methods. We can see that after sufficient training epochs, results from A3C exceed those for the case with no control, and approach the optimal results found by GRAPE. On the other hand, “A3C+PPO” converges more quickly to essentially the same result as A3C.

We select one training outcome from those with best performances in Fig. 2a and show $F(t)/t$ and the pulse profiles in Fig. 2b, c respectively. As can be seen from Fig. 2b, both GRAPE and A3C outperform the case with no control, while the results of A3C are comparable to those from GRAPE.

Figure 2d-f show results with a larger time step, $\Delta T=1$ . From the training results shown in Fig. 2d, we see that results from A3C occasionally exceed those from GRAPE, for example at training epochs of approximately 1600 and 3000. $F(t)/t$ and the pulse profile of one of the best performing results are again shown in Fig. 2e and f, and we see from Fig. 2e that A3C indeed outperforms GRAPE in this case.

We have discussed dephasing dynamics along a particular axis pertaining to Fig. 2, and the results for several other dephasing axes are shown in the Supplementary Discussion. We conclude from these results that in most cases, the A3C algorithm is capable of producing results comparable to those from GRAPE, while in selected situations (e.g. larger $\Delta T$ ) A3C may outperform GRAPE.

We now discuss the generalizability of the control sequences for quantum parameter estimation, a key result of this paper. Since the true value of $\omega_{0}$ is not known a priori, the control sequence has to be optimized for a chosen value of $\omega_{0}$ . When such a sequence is applied in situations with other $\omega_{0}$ values, the true value is still measured, but the resulting QFI is lower than when the optimal control for the true $\omega_{0}$ is used. In order to raise the QFI, one must then perform a second measurement using control sequences optimized for the estimated value of $\omega_{0}$ . The entire procedure therefore involves two steps, using different pulse sequences. This is fundamentally different from other typical measurements in quantum control, e.g. evaluation of fidelities of quantum gates Goerz2014 , for which there is no need for a second pulse sequence or a second measurement.

The dotted lines in the left column of Fig. 3 show the QFI resulting from measurements with the optimal control found for $\omega_{0}=1$ with GRAPE. Results without control are shown as grey dashed lines for comparison. The range of $\omega_{0}$ covers a period of $2{\mathrm{\pi}}/T$ . As expected, the QFI is largest at $\omega_{0}=1$ , but reduces as $\omega_{0}$ deviates from 1. As $\omega_{0}$ varies further, the QFI increases at some values of $\omega_{0}$ , which may be due to the geometric relationship between the phase corresponding to those $\omega_{0}$ values and the phase at $\omega_{0}=1$ . In any case, these QFI values are consistently lower than the value at $\omega_{0}=1$ . An obvious way to improve the QFI is to generate new optimal control sequences for each value of $\omega_{0}$ from GRAPE, but this is costly as the computational complexity scales as ${\cal O}(N^{3})$ . A detailed discussion on the computational complexity can be found in the Supplementary Discussion.

With A3C we have an efficient solution to this problem. We can train the neural network at $\omega_{0}=1$ , and use this particular network to generate control sequences for different $\omega_{0}$ values. The neural network is only trained at $\omega_{0}=1$ . However, the trained neural network works for a broad range of parameter values. There is no need to re-train the neural network with the updated estimation of the parameter. The computational cost is thus simply ${\cal O}(N)$ , so it is much more efficient than generating new sequences with GRAPE. These results from A3C are shown in the left column of Fig. 3 as blue solid lines, which represent the best-performing sequence from 100 trials generated from the trained neural network. For $\Delta T=0.1$ (Fig. 3a), although the QFI at the training value $\omega_{0}=1$ is slightly lower for A3C than for GRAPE, A3C demonstrates higher generalizability as the QFI decreases slowly when $\omega_{0}$ deviates from 1. For $\Delta T=1$ (Fig. 3c), the QFI of A3C is consistently higher than that of GRAPE except in a narrow range of $\omega_{0}$ around 0.65.

To further reveal the generalizability of different methods, we consider the measurement in an ensemble with $\omega_{0}$ uniformly distributed in $[1-\Delta\omega,1+\Delta\omega]$ . The performance of the quantum parameter estimation is therefore given by the average $F(T)/T$ ,

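$$\langle F(T)/T\rangle=\frac{1}{2\Delta\omega}\int_{1-\Delta\omega}^{1+\Delta\omega}F(T)/T\,{\rm d}\omega .\tag{6}$$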

These results are shown in the right column of Fig. 3, which are averages of the data in the corresponding panels in the left column. As seen from Fig. 3b ( $\Delta T=0.1$ ), $\langle F(T)/T\rangle$ for GRAPE is high at small $\Delta\omega$ but drops quickly as $\Delta\omega$ is increased. On the contrary, $\langle F(T)/T\rangle$ for A3C is lower than that for GRAPE at small $\Delta\omega$ , but decays much more slowly. As a consequence, $\langle F(T)/T\rangle$ for A3C exceeds that for GRAPE for $\Delta\omega\gtrsim 0.22$ . This result indicates that for measurements involving a reasonably varying parameter, A3C demonstrates higher generalizability. For $\Delta T=1$ , the results of A3C always exceed those of GRAPE, as seen from Fig. 3d. The result for A3C decays much more slowly than that for GRAPE, consistent with the $\Delta T=0.1$ case.
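The ensemble average over $\omega_{0}$ uniformly distributed in $[1-\Delta\omega,1+\Delta\omega]$ can be approximated numerically. The sketch below is an assumption-laden illustration: `ensemble_average` is a hypothetical helper, `f_over_t` stands for any callable returning $F(T)/T$ at a given $\omega_{0}$ , and the integral is evaluated with a simple trapezoid rule chosen here for clarity.

```python
import numpy as np


def ensemble_average(f_over_t, delta_omega, center=1.0, num=201):
    """Average f_over_t(omega) over [center - delta_omega, center + delta_omega].

    Implements (1 / (2 * delta_omega)) * integral of F(T)/T d(omega)
    with a trapezoid rule on a uniform grid of `num` points.
    """
    omegas = np.linspace(center - delta_omega, center + delta_omega, num)
    values = np.array([f_over_t(w) for w in omegas], dtype=float)
    widths = np.diff(omegas)
    integral = np.sum(0.5 * widths * (values[1:] + values[:-1]))
    return integral / (2.0 * delta_omega)


# Illustrative usage with a toy profile peaked at omega = 1 (not paper data).
def toy_profile(omega):
    return 1.0 - (omega - 1.0) ** 2


avg_toy = ensemble_average(toy_profile, delta_omega=0.3)
```

A profile that falls off quickly away from $\omega_{0}=1$ yields an average that drops as $\Delta\omega$ grows, which is the behaviour used in the text to compare the generalizability of the two methods.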

Intuitively, without control and noise, the optimal strategy is to prepare the initial probe state as $(|0\rangle+|1\rangle)/{\sqrt{2}}$ , since this state has the fastest rate of rotation under the Hamiltonian. Since the evolution of the state is also affected by dephasing, there is a competition between the parametrization and the effect of noise. When the evolution time is short, the parametrization dominates, in which case the control does not help much. However, in experimentally relevant situations the evolution time is typically long enough for noise to dominate. The controls are therefore useful as they can steer the states to regions where those states are less affected by the noise, even if such states may have a slower speed of parametrization. GRAPE and RL-based methods are both systematic ways to find controls; however, as we have demonstrated, A3C is more generalizable.

II.2 Spontaneous Emission

A process involving the spontaneous emission is described by the Lindblad master equation Liu2017 :

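$$\partial_{t}\hat{\rho}(t)=-i[\hat{H}(t),\hat{\rho}(t)]+\gamma_{+}\left[\hat{\sigma}_{+}\hat{\rho}(t)\hat{\sigma}_{-}-\frac{1}{2}\{\hat{\sigma}_{-}\hat{\sigma}_{+},\hat{\rho}(t)\}\right]+\gamma_{-}\left[\hat{\sigma}_{-}\hat{\rho}(t)\hat{\sigma}_{+}-\frac{1}{2}\{\hat{\sigma}_{+}\hat{\sigma}_{-},\hat{\rho}(t)\}\right],\tag{7}$$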

where $\hat{\sigma}_{\pm}=(\hat{\sigma}_{1}\pm i\hat{\sigma}_{2})/2$ and $\hat{H}$ is defined as Eq. (5). The relaxation rates are taken as $\gamma_{+}=0.1,\gamma_{-}=0$ throughout our discussion.

Figure 4 shows numerical results on QFI with spontaneous emission. Figure 4a-c are for $\Delta T=0.1$ , $T=10$ , and Fig. 4d-f show calculations with a larger time step $\Delta T=1$ , $T=20$ . Figure 4a, d [left column] show the A3C training processes, in which the results from GRAPE are indicated as orange dotted lines for reference. We see that “A3C+PPO” converges faster, and both A3C and “A3C+PPO” saturate to values slightly lower than GRAPE. Again, one of the best performing controls is picked out and the corresponding $F(t)/t$ and pulse profiles are shown in the middle and right columns respectively. From Fig. 4b, e we see that for the best result from A3C, the QFI is lower than, but comparable to, the results from GRAPE.

As in the case of dephasing dynamics, we consider the generalizability of different methods in a situation where $\omega_{0}$ is uniformly distributed in a range. Again, we use GRAPE to obtain optimal control sequences for $\omega_{0}=1$ and apply them to other values. For A3C, we trained the neural network at $\omega_{0}=1$ ; the resulting sequence is then used to obtain an estimate of the true $\omega_{0}$ value. A new sequence is then generated using the neural network already trained at $\omega_{0}=1$ with the estimated $\omega_{0}$ . The best-performing results out of 100 A3C outputs are shown as the blue solid lines in Fig. 5, while the results from GRAPE are shown as the orange dotted lines. The left column of Fig. 5 shows $F(T)/T$ as a function of $\omega_{0}$ for two $\Delta T$ values. In both cases, the GRAPE method outperforms A3C in a narrow neighborhood around $\omega_{0}=1$ , but its QFI decreases substantially as $\omega_{0}$ deviates further. On the other hand, A3C exhibits great generalizability: for $\Delta T=0.1$ the QFI does not decrease until $\omega_{0}$ is reduced to $\omega_{0}\lesssim 0.6$ , while for $\Delta T=1$ the QFI remains approximately the same for the entire range of $\omega_{0}$ considered. The averages of $F(T)/T$ over the range $[1-\Delta\omega,1+\Delta\omega]$ are shown in the right column of Fig. 5. In Fig. 5b, A3C outperforms GRAPE when $\Delta\omega\gtrsim 0.22$ , while in Fig. 5d, A3C outperforms GRAPE in an even larger range $\Delta\omega\gtrsim 0.07$ .

Overall we conclude that in the case of spontaneous emission, the A3C algorithm provides comparable results to GRAPE, although it cannot give higher QFIs. Nevertheless, A3C has much greater generalizability, as is consistent with the case concerning the dephasing dynamics.

II.3 Sequences with Gaussian Pulses

For all results shown above, the control sequences involve square pulses only. In practical experiments, shaped pulses are sometimes used. Therefore in this section we consider Gaussian pulses as an example. The total time $T$ is still divided into smaller pieces of length $\Delta T$ . However, at the $j$ th piece the piecewise constant pulse is replaced by a Gaussian centered on that piece and truncated at the ends:

$$u^{(j)}(t)=A^{(j)}\exp\{-[(t-t^{(j)})/\sigma^{{\rm g},(j)}]^{2}\},\tag{8}$$

where $A^{(j)}$ indicates the amplitude and $\sigma^{{\rm g},(j)}$ the flatness of the pulse. We demonstrate here that with the A3C method it is natural to accommodate non-boxcar pulses.
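A truncated Gaussian sequence of this kind, $u^{(j)}(t)=A^{(j)}\exp\{-[(t-t^{(j)})/\sigma^{{\rm g},(j)}]^{2}\}$ on the $j$ th piece, can be sampled as in the sketch below. The helper name `gaussian_sequence` is hypothetical, and two details are assumptions made for illustration: pieces are indexed from zero, and the centre $t^{(j)}$ is taken as the midpoint $(j+\frac{1}{2})\Delta T$ of the piece.

```python
import numpy as np


def gaussian_sequence(amplitudes, widths, dt, samples_per_piece=51):
    """Sample a sequence of Gaussian pulses, one per piece of length dt.

    Each pulse is truncated at the edges of its own piece, so the pulses
    do not overlap. Returns the sample times and the control values.
    """
    amplitudes = np.asarray(amplitudes, dtype=float)
    widths = np.asarray(widths, dtype=float)
    n = len(amplitudes)
    # Sample times inside one piece (midpoints of equal sub-intervals).
    local = (np.arange(samples_per_piece) + 0.5) * dt / samples_per_piece
    times = np.arange(n)[:, None] * dt + local[None, :]
    # Assumed pulse centres: the midpoint of each piece.
    centers = (np.arange(n) + 0.5) * dt
    shape = np.exp(-((times - centers[:, None]) / widths[:, None]) ** 2)
    values = amplitudes[:, None] * shape
    return times.ravel(), values.ravel()


# Illustrative usage: two pulses with different amplitude and flatness.
t_grid, u_grid = gaussian_sequence([2.0, -1.0], [0.3, 0.5], dt=1.0)
```

In an RL setting of the kind described here, the agent's action at step $j$ would then supply the pair $(A^{(j)},\sigma^{{\rm g},(j)})$ instead of a single constant amplitude.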

In Fig. 6 we show A3C results using Gaussian pulses and compare them to GRAPE results using square pulses. Figure 6a-c show results under dephasing dynamics with $\vartheta={\mathrm{\pi}}/4$ , and Fig. 6d-f show results under spontaneous emission. In both cases $\Delta T=1$ , $T=10$ . For dephasing dynamics, our best results from A3C outperform GRAPE, as is also the case for square pulses generated by A3C. For spontaneous emission, our best performing result has a QFI value slightly lower than those from GRAPE with square pulses, but their values are very close. These results indicate that the A3C method can naturally accommodate pulses other than square ones. We note that our use of Gaussian pulses is theoretical, and in practical situations, experimentally more relevant ones such as the Blackman pulses Goerz2014 should be used. These shaped pulses are implemented by introducing constraints to the gradient in GRAPE Skinner2010 or by modifying the action from the RL agent directly.

III Discussion

The generalizability of RL, sometimes called “generalization” in the literature, is an actively studied topic in computer science, for example on problems related to game playing where the RL agent trained under one level of the game can be used to clear other levels Pathak2017 ; Burda2018 ; Nichol2018 ; Cobbe2018 . While the reason why RL is generalizable is not completely clear, one suggestion is that it likely arises from the underfitting by the neural network to the training data Mackay2003 , which is supported by studies showing that reducing overfitting improves generalizability Cobbe2018 .

The generalizability in fact has a much wider scope than what has been studied here. In the so-called “transfer learning” Taylor2009 , experiences gained from one training of the RL agent can be used to improve its performance on different but related tasks by, for example, minimal updates of the network parameters. In contrast, our method does not alter the network parameters but only applies the trained neural network in new RL environments with different values of the parameter to be estimated. We therefore believe that RL can be made even more generalizable by further studies involving more sophisticated algorithms.

To summarize, RL, in particular the A3C algorithm, is capable of finding control protocols that enhance the QFI in a way comparable to the traditionally used GRAPE method, and is in certain situations superior to GRAPE, e.g. for pulse sequences with larger time steps. Moreover, RL can naturally accommodate non-boxcar pulse shapes. Nevertheless, the key advantage afforded by RL is the generalizability, namely the neural network trained for one estimated parameter value can efficiently generate pulse sequences that provide reasonably enhanced QFI for a broad range of parameter values, while in order to achieve the same level of QFI the GRAPE algorithm has to be applied in full each time the parameter estimate is updated. Our results therefore suggest that RL-based methods can be powerful alternatives to commonly used gradient-based ones, capable of finding control protocols that could be more efficient in practical quantum parameter estimation.

Methods

In this section we describe the RL framework shown in Fig. 1. We also provide an extensive review of the RL methods and the details of the implementation in the Supplementary Methods.

Figure 1a shows the RL agent, which takes an action as prescribed by a neural network. In our problem, the action is essentially the control field which steers the qubit according to the master equation, Eq. (2), and the resulting state of the evolution determines the reward the agent receives. In practice, the reward encodes the QFI, i.e. a higher reward is obtained when the control gives a greater QFI.

The action taken by the agent implies a time evolution of the quantum state according to Eq. (2) with the control field, $u_{k}(t)$ . All possible actions therefore form a continuous set. We solve this problem using the Actor-Critic algorithm sutton2018 , as shown in Fig. 1b. Such an algorithm is particularly suitable for our problem as it can treat continuous actions. The key of the algorithm is that the neural network is updated using not only the reward but also a state value, the latter of which greatly improves the efficiency of the training procedure. At each time step, the neural network takes the density matrix of the quantum state as an input, and outputs both an action and a state value which assesses how likely the state is to lead to a larger QFI. The state is then evolved using the output action, yielding the new state and QFI, which is then incorporated into the reward. The reward and state value combine into a so-called “loss function” that provides feedback, by updating the neural network, for the RL agent to make better decisions. The RL agent takes the new quantum state and repeats the above step until time $T$ is reached, concluding one “episode” of training. After that, the quantum state is reset before the next episode begins. A completed episode outputs a pulse profile by sequencing the actions taken in each time step.
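Two ingredients of this loop, the network input and the reward, can be illustrated with a short sketch based on the state vector and reward expressions listed among the page's equations and on the values $C=10$ and $\eta=1.001$ from Supplementary Table S-I. The function names `encode_state` and `reward` are hypothetical, and the reference quantity $F_{0}$ is not defined in this part of the text; the sketch simply treats it as a positive baseline QFI supplied by the caller, which is an assumption.

```python
import numpy as np


def encode_state(rho):
    """Flatten a 2x2 density matrix into the 8 real numbers fed to the network.

    Ordering follows the listed state vector:
    (Re, Im) of rho_00, rho_10, rho_01, rho_11.
    """
    rho = np.asarray(rho, dtype=complex)
    order = [(0, 0), (1, 0), (0, 1), (1, 1)]
    features = []
    for row, col in order:
        features.extend([rho[row, col].real, rho[row, col].imag])
    return np.array(features)


def reward(f, f0, step, n_steps, c=10.0, eta=1.001):
    """Reward for reaching time step `step` (1-based) out of `n_steps`.

    r = (F - eta * F0) / F0, multiplied by C on the final step.
    f0 is an assumed positive baseline QFI at the same step.
    """
    r = (f - eta * f0) / f0
    return r * c if step == n_steps else r


# Illustrative usage with made-up numbers (not results from the paper).
rho_example = np.array([[0.5, 0.5 - 0.1j], [0.5 + 0.1j, 0.5]])
s_example = encode_state(rho_example)
r_mid = reward(2.0, 1.0, step=3, n_steps=10)
r_last = reward(2.0, 1.0, step=10, n_steps=10)
```

Because $\eta$ is slightly larger than 1, a control that merely reproduces the baseline receives a small negative reward, so only improvements over the baseline are reinforced.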

In order to improve the efficiency of computation, we used a parallel version of the Actor-Critic algorithm called Asynchronous Advantage Actor-Critic (A3C) algorithm Mnih2016 . In this case, several copies of the agent and environment (called local agents and environments) run in parallel, and as each of them finishes one episode, the solution is delivered to a global agent for further optimization. The optimal policy among these results is then regarded as the output from one “epoch” of training, i.e. one epoch involves several episodes of training from different local agents. Since different local agents deliver their results at different times, the procedure is asynchronous. The details of both the Actor-Critic and the A3C algorithm are described in the Supplementary Methods, as well as the pseudo-code describing the implementation of the algorithm.

Data availability

The datasets generated during this study are available from the corresponding author upon reasonable request.

Code availability

The code used to generate data is available from the corresponding author upon reasonable request.

Acknowledgements

This work is supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Grant Nos. CityU 21300116, CityU 11303617, CityU 11304018, CUHK 14207717), the National Natural Science Foundation of China (Grant Nos. 11874312, 11604277, 11874292, 11729402, 11574238), the Guangdong Innovative and Entrepreneurial Research Team Program (Grant No. 2016ZT06D348), and the Key R&D Program of Guangdong province (Grant No. 2018B030326001).

Author contributions

X.W. and H.Y. conceived the project, H.X. and J.L. performed calculations. All authors discussed the results and implications at all stages and wrote the paper.

I Supplementary Methods

I–S1 Reinforcement learning

The Reinforcement Learning (RL) framework is schematically shown in Fig. 1a. The key ingredients of the RL process include a state space $\mathcal{S}$ , an action space $\mathcal{A}$ , and a reward $\mathcal{R}$ sutton2018 . In the RL procedure, an agent at state $s_{j}\in\mathcal{S}$ chooses an action $a_{j}\in\mathcal{A}$ according to a probabilistic policy $\pi_{\theta}(a_{j}|s_{j})$ where $\theta$ represents parameters of the policy. For example, when using a neural network to represent the policy, $\theta$ represents the weights and biases of the neural network. The action $a_{j}$ results in a new state $s_{j+1}$ according to which the agent receives a numerical reward $r_{j+1}\in\mathcal{R}$ . For a given optimization problem, one encapsulates the goal of the problem into the calculation of the rewards, as well as relevant constraints in the available states and actions. In practice, the reward for a given state is not only related to its immediate next step, but several steps in its future, so the total discounted reward for $s_{j}$ , a key quantity, is given by

[TABLE]

where $\alpha\in(0,1]$ is the reward decay rate, which sets the relative weight between adjacent steps in calculating the total discounted reward received at a given step. When $\alpha=1$ the rewards from all future steps contribute equally, while when $\alpha\rightarrow 0$ the immediate next step provides the dominant contribution. The probability that the agent takes a certain action is then enhanced or suppressed according to the value of the total discounted reward. After sufficient iterations of training, the agent learns the optimal actions to take in order to maximize the total discounted reward, thereby giving an optimal solution to the desired problem.
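As a minimal sketch of how such a discounted sum can be accumulated over a finite episode, the function below computes $r_{j+1}+\alpha r_{j+2}+\alpha^{2}r_{j+3}+\ldots$ for every step $j$ by a backward recursion. The function name, the list-based storage of rewards and the truncation at the end of the episode are illustrative assumptions, not the authors' implementation.

```python
def discounted_returns(rewards, alpha):
    """Return the total discounted reward R_j for each step j of an episode.

    Assumption: rewards[j] holds r_{j+1}, the reward received after the
    action taken at step j, and alpha is the reward decay rate in (0, 1].
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Backward recursion: R_j = r_{j+1} + alpha * R_{j+1}
    for j in reversed(range(len(rewards))):
        running = rewards[j] + alpha * running
        returns[j] = running
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], alpha=0.5))
```

With $\alpha=1$ every entry is simply the sum of all remaining rewards, while a small $\alpha$ makes each entry close to the immediate reward, in line with the two limits described above.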

In the RL procedure, the exploration of the agent in the state and action spaces is summarized into a sequence $s_{0},a_{0},r_{1},s_{1},a_{1},r_{2},\ldots,s_{k},a_{k},r_{k+1},\ldots$ , called a trajectory. To determine the best action to take at state $s$ , we define the state-action value function,

[TABLE]

where the expectation includes discounted rewards of all the trajectories after taking the action $a$ at the state $s$ in the $j$ th step of the trajectory, provided that the policy $\pi$ is observed thereafter sutton2018 . We also define the value of a state to evaluate the likelihood that a given state would lead to a higher reward,

[TABLE]

where the expectation includes discounted rewards of all the trajectories starting from the state $s$ in the $j$ th step, provided that the policy $\pi$ is followed thereafter sutton2018 .

An RL policy $\pi$ is declared “optimal” when the actions selected by the policy in each state are such that the resulting expectation value of discounted rewards for all states $s\in\mathcal{S}$ is no less than that from any other policy $\pi^{\prime}$ , i.e. $V^{\pi}(s)\geq V^{\pi^{\prime}}(s)$ sutton2018 . Corresponding to the optimal policy $\pi_{\theta^{*}}(a|s)$ , the optimal value functions are

[TABLE]

where the notations $\theta^{*}$ and $\theta_{\rm v}^{*}$ represent optimal choices of the neural network parameters for the policy and value functions. If the optimal value functions are known, the RL agent simply chooses the action $a_{j}$ that has the largest state-action value $Q^{*}(s_{j},a_{j})$ in state $s_{j}$ . Alternatively, at state $s_{j}$ one may choose the next state $s_{j+1}$ that has the largest state value $V^{*}(s_{j+1})$ . Thus, there are two ways for an RL algorithm to solve an optimization problem: the agent either learns the optimal policy, or if the policy is otherwise specified, the optimal value functions [SKonda2003]. The two methods are discussed below.

In the so-called value-based method, the RL agent learns optimal value functions. The state value function and state-action value function are solved iteratively using the Bellman equations,

[TABLE]

where $a^{\prime}$ represents all possible actions in the next state $s^{\prime}$ sutton2018 . We define the loss functions as

[TABLE]

where $R_{j}^{n}=\sum_{k=1}^{n}\alpha^{k-1}r_{j+k}$ is called the “ $n$ -step” return sutton2018 . We take the $\varepsilon$ -greedy policy commonly used in deep Q-learning network Mnih2015 ; Zhang2018 as an example. Under this policy, the RL agent does either of the two things at state $s_{j}$ : with probability $1-\varepsilon$ the agent takes the action $a_{j}$ that maximizes $Q^{\pi}(s_{j},a_{j})$ , or with probability $\varepsilon\in(0,1]$ an action is randomly chosen. The latter mechanism encourages the agent to explore a wider range in the search space to reach a globally optimal solution. In practice, $Q^{\pi}(s,a)$ in the loss function Eq. (S-8) is the prediction by the neural network and $r+\alpha\max_{a^{\prime}}Q^{\pi}(s^{\prime},a^{\prime})$ is calculated from the trajectories of the RL agent. The training procedure of the neural network is essentially minimization of the loss function, during which the state-action values given by the neural network are improved.
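The two ingredients just described, the $n$-step return and the $\varepsilon$-greedy choice, can be sketched in a few lines of plain Python. The function names, the indexing convention for the reward list and the truncation at the end of an episode are assumptions made for illustration only.

```python
import random


def n_step_return(rewards, j, n, alpha):
    """R_j^n = sum_{k=1}^{n} alpha^(k-1) * r_{j+k}.

    Assumption: rewards[i] holds r_{i+1}, so r_{j+k} is rewards[j + k - 1].
    The sum is cut short if the episode ends before n steps.
    """
    total = 0.0
    for k in range(1, n + 1):
        idx = j + k - 1
        if idx >= len(rewards):
            break
        total += alpha ** (k - 1) * rewards[idx]
    return total


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy branch: index of the largest state-action value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Here `q_values` stands for the network's predictions $Q^{\pi}(s_{j},a)$ over a discrete set of actions, which is exactly the setting in which this policy is straightforward to apply.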

We note that in the value-based algorithm, the policy is fixed and only the value functions are updated, which may not be sufficient to find a globally optimal solution [SKonda2003]. More importantly, the way the action space and trajectories are stored assumes that the actions are discrete, and it becomes far more complicated to treat problems with continuous actions, as is the case for control fields. As discussed below, the policy-based algorithm is more suitable for our problem.

The policy-based algorithm directly updates the policy parameters $\theta$ without the need of storing a large amount of RL trajectories. A typical form of the loss function is defined as [SWilliams1992]

[TABLE]

where $A_{j}$ is the advantage function,

[TABLE]

which evaluates the advantage of the chosen trajectory, with the baseline function $b(s_{j})$ , normally being the estimated state value function, that reduces the variance and speeds up the learning process sutton2018 ; Mnih2016 . When the value of $A_{j}$ is large for an action $a_{j}$ , minimizing $L$ increases $\pi_{\theta}(a_{j}|s_{j})$ , implying that the probability of choosing the action $a_{j}$ in state $s_{j}$ is increased.
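To make the last statement concrete, the sketch below evaluates a REINFORCE-style loss of the form $-\sum_{j}\log\pi_{\theta}(a_{j}|s_{j})\,(R_{j}-b(s_{j}))$ for a Gaussian policy. Both the explicit form of the loss and the Gaussian parametrization are assumptions chosen to match the verbal description; the function names are hypothetical.

```python
import math


def gaussian_log_prob(action, mu, sigma):
    """Log-density of a normal distribution N(mu, sigma^2) at `action`."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (action - mu) ** 2 / (2.0 * sigma ** 2))


def policy_loss(log_probs, returns, baselines):
    """Assumed loss: -sum_j log pi(a_j|s_j) * (R_j - b(s_j))."""
    return -sum(lp * (R - b) for lp, R, b in zip(log_probs, returns, baselines))


if __name__ == "__main__":
    action, ret, base = 1.0, 3.0, 1.0  # positive advantage of 2
    far = policy_loss([gaussian_log_prob(action, 0.0, 1.0)], [ret], [base])
    near = policy_loss([gaussian_log_prob(action, 1.0, 1.0)], [ret], [base])
    print(far, near)  # the loss is lower when the policy mean is at the action
```

For a positive advantage, shifting the policy mean toward the sampled action lowers the loss, so a minimizer makes that action more probable; for a negative advantage the opposite holds.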

In our problem, a quantum state is completely described by the density matrix $\hat{\rho}^{(j)}$ for each time step $j$ . Therefore our state in the RL procedure is defined using elements of the density matrix as

[TABLE]

Our action space is formed by a set of control fields $\left(u_{1}^{(j)},u_{2}^{(j)},...,u_{p}^{(j)}\right)\equiv a_{j}\in\mathcal{A}$ , which steers our quantum state $s_{j}$ to $s_{j+1}$ according to the master equation Eq. (2). Upon evaluation of the new state $s_{j+1}$ , the agent obtains the single-step reward:

[TABLE]

where $F$ and $F_{0}$ are the corresponding QFI from Eq. (3) with and without control, respectively. $\eta\geq 1$ and $C\geq 1$ are constant parameters used in the training process. $\eta$ ensures a non-zero reward to the agent in case the RL agent would apply $u_{1,2,3}(t)=0$ , while $C$ gives an extra significance to the last evolution step. After an episode of training, the action sequence in each trajectory constitutes the control field. We also note that our choice of the reward function is not unique.

I–S2 Actor-critic algorithm

The Actor-Critic algorithm combines the advantages of policy-based and value-based methods. Figure 1b illustrates the basic procedure of the Actor-Critic algorithm. Two neural networks are involved: the actor network governing the policy that chooses actions, and the critic network managing the value functions, which in turn changes the baseline function used in further policy-making sutton2018 . More specifically, the state value $V^{\pi}(s)$ generated by the critic network is plugged into Eq. (S-11),

[TABLE]

Note that the “ $n$ -step” return is used instead of $R_{j}$ so that only the $n$ future steps are involved. This is the key distinction from the policy-based method sutton2018 . In the training process, the actor and the critic networks minimize the loss function simultaneously. We update the critic network through Eq. (S-9) while the actor network is trained through Eq. (S-10) using the advantage function defined by Eq. (S-16).
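A hedged sketch of an $n$-step advantage estimate is given below. It uses the form common in A3C implementations, $A_{j}=R_{j}^{n}+\alpha^{n}V(s_{j+n})-V(s_{j})$ ; the bootstrap term $\alpha^{n}V(s_{j+n})$ and the function name are assumptions, since the exact expression used by the authors is not reproduced here.

```python
def n_step_advantage(rewards, values, j, n, alpha):
    """Assumed form: A_j = R_j^n + alpha^n * V(s_{j+n}) - V(s_j).

    rewards[i] holds r_{i+1}; values[i] holds the critic estimate V(s_i),
    with len(values) == len(rewards) + 1 (terminal state value last).
    """
    n = min(n, len(rewards) - j)  # truncate at the end of the episode
    # n-step return: sum_{k=1}^{n} alpha^(k-1) * r_{j+k}
    n_step = sum(alpha ** k * rewards[j + k] for k in range(n))
    return n_step + alpha ** n * values[j + n] - values[j]


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    values = [0.5, 0.5, 0.5, 0.0]
    print(n_step_advantage(rewards, values, j=0, n=2, alpha=0.5))
```

A positive result means the sampled actions did better than the critic expected from $s_{j}$ , which is the signal the actor uses to reinforce them.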

In order to improve the efficiency of the learning process, a parallelized version of the Actor-Critic algorithm called A3C, short for Asynchronous Advantage Actor-Critic Mnih2016 , is implemented in our calculation.

I–S3 Asynchronous advantage actor-critic algorithm

The key structure of Asynchronous Advantage Actor-Critic (A3C) is sketched in Fig. S1. The desired policy and value functions are generated by the neural network (left column in Fig. S1), called the “global” network. The neural network is composed of the state value network $V^{\pi}(s)$ (orange color), the policy network $\pi(a|s)$ (green color) and the fully-connected linear layers (blue color). At the beginning of the training process, we make $N^{\mathrm{env}}$ copies of the global network, called “local” networks. Then, each of the local networks is allowed to run in an independent RL environment, in which the RL agents, called the “local” agents, optimize policies and value functions via gradients with respect to the loss functions. At the end of a training episode for each parallel RL procedure, the local agent uploads the accumulated gradient to update the global network. Then, the updated global network is downloaded back to the local environment, starting a new episode with the environment properly reset. Note that in the entire process, all local agents act independently, which is why the algorithm is asynchronous [Mnih2016, Sa2c].
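The upload and download cycle described above can be illustrated with a deliberately simplified toy model: a single scalar parameter, a quadratic loss, and workers that take turns in a fixed order instead of running in parallel threads. Everything in this sketch (the loss, the learning rate, the round-robin schedule and the names) is a hypothetical stand-in used only to show how gradients computed from stale local copies still drive the global parameter.

```python
def simulate_async_updates(n_workers=4, n_pushes=200, lr=0.05, target=3.0):
    """Toy parameter sharing on the loss (theta - target)^2.

    Each worker holds a possibly stale local copy of the global parameter,
    computes a gradient from that copy, uploads it to the global parameter,
    and then downloads the updated global value.
    """
    global_theta = 0.0
    local_theta = [global_theta] * n_workers  # initial copies of the global value
    for step in range(n_pushes):
        w = step % n_workers                    # worker finishing its episode
        grad = 2.0 * (local_theta[w] - target)  # gradient from the local copy
        global_theta -= lr * grad               # upload: update the global parameter
        local_theta[w] = global_theta           # download: refresh the local copy
    return global_theta


if __name__ == "__main__":
    print(simulate_async_updates())
```

In an actual A3C run the order in which workers finish is not fixed, and the gradient is accumulated over a full episode of a neural-network loss; the fixed schedule here only keeps the example reproducible.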

We now give details of our implementation of the A3C algorithm. The RL states are first fed through 4 hidden layers, each comprising 200 ReLU units [SPaszke2017]. The resulting outputs are then passed to both the value and policy networks. The value network is constructed from one hidden layer with 200 ReLU units and one fully-connected linear layer outputting a real number as the state value. The policy network has one hidden layer with 200 ReLU units and two fully-connected linear layers as output layers. The outputs are six real numbers $\mu_{k}$ , $\sigma^{\rm G}_{k}$ , $k=1,2,3$ forming three normal distributions $N(\mu_{k},\sigma^{\rm G}_{k})$ . Here, $\mu_{k}$ is modified by the SoftShrink $(\lambda)$ activation function with $\lambda=0.25$ and $\sigma^{\rm G}_{k}$ is modified by the SoftPlus activation function [SPaszke2017]. The continuous actions $u_{k}$ are randomly sampled from those normal distributions.
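The final stage of the policy network, from raw outputs to sampled control amplitudes, can be sketched without any deep-learning library. The activation functions below follow the standard definitions of SoftShrink and SoftPlus; treating $\sigma^{\rm G}_{k}$ as a standard deviation and the function names themselves are assumptions of this sketch.

```python
import math
import random


def softshrink(x, lam=0.25):
    """SoftShrink: shift x toward zero by lam; output 0 inside [-lam, lam]."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def softplus(x):
    """SoftPlus, log(1 + exp(x)), written in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_controls(raw_mu, raw_sigma, rng=random):
    """Map six raw network outputs to three sampled control amplitudes u_k."""
    controls = []
    for m, s in zip(raw_mu, raw_sigma):
        mu = softshrink(m)
        sigma = softplus(s)  # assumption: interpreted as a standard deviation
        controls.append(rng.gauss(mu, sigma))
    return controls


if __name__ == "__main__":
    print(sample_controls([0.5, -0.1, 1.0], [0.0, -1.0, 0.5], random.Random(0)))
```

SoftShrink maps small raw means to exactly zero, so weak control amplitudes are centered on zero, while SoftPlus keeps the width of each distribution strictly positive.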

We use the negative differential entropy of the normal distribution, $-\frac{1}{2}(\log(2\pi\sigma^{2})+1)$ , as the entropy regularization term to encourage the agent to explore the entire search space. We use the RMSProp optimizers with shared parameters that are updated asynchronously among parallel environments Mnih2016 . We keep the choice of hyper-parameters, which are listed in the left column of Table S-I, similar to those used in Mnih2016 . The pseudocode for A3C can be found in Mnih2016 . Next we discuss an optimized version of the code, i.e. one using the Proximal Policy Optimization (PPO) algorithm [Schulman2017, SHeess2017].
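The expression quoted above is the negative of the differential entropy of a normal distribution with standard deviation $\sigma$ . A minimal sketch of how such a term could enter a loss is shown below; summing over the three control channels and the name `weight` for the entropy weight are assumptions made for illustration.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * (log(2*pi*sigma^2) + 1)."""
    return 0.5 * (math.log(2.0 * math.pi * sigma ** 2) + 1.0)


def entropy_regularizer(sigmas, weight):
    """Assumed loss term: -weight * (sum of entropies over control channels)."""
    return -weight * sum(gaussian_entropy(s) for s in sigmas)


if __name__ == "__main__":
    for sigma in (0.1, 1.0, 2.0):
        print(sigma, gaussian_entropy(sigma))
```

Because the entropy grows with $\sigma$ , minimizing a loss that contains this term pushes against a premature collapse of the policy onto a narrow distribution.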

Generally, optimization with the logarithm of the policy gradient leads to large policy updates which, in some cases, make the learning process unstable. The Proximal Policy Optimization (PPO) algorithm replaces the logarithm in Eq. (S-10) with the probability ratio between the old and the new policy:

[TABLE]

and the loss function is also truncated at certain values of the probability ratio Schulman2017 . Algorithm 1 shows the pseudocode for the A3C algorithm utilizing the PPO strategy. In this algorithm, we replace the global RMSProp optimizer with thread-specific Adam optimizers [SPaszke2017]. The right column of Table S-I lists the hyper-parameters in the A3C algorithm with PPO strategy.
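The truncation can be illustrated with the clipped surrogate objective of Schulman2017. The sketch below is a plain-Python version of that standard form; the clipping range `clip_eps=0.2` is a placeholder value and is not taken from Table S-I, and the function name is hypothetical.

```python
import math


def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


if __name__ == "__main__":
    # Ratio of 2 with positive advantage: the gain is capped at 1 + clip_eps
    print(ppo_clipped_loss([math.log(2.0)], [0.0], [1.0]))
```

Once the ratio leaves the interval $[1-\epsilon,1+\epsilon]$ in the direction favored by the advantage, the loss stops rewarding further change, which limits the size of a single policy update.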

We have used PyTorch [SPaszke2017] to implement the algorithms and QuTiP [SJohansson2012, SJohansson2013] to obtain numerical solutions of Eqs. (2)-(3). We also note that in practice, when $\Delta T=1$ , we have to set smaller learning rates, gradient norm, entropy weight and $N^{\mathrm{ppo}}_{\mathrm{max}}$ .

II Supplementary Discussion

II–S1 Computational complexity

In our discussion, the computational complexity refers to the time complexity which depends on the number of elementary operations performed during the execution of the algorithm. For the optimal control problem we considered, the evolution time between $0$ and $T$ is discretized into $N$ equal time steps. In most cases, we employ piecewise constant pulse sequences so if we want to compute the evolution of a density matrix from time 0 to $T$ in $N$ time steps with piecewise constant pulses, we need to compute the master equation at $N$ time intervals and the time complexity scales with $N$ . Accordingly, we compare the time complexity of A3C and GRAPE with respect to a system size of $N$ .

In one episode of A3C, we take the probe state as the input to the RL algorithm, which keeps running until time $T$ is reached. During this process, we have used the master equation $N$ times in the RL environment. The computational complexity is therefore $\mathcal{O}(N)$ . On the other hand, the time cost of training the neural network depends on the network structure (number of neurons, layers etc.), which is irrelevant to GRAPE. Therefore, for the purpose of comparing to GRAPE, the time complexity $\mathcal{O}(N)$ includes the cost of training, which adds a prefactor dependent on the details of the network. For GRAPE, according to the analytical results of the gradient of QFI in Ref. Liu2017 , we need to compute the evolution of the density matrix $N^{2}$ times to numerically evaluate the gradient with respect to the control $u_{k}^{(j)}$ at time step $j$ . Thus, computing the gradient of QFI with respect to $u_{k}^{(j)}$ has complexity $\mathcal{O}(N^{2})$ . During one iteration of GRAPE, we want to update $N$ piecewise controls so the complexity further increases to $\mathcal{O}(N^{3})$ . One should note that optimizing the QFI is computationally more expensive than optimizing the fidelity with GRAPE Khaneja2005 .

We verify our results on a PC with a standard multi-core CPU and plot the wall-clock time costs as functions of the system size $N$ in Fig. S2. It shows that the wall-clock time costs of one training epoch of A3C and one iteration of GRAPE follow the scaling $\mathcal{O}(N)$ and $\mathcal{O}(N^{3})$ respectively, as expected.

We now take into account the number of training epochs (A3C) or iterations (GRAPE). In actual implementations, the number of training epochs in the A3C algorithm is usually $\sim 10^{3}$ , while the number of iterations in GRAPE is typically between $10^{1}$ and $10^{2}$ . For small $N$ ( $N\lesssim 10$ ), a full execution of GRAPE can be faster than A3C due to its smaller number of iterations. However, this case corresponds to a larger $\Delta t$ for which we know that the result of A3C may outperform GRAPE in QFI. Therefore we summarize the comparison as follows: For small $N$ , A3C may be slower than GRAPE but can produce results with higher QFI, and is more generalizable. For large $N$ , A3C is overall faster than GRAPE, producing results with QFI comparable to (but not exceeding) that of GRAPE, and is more generalizable. We believe it is fair to conclude that A3C is more efficient in more experimentally relevant cases, i.e. having larger $N$ or when generalizability is desired.
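The crossover argument can be checked with a rough operation count. The sketch below counts only master-equation solves, sets every prefactor to one (in particular it ignores the network training cost and the number of parallel local agents), and uses the epoch and iteration numbers quoted above; the function names are hypothetical.

```python
def a3c_cost(n_steps, n_epochs=1000):
    """Master-equation solves for A3C: O(N) per epoch, prefactor set to 1."""
    return n_epochs * n_steps


def grape_cost(n_steps, n_iterations=100):
    """Master-equation solves for GRAPE: O(N^3) per iteration."""
    return n_iterations * n_steps ** 3


def crossover(n_epochs=1000, n_iterations=100, n_max=1000):
    """Smallest N at which the A3C count drops below the GRAPE count."""
    for n in range(1, n_max + 1):
        if a3c_cost(n, n_epochs) < grape_cost(n, n_iterations):
            return n
    return None


if __name__ == "__main__":
    print(crossover(n_iterations=100))  # GRAPE with 10^2 iterations
    print(crossover(n_iterations=10))   # GRAPE with 10^1 iterations
```

Under these crude assumptions the crossover lies at $N=4$ for $10^{2}$ GRAPE iterations and at $N=11$ for $10^{1}$ iterations, consistent with the statement that GRAPE can be faster only for $N\lesssim 10$ .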

II–S2 Additional results on dephasing dynamics

In the main text, we have provided results of quantum parameter estimation under dephasing dynamics along a chosen axis in Fig. 2, i.e. $\vartheta=\pi/4$ . Here, we present results along two other axes: parallel dephasing $(\vartheta=0)$ and transverse dephasing $(\vartheta=\pi/2)$ . In Fig. S3, the training process is shown in the upper row, $F(T)/T$ vs. $\omega_{0}$ in the middle row, and the average $F(T)/T$ in $[1-\Delta\omega,1+\Delta\omega]$ in the bottom row. For parallel dephasing, our results are very similar to the $\vartheta=\pi/4$ results shown in the main text, namely $F(T)/T$ calculated from A3C is lower than that from GRAPE only in a narrow range of $\Delta\omega$ . For $\Delta T=0.1$ , A3C outperforms GRAPE when $\Delta\omega\gtrsim 0.15$ , while for $\Delta T=1$ , A3C is better than GRAPE in a wider range, $\Delta\omega\gtrsim 0.05$ . For transverse dephasing, the situation is slightly more complicated (note that analytical solutions Liu2017 are provided as references). When $\Delta T=1$ , results from GRAPE have very low $F(T)/T$ , thus A3C always outperforms GRAPE. However, for $\Delta T=0.1$ , A3C does not possess considerable advantages. For $0\leq\Delta\omega\lesssim 0.4$ , the A3C results have lower $F(T)/T$ than GRAPE, albeit being very close. For $\Delta\omega\gtrsim 0.4$ , the A3C results are only slightly higher than GRAPE. These calculations therefore suggest that the generalizability of our method is superior to that of GRAPE in most situations, in particular for cases with larger time step $(\Delta T)$ . Nevertheless, in some situations, usually associated with smaller $\Delta T$ , our method would not provide considerable improvement. One therefore has to be judicious in choosing appropriate methods for a specific problem. For example, if generalizability is not desired, GRAPE may be more appropriate for pulse sequences with smaller time steps. On the other hand, if pulse sequences have larger time steps, or generalizability becomes important in the problem, the A3C method is preferable.

References

(1) Konda, V. R. & Tsitsiklis, J. N. On actor-critic algorithms. SIAM J. Control Optim. 42, 1143–1166 (2003).

(2) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 229–256 (1992).

(3) Seita, D. Actor-Critic methods: A3C and A2C. https://danieltakeshi.github.io/2018/06/28/a2c-a3c/. Accessed April 19, 2019.

(4) We note that “asynchronism” is not a necessary condition since one may train the agents synchronously using batched experiments in a parallel fashion [SWu2017].

(5) Paszke, A. et al. Automatic differentiation in PyTorch. In NIPS-W (2017).

(6) Heess, N. et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286v2 (2017).

(7) Johansson, J. R., Nation, P. D. & Nori, F. QuTiP: An open-source Python framework for the dynamics of open quantum systems. Comput. Phys. Commun. 183, 1760–1772 (2012).

(8) Johansson, J. R., Nation, P. D. & Nori, F. QuTiP 2: A Python framework for the dynamics of open quantum systems. Comput. Phys. Commun. 184, 1234–1240 (2013).

(9) Wu, Y., Mansimov, E., Liao, S., Grosse, R. & Ba, J. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. arXiv preprint arXiv:1708.05144v2 (2017).

Bibliography

(1) Kolobov, M. I. The spatial behavior of nonclassical light. Rev. Mod. Phys. 71, 1539 (1999).
(2) Lugiato, L., Gatti, A. & Brambilla, E. Quantum imaging. J. Opt. B-Quantum Semicl. Opt. 4, S176 (2002).
(3) Morris, P. A., Aspden, R. S., Bell, J. E., Boyd, R. W. & Padgett, M. J. Imaging with a small number of photons. Nat. Commun. 6, 5913 (2015).
(4) Roga, W. & Jeffers, J. Security against jamming and noise exclusion in imaging. Phys. Rev. A 94, 032301 (2016).
(5) Tsang, M., Nair, R. & Lu, X.-M. Quantum theory of superresolution for two incoherent optical point sources. Phys. Rev. X 6, 031033 (2016).
(6) Giovannetti, V., Lloyd, S. & Maccone, L. Advances in quantum metrology. Nat. Photonics 5, 222–229 (2011).
(7) Yuan, H. & Fung, C.-H. F. Optimal feedback scheme and universal time scaling for Hamiltonian parameter estimation. Phys. Rev. Lett. 115, 110401 (2015).
(8) Yuan, H. Sequential feedback scheme outperforms the parallel scheme for Hamiltonian parameter estimation. Phys. Rev. Lett. 117, 160801 (2016).