Neural network agent playing spin Hamiltonian games on a quantum computer

Oleg M. Sotnikov; Vladimir V. Mazurenko

arXiv:1904.02467·quant-ph·April 22, 2020

TL;DR

This paper presents a neural network-based reinforcement learning agent that interacts with quantum computers to approximate spin Hamiltonian ground states, addressing decoherence and errors in quantum simulations.

Contribution

It introduces a novel autonomous agent framework trained via self-play on quantum devices to solve magnetism problems, incorporating local spin correction techniques.

Findings

1. Agent successfully learns entanglement to replicate ground states
2. Demonstrates effective interaction with noisy quantum hardware
3. Paves the way for neural network eigensolvers on quantum computers

Equations

(1) $E=\frac{1}{2}\left(B^{x}\langle\hat{X}\rangle+B^{y}\langle\hat{Y}\rangle+B^{z}\langle\hat{Z}\rangle\right)$

(2) $Q({\bf s},a)=r+\gamma\max_{k\in{\rm actions}}Q({\bf s^{\prime}},k)$

(3) $\sum_{ij}\langle\hat{\bf S}_{i}\hat{\bf S}_{j}\rangle=0$

(4) $\sum_{i}\langle\hat{\bf S}_{i}^{2}\rangle=-\sum_{i\neq j}\langle\hat{\bf S}_{i}\hat{\bf S}_{j}\rangle$

(5) $X=\frac{S(S+1)}{\frac{1}{N}\sum_{i}\langle\hat{\bf S}_{i}^{2}\rangle}$

(6) $|\Psi(\theta)\rangle=U_{\rm CNOT}U_{3}(\theta)|00\rangle$

(7) ${\bm{\theta}}_{k+1}={\bm{\theta}}_{k}-\alpha_{k}{\bf g}_{k}({\bm{\theta}}_{k})$

Full text

Oleg M. Sotnikov

Vladimir V. Mazurenko

Theoretical Physics and Applied Mathematics Department, Ural Federal University, 620002 Ekaterinburg, Russia

Abstract

Quantum computing is expected to provide promising new approaches for solving the most challenging problems in material science, communication, search, machine learning and other domains. However, due to decoherence and gate imperfection errors, modern quantum computer systems are characterized by a very complex, dynamical, uncertain and fluctuating computational environment. We develop an autonomous agent that interacts effectively with such an environment to solve magnetism problems. Using reinforcement learning, the agent is trained to find the best possible approximation of a spin Hamiltonian ground state from self-play on quantum devices. We show that the agent can learn entanglement to imitate the ground state of the quantum spin dimer. The experiments were conducted on quantum computers provided by IBM. To compensate for decoherence we use a local spin correction procedure derived from a general sum rule for spin-spin correlation functions of a quantum system with an even number of antiferromagnetically coupled spins in the ground state. Our study paves the way to a new family of neural network eigensolvers for quantum computers.

Reinforcement learning techniques were initially developed for creating autonomous intelligent robotic systems [thesis]. They are now successfully applied in very different decision-making domains such as games [game; Go1; Go2], traffic control systems [traffic], computer resource management [management], news recommendation systems [news], optimization of chemical reactions [chemical] and others. In a reinforcement learning approach, an agent interacts with an environment by taking actions, receives feedback, estimates rewards and corrects its actions to increase the future reward. This idea is very attractive, since in such a formulation the agent is fully autonomous and can develop different strategies to gain more. However, a practical realization of a reinforcement learning technique is problem-specific and requires additional innovations that ensure the stability and convergence of the numerical schemes.

Reinforcement learning techniques have also been actively developed and implemented in the new research field of quantum computing. Applications include quantum error correction in complex quantum devices [petru; qerr1; qerr2; qerr3], the design and implementation of quantum communication technologies [arxiv1], quantum gate control [arxiv2; arxiv21; arxiv22; arxiv23], quantum gate design [arxiv3] and quantum algorithms for reducing computational error [arxiv4]. It is important to note that only a few of these algorithms were tested on real quantum devices.

Motivated by the recent results of the Google DeepMind team [game] obtained for classic Atari games, in this work we develop and practically realize a reinforcement learning scheme for approximating the ground states of spin Hamiltonians on quantum computers. In this field of quantum computing two approaches are widely used to simulate magnetic systems. The first is the adiabatic simulation method [evolu1; evolu2; evolu3; Pogosov], which is based on the discretization (Trotterization) of the time evolution operator. The other is the variational quantum eigensolver proposed in Refs. [VQE; Troyer; Malley], which uses the Ritz variational principle to prepare approximations of the ground state of a magnetic model. Although they differ in construction, both approaches assume a fixed sequence of quantum gates.

Practical applications of these methods to the simplest quantum spin Hamiltonians on real home-made and public quantum devices [Pogosov; IBMQE; IBM1; evolu2] have revealed problems that are mainly related to decoherence and gate errors. For instance, the experiments on spin Hamiltonian dynamics [Pogosov] have demonstrated that such errors become more and more significant as the length of the quantum program increases. As a result, a few Trotter steps lead to a considerable error in the experimental data in comparison with exact results. On the other hand, the variational procedure [VQE] requires a calibration of the gradient-descent parameters to probe the energy landscape in the vicinity of the state defined by the current set of parameters. These additional measurements are also a source of errors.

In our study we follow a different logic and treat a spin Hamiltonian problem as a game with the following rules. Starting with a random quantum state, a player performs several quantum actions and measurements to get the best score, which means the lowest energy and, as a result, the best approximation of the spin Hamiltonian ground state. To play this game we develop a multi-neural-network agent that determines a sequence of quantum gates for a short quantum circuit. In contrast to previous approaches we do not use a fixed sequence of quantum gates: at each iteration the agent chooses a new gate for the quantum circuit depending on the current state of the quantum device, on the basis of the calculated correlation functions. During the training process the agent writes short quantum programs and runs them on a simulator with noise. Here we apply the latest expertise in the field of reinforcement learning [game]. For instance, we use the experience replay mechanism, which repeatedly presents past experiences to the learning algorithm, and an iterative update procedure that adjusts the action-values (Q) towards target values. These provide the stability of the whole scheme running on noisy quantum systems, both the simulator and the real device. Having trained the agent on the quantum simulator with the developed reinforcement learning technique, we demonstrate its performance on real IBM Quantum Experience devices.

1-qubit problem

We start with a description of the single-spin Hamiltonian problem to explain the details of our approach. The Hamiltonian is given by $\hat{\rm H}={\bf B}\hat{\bf S}$, where ${\bf B}$ is the external magnetic field and $\hat{\bf S}$ is the spin-$\frac{1}{2}$ operator. The components of the external magnetic field were chosen to be ${\bf B}=(1,1,1)$. The solution of the problem can be obtained with the universal $U_{3}(\theta,\phi,0)$ gate acting on a qubit in the initial state $\ket{0}$. Here $\theta\approx 2.186276$ rad and $\phi=-\frac{3\pi}{4}$. It gives the ground state $\ket{\Psi_{0}}=U_{3}\ket{0}$. The definitions of the gates we use are given in the Supplementary Material [supplementary].
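The quoted angles can be checked numerically. The sketch below assumes the common Qiskit convention for the $U_{3}$ matrix (the paper's own gate definitions are in its Supplementary Material and are not reproduced here) and uses the fact that the ground state of ${\bf B}\hat{\bf S}$ points against ${\bf B}$ on the Bloch sphere.

```python
import numpy as np

# Pauli matrices; the spin-1/2 operator is S = sigma / 2.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def u3(theta, phi, lam):
    """U3 gate in the common Qiskit convention (an assumption here)."""
    return np.array([
        [np.cos(theta / 2), -np.exp(1j * lam) * np.sin(theta / 2)],
        [np.exp(1j * phi) * np.sin(theta / 2),
         np.exp(1j * (phi + lam)) * np.cos(theta / 2)],
    ])


B = np.array([1.0, 1.0, 1.0])
H = 0.5 * (B[0] * X + B[1] * Y + B[2] * Z)

# The ground state points against B on the Bloch sphere.
n = -B / np.linalg.norm(B)
theta = np.arccos(n[2])        # polar angle
phi = np.arctan2(n[1], n[0])   # azimuthal angle

psi = u3(theta, phi, 0.0) @ np.array([1, 0], dtype=complex)
energy = float(np.real(np.vdot(psi, H @ psi)))
exact = float(np.linalg.eigvalsh(H)[0])
print(theta, phi, energy, exact)
```

Under these assumptions the angles reproduce $\theta\approx 2.186276$ and $\phi=-\frac{3\pi}{4}$, and the energy equals the lowest eigenvalue $-\sqrt{3}/2$.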

For the considered single spin problem the energy is given by

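$$E=\frac{1}{2}\left(B^{x}\langle\hat{X}\rangle+B^{y}\langle\hat{Y}\rangle+B^{z}\langle\hat{Z}\rangle\right),\qquad(1)$$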

where $\braket{\hat{Z}}$, $\braket{\hat{X}}$ and $\braket{\hat{Y}}$ are the correlation functions calculated from the probabilities of the basis states, which are the standard output of a quantum computer. It is important to note that these correlators, when estimated with a real quantum computer, are subject to decoherence and gate imperfection errors. The simulation of these errors is an active field of research [noise1; noise2; noise3]. To take these effects into account we used the noise generator implemented in the Qiskit Aer module [noise_Aer]. It includes gate and readout errors imitating the real quantum device noise, approximated as a relaxation process of the qubits involved in the experiment. Although the noise model of the local simulator is simple and takes into account only local errors occurring on each gate, we will show that including device noise even at this level significantly improves the agreement between the simulations and the experiments on the real device. We used a basic device noise model with parameters reported in the official Qiskit tutorials for the IBM Q 16 Melbourne device [qiskit_tutorials].
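To make the step from basis-state probabilities to the energy concrete, here is a minimal sketch. It assumes each Pauli operator is measured in its own eigenbasis so that the correlator is $p(0)-p(1)$. The helper `sample_counts` is a hypothetical stand-in for a device or simulator run, and the shot count of 8192 is an illustrative choice, not a value taken from the paper.

```python
import numpy as np


def expectation_from_counts(counts):
    """<P> = p(0) - p(1) for a single-qubit Pauli measured in its eigenbasis."""
    n0 = counts.get("0", 0)
    n1 = counts.get("1", 0)
    return (n0 - n1) / (n0 + n1)


def energy_from_correlators(x, y, z, b=(1.0, 1.0, 1.0)):
    """Single-spin energy E = (B^x <X> + B^y <Y> + B^z <Z>) / 2."""
    return 0.5 * (b[0] * x + b[1] * y + b[2] * z)


def sample_counts(p0, shots, rng):
    """Hypothetical stand-in for a hardware run: outcomes with P(0) = p0."""
    n0 = int(rng.binomial(shots, p0))
    return {"0": n0, "1": shots - n0}


rng = np.random.default_rng(0)
# For B = (1, 1, 1) the ideal ground-state correlators all equal -1/sqrt(3).
ideal = -1 / np.sqrt(3)
p0 = (1 + ideal) / 2
x, y, z = (expectation_from_counts(sample_counts(p0, 8192, rng))
           for _ in range(3))
print(energy_from_correlators(x, y, z))
```

Even without device noise, the finite number of shots makes the estimated energy fluctuate around the exact value, which is the behaviour the noise-free simulator shows.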

With the circuits presented in Fig. 1 we performed 100 independent calculations on a simple quantum simulator, 100 calculations on the simulator with noise and 100 experiments on the real quantum device. In all cases we observe fluctuations of the measurement results. For instance, for the simple simulator the energy fluctuates around the exact value. Taking decoherence and gate imperfections into account with the noise model leads to a higher average energy of about -0.8. Real experimental results obtained with the IBM Q 16 Melbourne device are characterized by very strong fluctuations of the energy around -0.75.

These experiments clearly demonstrate that even for the simplest problem one deals with a complex, dynamical, uncertain and fluctuating quantum computing environment. This motivates the development of an autonomous agent that interacts effectively with such an environment.

Neural network agent

The agent we develop is a multi-network one, in accordance with the one-action-one-network concept proposed in Ref. [thesis]. There is a separate network for each action, but the structures of all the networks are the same (Fig. 2). Each contains an input layer, one hidden layer and an output layer. The network takes the spin correlation functions obtained with a quantum computer or simulator as input. The number of input neurons depends on the problem we solve. For instance, in the case of the one-qubit problem (a single spin in an external magnetic field) there are three correlation functions that form a state vector ${\bf s}=\{\braket{\hat{X}},\braket{\hat{Y}},\braket{\hat{Z}}\}$ characterizing the quantum system in question. In turn, there are 6 single-spin and 9 spin-spin correlation functions (15 in total) in the case of the dimer spin Hamiltonian problem.

The number of neurons in the hidden layer also depends on the size of the problem. In the one- and two-qubit cases we consider, the number of hidden neurons is 32 and 64, respectively. The output layer contains only one neuron, which represents the predicted action-value Q for the particular quantum gate (action). Having compared the calculated Q-values among all the networks, the agent chooses the gate for which the action-value function is maximal.

In this way one defines an optimal $Q$-function satisfying [thesis]

$$Q({\bf s},a)=r+\gamma\max_{k\in{\rm actions}}Q({\bf s^{\prime}},k).\qquad(2)$$

According to this expression, the utility of an action $a$ in response to a state ${\bf s}$ equals the immediate reward $r$ plus the best utility that can be obtained from the next state ${\bf s^{\prime}}$, discounted by the factor $\gamma$. During reinforcement learning, the difference between the two sides of Eq. (2) is minimized using a back-propagation algorithm [supplementary].
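The one-action-one-network structure and the target on the right-hand side of Eq. (2) can be sketched as follows. This is an illustration rather than the authors' implementation: the `tanh` activation, the weight initialization, the discount factor of 0.9 and the input values are all assumptions, and the back-propagation update itself is omitted.

```python
import numpy as np


class QNetwork:
    """One small network per action: inputs -> one hidden layer -> scalar Q."""

    def __init__(self, n_inputs, n_hidden, rng):
        self.w1 = rng.normal(0.0, 0.1, size=(n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, size=n_hidden)
        self.b2 = 0.0

    def q_value(self, state):
        hidden = np.tanh(self.w1 @ state + self.b1)  # activation is assumed
        return float(self.w2 @ hidden + self.b2)


def greedy_action(networks, state):
    """Evaluate every action's network and pick the largest Q-value."""
    q_values = [net.q_value(state) for net in networks]
    return int(np.argmax(q_values)), q_values


def td_target(reward, next_state, networks, gamma):
    """Right-hand side of Eq. (2): r + gamma * max_k Q(s', k)."""
    return reward + gamma * max(net.q_value(next_state) for net in networks)


rng = np.random.default_rng(1)
n_actions = 9  # identity plus eight rotations in the one-qubit case
networks = [QNetwork(n_inputs=3, n_hidden=32, rng=rng) for _ in range(n_actions)]
state = np.array([0.2, -0.5, 0.1])  # (<X>, <Y>, <Z>), made-up values
action, q_values = greedy_action(networks, state)
target = td_target(reward=0.05, next_state=state, networks=networks, gamma=0.9)
```

Training would then adjust only the network of the chosen action so that its output moves towards `target`.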

Quantum computer programming

The developed agent is designed to define a quantum circuit according to the predicted action values Q. In the case of the one-qubit problem we use the following set of gates: the identity gate, $U_{\theta}$, $U_{\phi}$, $U_{-\theta}$, $U_{-\phi}$, $U_{\theta/2}$, $U_{\phi/2}$, $U_{-\theta/2}$ and $U_{-\phi/2}$, which are defined with the universal rotation gate $U_{3}$ as described in Ref. [supplementary].

The process of quantum circuit construction is illustrated by the example in Fig. 3. We start with a universal $U_{3}$ gate with random angles. After each new gate is added to the quantum circuit, the correlation functions are measured. This means that one can trace the energy evolution as the length of the quantum circuit increases.

Agent’s training

The agent was trained with the IBM QE simulator including the noise model. Each training iteration contains three main steps. (i) The agent takes an action following an $\epsilon$-greedy policy. Having added a new gate to the quantum circuit, the agent estimates the reward from the observation, $r=E_{t}-E_{t+1}$. (ii) The sequence (state, action, reward, new state) explored by the agent is stored in the replay memory. (iii) A sequence randomly chosen from the memory is used to optimize the weights of the neural network. A complete description of the technical details of the agent's training is presented in Ref. [supplementary]. From Fig. 4 it follows that the 10-gate agent demonstrated the best average reward during the training.
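The three steps of a training iteration can be sketched with standard-library Python. The class and function names below are hypothetical, the memory capacity and energy values are made up for illustration, and the weight update in step (iii) is left out because its details are given only in the Supplementary Material.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (state, action, reward, new state) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, rng):
        # Step (iii): draw one stored transition uniformly at random.
        return self.buffer[rng.randrange(len(self.buffer))]

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng):
    """Step (i): explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def reward(e_before, e_after):
    """r = E_t - E_{t+1}: positive when the new gate lowers the energy."""
    return e_before - e_after


rng = random.Random(0)
memory = ReplayMemory(capacity=3)
energies = [-0.20, -0.35, -0.30, -0.50, -0.55]  # illustrative energy trace
for t in range(len(energies) - 1):
    r = reward(energies[t], energies[t + 1])
    memory.push(("s", t), 0, r, ("s", t + 1))  # step (ii)
transition = memory.sample(rng)
```

Sampling past transitions at random, instead of always learning from the most recent one, is what decorrelates consecutive updates in the experience replay mechanism.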

Experiments

To examine the performance of the trained agent we performed experiments on a real quantum device provided by IBM. Fig. 5 shows these results at the level of individual experiments on the real quantum device. One can see that the agent decreases the energy of the system starting from different random states and approaches $E$ = -0.75, defined as the average energy for the circuit (Fig. 1b) simulating the exact solution of the problem. If the energy of the initial random state is low enough and close to the ground state, the agent follows a passive strategy, trying to keep such a winning position. On the other hand, a high-energy initial state is a signal to the agent to decrease the energy as much as possible from the current position. More results and discussions can be found in the Supplementary Material [supplementary].

Spin dimer

Having discussed the single-qubit problem, we are in a position to consider the agent's training in the two-qubit case. The corresponding spin model contains two antiferromagnetically coupled spins, ${\rm\hat{H}}=J\hat{\bf S}_{1}\hat{\bf S}_{2}$. Here $J$ is the isotropic exchange interaction. The ground state of this model is the singlet state $\ket{\Psi_{0}}=\frac{1}{\sqrt{2}}(\ket{\uparrow\downarrow}-\ket{\downarrow\uparrow})$. Such an entangled state can be realized on a quantum computer with the circuit presented in Fig. 6 (a). However, the experiments on the IBM Q Vigo (Fig. 6 (b)) have shown significant contributions of the triplet states, $\ket{\uparrow\uparrow}$ and $\ket{\downarrow\downarrow}$. As a result, the experimental ground state energy of -0.51 is considerably higher than the exact value of -0.75. This also explains the disagreement between experiment and theory on the ground state of the Heisenberg model defined on a single square reported in Ref. [VQE].
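The exact value of -0.75 (for $J=1$) and the singlet preparation can be reproduced with a few lines of linear algebra. The gate sequence used here (X on both qubits, a Hadamard on the first, then CNOT) is one standard way to build the singlet and is an assumption; it is not necessarily the circuit of Fig. 6 (a).

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
HAD = (X + Z) / np.sqrt(2)
# CNOT with the first qubit as control, basis order |00>, |01>, |10>, |11>.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

J = 1.0
# S1.S2 = (XX + YY + ZZ) / 4 for spin-1/2 operators S = sigma / 2.
H = J * 0.25 * (np.kron(X, X) + np.kron(Y, Y) + np.kron(Z, Z))
exact = float(np.linalg.eigvalsh(H)[0])

zero = np.array([1, 0], dtype=complex)
# (H X|0>) x (X|0>) = (|01> - |11>)/sqrt(2); CNOT maps it to the singlet.
psi = CNOT @ np.kron(HAD @ X @ zero, X @ zero)
energy = float(np.real(np.vdot(psi, H @ psi)))

singlet = np.array([0, 1, -1, 0], dtype=complex) / np.sqrt(2)
overlap = abs(np.vdot(singlet, psi))
print(exact, energy, overlap)
```

In this noise-free calculation the prepared state has unit overlap with the singlet, so any triplet weight seen on hardware comes from device errors.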

Another important observation is that each experiment conducted on the real IBM Q Vigo device gives correlation functions that differ from those of the other experiments, which means that the correlation functions and energies fluctuate within a series of independent experiments (Fig. 6 (c)). Moreover, sets of experiments conducted with the same circuits but at different times can give different average energies [supplementary].

In the two-qubit case the action list for the agent includes the same gates as for the single-qubit agent and an additional Controlled NOT (CNOT) gate that is responsible for the entanglement [supplementary]. The agent was trained starting with different random classical ground states. Such a non-entangled state is formed in the following way. The initial state of the first qubit is changed with a random $U_{3}$ gate. Then the state of the second qubit is set to be antiparallel to the first one on the Bloch sphere. Interestingly, the agent has learned from self-play that it can overcome the classical energy limit by using the CNOT gate. The specific structure of the CNOT matrix requires an initial preparation with single-qubit rotations before the entangling gate can decrease the total energy further.
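The classical energy limit mentioned above can be illustrated numerically: any product state whose two Bloch vectors are antiparallel has energy $-J/4$, whatever the direction. In this sketch the random first-qubit state is parametrized directly by Bloch angles, which stands in for the random $U_{3}$ gate, and the seed and number of trials are arbitrary.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = 0.25 * (np.kron(X, X) + np.kron(Y, Y) + np.kron(Z, Z))  # J = 1


def bloch_state(theta, phi):
    """Single-qubit state pointing along (theta, phi) on the Bloch sphere."""
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])


def antipodal(state):
    """Orthogonal state, i.e. the opposite point on the Bloch sphere."""
    a, b = state
    return np.array([-np.conj(b), np.conj(a)])


rng = np.random.default_rng(2)
energies = []
for _ in range(5):
    first = bloch_state(rng.uniform(0, np.pi), rng.uniform(0, 2 * np.pi))
    psi = np.kron(first, antipodal(first))  # non-entangled, antiparallel
    energies.append(float(np.real(np.vdot(psi, H @ psi))))
print(energies)
```

Every entry equals -0.25, so reaching lower energies requires an entangling gate such as CNOT.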

In Fig. 7 (a) we compare the rewards obtained in training processes performed with different values of the elementary rotation angle, $\delta$ = 0.5, 1.0 and 1.5 rad (see the Supplementary Material for further details on the rotation angle). One can see that the best and most stable results were obtained with $\delta$ = 1. In general, the choice of the particular angle value can also be considered a part of the reinforcement learning algorithm we propose. We leave a practical realization of such an option for future investigation.

Figure 7 (b) shows that the classical energy of the spin dimer estimated with the quantum device is about -0.2, which is higher than the exact solution of -0.25. The trained two-qubit agent decreases the energy of the system to the level of -0.6 (Fig. 7 (b)), which is the level obtained with the singlet state circuit.

The average values denoted with blue lines in Fig. 5 (Fig. 1 (c)) and Fig. 7 (b) (Fig. 6 (c)) were obtained with the shortest quantum circuits (a one-gate circuit for the single-spin problem and a three-gate circuit for the dimer problem) imitating the exact solution. In the case of the reinforcement learning technique the agent constructs circuits of 10 gates. Thus, the neural network results obtained with longer circuits closely approach the level of the short-circuit data, which is an additional demonstration of the performance of our method.

It is also important to analyze the evolution of the energy obtained with the neural network during individual experiments. The examples given in Fig. 8 demonstrate that the largest energy decreases are mainly achieved with the CNOT gate, which can be applied not only at the beginning of the circuit construction. To keep the energy minimal, the agent uses the rotation gates with the smallest angles from the action list, such as $\frac{\phi}{2}$. This can be seen in experiments 6, 18 and 21.

Local spin correction

As shown above, both the direct imitation of the singlet ground state and its neural network approximation on the real quantum device are characterized by energies that are considerably higher than the exact ground state energy. Such a disagreement between experiment and theory is mainly related to decoherence and gate errors. In this situation it is important to find a way to compensate for such hardware imperfections. Different strategies can be used for that. The authors of Ref. [Carretta] have simulated time-dependent spin-spin correlation functions of Heisenberg-type magnetic systems in high magnetic fields. At these magnetic fields the quantum ground state of the spin system is fully polarized. To fit the experimental results obtained with an IBM quantum computer to the exact ones, the authors introduced a phase-and-scale procedure that is based on an artificial phase correction of the local spin-spin correlation function at zero time and the application of the same correction to the whole time domain. Below we show that for quantum systems in the singlet ground state a similar correction procedure can be derived from a general sum rule for the spin-spin correlation functions.

For quantum systems characterized by a singlet-type ground state we propose a local moment correction procedure that systematically improves the agreement between experiment and theory on the ground state energy. The ground state spin-spin correlation functions of a quantum spin system with an even number of spins and antiferromagnetic interactions between them satisfy the following sum rule

$$\sum_{ij}\langle\hat{\bf S}_{i}\hat{\bf S}_{j}\rangle=0.\qquad(3)$$

The sum in this equation contains local and non-local contributions that can be decomposed. It gives

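$$\sum_{i}\langle\hat{\bf S}_{i}^{2}\rangle=-\sum_{i\neq j}\langle\hat{\bf S}_{i}\hat{\bf S}_{j}\rangle.\qquad(4)$$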

The sum on the right-hand side contains all possible pair correlation functions in the system. It is precisely these non-local correlation functions that are calculated with the quantum computer at each iteration of our algorithm and used as input for the neural network. Within the correction procedure we use them to estimate the local correlator on the left-hand side of Eq. (4), which should be compared with the exact value $S(S+1)$. This exact value is the important insider information we have about the unknown quantum system. The ratio

$$X=\frac{S(S+1)}{\frac{1}{N}\sum_{i}\langle\hat{\bf S}_{i}^{2}\rangle}\qquad(5)$$

can be used to correct the non-local correlation functions, $X\langle\hat{\bf S}_{i}\hat{\bf S}_{j}\rangle$ ($N$ is the number of spins). This makes it possible to correct the estimated energy of the Heisenberg system. In the case of the dimer system we have a trivial situation, since the non-local spin-spin correlation functions $\braket{\hat{\mathbf{S}}_{1}\hat{\mathbf{S}}_{2}}$ and $\braket{\hat{\mathbf{S}}_{2}\hat{\mathbf{S}}_{1}}$ fully define the on-site spin-spin correlators $\braket{\hat{\mathbf{S}}_{1}^{2}}$ and $\braket{\hat{\mathbf{S}}_{2}^{2}}$. The modified experimental results are presented in Fig. 7 (c). One can see that most of the experiments gave a very accurate estimate of the exact energy of the spin dimer.
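A literal reading of the correction can be written as a short function. This is a sketch under stated assumptions: the function name and the dictionary layout are hypothetical, each unordered pair is stored once and counted twice in the $i\neq j$ sum, and the input value of -0.6 is only an illustration at the level of the device energies quoted above. How the authors apply the procedure to individual experiments is not specified in this section.

```python
def local_spin_correction(pair_correlators, n_spins, spin=0.5):
    """Rescale measured <S_i.S_j> by the ratio X of the local spin correction.

    pair_correlators maps (i, j) with i < j to the measured <S_i.S_j>.
    """
    # The i != j sum runs over both orderings of every pair.
    nonlocal_sum = 2.0 * sum(pair_correlators.values())
    # Sum rule: the summed on-site correlators equal minus the non-local sum.
    local_sum = -nonlocal_sum
    # Ratio of the exact S(S+1) to the estimated average on-site correlator.
    ratio = spin * (spin + 1) / (local_sum / n_spins)
    corrected = {pair: ratio * value for pair, value in pair_correlators.items()}
    return ratio, corrected


measured = {(0, 1): -0.6}  # illustrative dimer correlator from a noisy device
ratio, corrected = local_spin_correction(measured, n_spins=2)
energy = 1.0 * corrected[(0, 1)]  # J = 1
print(ratio, energy)
```

With this input the ratio is 1.25 and the corrected dimer energy is -0.75. Under this literal reading, a dimer with a single pair correlator is always rescaled to $-S(S+1)$ when the measured value is negative, which is one way to read the remark that the dimer case is trivial.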

The procedure we propose is of general nature and can be used not only in the case of the neural network solver presented in this work but also in conjunction with other quantum computer eigensolvers.
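The sum rule behind this procedure can be motivated in a few lines, assuming the ground state is a total-spin singlet ($S_{\rm tot}=0$), as stated for the systems considered here. With $\hat{\bf S}_{\rm tot}=\sum_{i}\hat{\bf S}_{i}$,

```latex
\begin{aligned}
\sum_{ij}\langle\hat{\bf S}_{i}\hat{\bf S}_{j}\rangle
  &= \langle\hat{\bf S}_{\rm tot}^{2}\rangle
   = S_{\rm tot}(S_{\rm tot}+1) = 0, \\
\sum_{i}\langle\hat{\bf S}_{i}^{2}\rangle
  &= -\sum_{i\neq j}\langle\hat{\bf S}_{i}\hat{\bf S}_{j}\rangle, \\
\text{dimer } (N=2,\ S=\tfrac{1}{2}):\quad
2\cdot\tfrac{3}{4}
  &= -2\,\langle\hat{\bf S}_{1}\hat{\bf S}_{2}\rangle
  \;\Rightarrow\;
  \langle\hat{\bf S}_{1}\hat{\bf S}_{2}\rangle = -\tfrac{3}{4}.
\end{aligned}
```

The last line recovers the exact dimer energy of -0.75 for $J=1$.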

Comparison with variational quantum eigensolver

Since we propose a new neural network eigensolver, it is important to apply previously developed quantum computer approaches to the same ground state problem of the Heisenberg dimer and compare the performance of these methods. As discussed in the introduction, there are two standard quantum computer eigensolvers: the variational quantum algorithm and the phase estimation method. A comparison of these methods can be found in Ref. [Malley], in which the authors performed quantum computer experiments and computed the energy of the hydrogen molecule. For the quantum computer experiments this electronic structure problem is formulated as a 2-spin Hamiltonian similar to the one we consider in this work. It was demonstrated that the variational quantum approach outperforms the phase estimation algorithm. That is why we compare the former method with our neural network solver using the example of the spin dimer model. For that, the realization of the variational approach reported in Ref. [VQE] was used. Within this approach the wave function of a quantum Heisenberg model is stored on a quantum device and represented in the following form

$$|\Psi(\theta)\rangle=U_{\rm CNOT}U_{3}(\theta)|00\rangle,\qquad(6)$$

where $|00\rangle$ is the initial state of the 2-qubit system, $U_{3}$ is a set of rotation gates, $U_{\rm CNOT}$ is the CNOT gate that is responsible for the entanglement and $\theta$ is the set of angles we vary to approximate the ground state of the spin Hamiltonian. The angles at the $(k+1)$ th iteration are defined as

$${\bm{\theta}}_{k+1}={\bm{\theta}}_{k}-\alpha_{k}{\bf g}_{k}({\bm{\theta}}_{k}),\qquad(7)$$

where ${\bf g}_{k}({\bm{\theta}}_{k})$ is the gradient at ${\bm{\theta}}_{k}$ and $\alpha_{k}$ is the step size parameter. This variational scheme contains a number of parameters, for which we use the same values as proposed in Ref. [VQE]. Every 20 steps we perform a calibration procedure to probe the energy landscape at the quantum state defined by the current set of rotation angles and to renew the gradient-descent parameters.
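The update rule for the angles can be tried on a noise-free state-vector model of the dimer ansatz. This is a simplified sketch, not the scheme of Ref. [VQE]: the gradient is a central finite difference, the step size is fixed at an assumed $\alpha=0.1$ with no calibration, and only two angles per qubit are varied because the third $U_{3}$ angle drops out when the gate acts on $\ket{0}$ (Qiskit convention assumed).

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = 0.25 * (np.kron(X, X) + np.kron(Y, Y) + np.kron(Z, Z))  # J = 1
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)


def u3_on_zero(theta, phi):
    """U3(theta, phi, lambda)|0>; lambda has no effect on this state."""
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])


def energy(angles):
    """Energy of the ansatz CNOT (U3 x U3)|00> for angles (t1, p1, t2, p2)."""
    t1, p1, t2, p2 = angles
    psi = CNOT @ np.kron(u3_on_zero(t1, p1), u3_on_zero(t2, p2))
    return float(np.real(np.vdot(psi, H @ psi)))


def gradient(angles, eps=1e-4):
    """Central finite-difference gradient (a stand-in for g_k)."""
    grad = np.zeros_like(angles)
    for i in range(len(angles)):
        shift = np.zeros_like(angles)
        shift[i] = eps
        grad[i] = (energy(angles + shift) - energy(angles - shift)) / (2 * eps)
    return grad


rng = np.random.default_rng(3)
angles = rng.uniform(0, 2 * np.pi, size=4)
e_init = energy(angles)
alpha = 0.1  # fixed step size, an assumption
for k in range(500):
    angles = angles - alpha * gradient(angles)  # angle update at step k
e_final = energy(angles)
print(e_init, e_final)
```

The singlet is reachable with this ansatz, so the final energy can approach -0.75, although convergence from a given random start is not guaranteed. On hardware every energy evaluation is itself noisy, which is why the step size needs the calibration described above.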

The results obtained with the variational approach on the simulator, the simulator with noise and the real quantum device are presented in Fig. 9. In the case of the ideal quantum simulator we observe excellent agreement between the calculated energies and the exact results. The average energy estimated with the noisy simulator is -0.7, which is higher than the exact solution of -0.75. In turn, the real quantum device IBM Q Vigo gives an energy of -0.6. This agrees with the neural network results (Fig. 7 (b)). The real experiments are characterized by fluctuations of the energy after 200 iterations. They are due to the large values of the step size $\alpha$ in Eq. (7) obtained with the calibration procedure. One possible solution is to cap $\alpha$ at 2.

Despite the agreement between the real-device results of the variational and neural network solvers, there are important differences between them. First of all, the neural network results were obtained with circuits of 10 gates, whereas there are 3 gates (two universal $U_{3}$ gates and one CNOT gate) in the variational scheme. Given that gate errors accumulate as the circuit length increases, our neural network approach appears to be robust against such errors. Another important difference is the initial quantum state. For the neural network solver we always start with a random initial quantum state, which suggests a way to avoid local energy minima in more complicated spin Hamiltonian problems.

Conclusion

We have developed a neural-network agent that approximates the ground state of a spin model on a quantum device. The agent was trained by reinforcement learning from self-play on a quantum simulator with noise. The general objective of our reinforcement learning approach is to obtain an agent that can play moderately well in all cases, even on noisy real quantum devices. The agent's performance was demonstrated on a quantum simulator and on real quantum devices. In the case of the dimer problem we found that the agent can learn entanglement by applying the CNOT gate. In combination with the local spin correction, our neural network approach provides excellent agreement with the exact dimer solution for the ground state energy.

To treat systems with more than two qubits, a new strategy should be adopted. For instance, one can use a single deep neural network for all the gates from the action list. In the current formulation of the neural network approach the set of observables increases exponentially with the system size. Moreover, to compensate for decoherence we need to determine all the non-local spin-spin correlation functions to get an accurate estimate of the ground state energy. This is a problem if we use classical computers to simulate the neural network. The problem could be solved if the network is realized with another quantum device. Recent studies [quantum_network1; quantum_network2] have demonstrated that the equivalent of m-dimensional classical input and weight vectors can be encoded on quantum hardware by using N qubits, where m = $2^{N}$. A similar approach would make it possible to exploit the exponential advantage of quantum information storage.

Another important problem is a smart selection of the implemented gates, since their number grows substantially as the system size increases. Such a selection can also be carried out with a machine learning approach [PNAS].

Acknowledgements.

This work was supported by the Russian Science Foundation, Grant No. 18-12-00185.

Bibliography

(1) L. J. Lin, Reinforcement learning for robots using neural networks, Technical Report, DTIC Document (1993).
(2) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Nature 518, 529 (2015).
(3) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Nature 529, 484 (2016).
(4) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Nature 550, 354 (2017).
(5) I. Arel, C. Liu, T. Urbanik, and A. G. Kohls, IET Intell. Transp. Syst. 4, 128 (2010).
(6) H. Mao, M. Alizadeh, I. Menache, and S. Kandula, Proceedings of the 15th ACM Workshop on Hot Topics in Networks, http://doi.acm.org/10.1145/3005745.3005750.
(7) G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li, Proceedings of the 2018 World Wide Web Conference, https://doi.org/10.1145/3178876.3185994.
(8) Z. Zhou, X. Li, and R. N. Zare, ACS Cent. Sci. 3, 1337 (2017).