Finding Quantum Many-Body Ground States with Artificial Neural Network

Jiaxin Wu; Wenjuan Zhang

arXiv:1906.11216·cond-mat.dis-nn·June 27, 2019

Finding Quantum Many-Body Ground States with Artificial Neural Network

Jiaxin Wu, Wenjuan Zhang

PDF

Open Access

TL;DR

This paper introduces an unsupervised neural network-based algorithm to find quantum many-body ground states, demonstrating high accuracy on 1D Ising and Heisenberg models without assuming eigenvector forms.

Contribution

It presents a novel neural network method for unbiased ground state determination in quantum many-body systems, improving flexibility and accuracy.

Findings

01

Matches exact diagonalization results for 1D models

02

Unbiased approach without assuming eigenvector forms

03

Controlled accuracy in ground state calculations

Abstract

Solving ground states of quantum many-body systems has been a long-standing problem in condensed matter physics. Here, we propose a new unsupervised machine learning algorithm to find the ground state of a general quantum many-body system utilizing the benefits of artificial neural network. Without assuming the specific forms of the eigenvectors, this algorithm can find the eigenvectors in an unbiased way with well controlled accuracy. As examples, we apply this algorithm to 1D Ising and Heisenberg models, where the results match very well with exact diagonalization.

Equations43

X_{j + 1} = f (W_{j + 1} X_{j} + B_{j + 1}),

X_{j + 1} = f (W_{j + 1} X_{j} + B_{j + 1}),

C = \frac{1}{2} ∣ Y - X_{L} ∣^{2} .

C = \frac{1}{2} ∣ Y - X_{L} ∣^{2} .

W_{j} \mathchar 58 = W_{j} - η \frac{\partial C}{\partial W _{j}}, B_{j} \mathchar 58 = B_{j} - η \frac{\partial C}{\partial B _{j}},

W_{j} \mathchar 58 = W_{j} - η \frac{\partial C}{\partial W _{j}}, B_{j} \mathchar 58 = B_{j} - η \frac{\partial C}{\partial B _{j}},

X_{0} = X_{0}^{R} + i X_{0}^{I},

X_{0} = X_{0}^{R} + i X_{0}^{I},

X_{j + 1}^{R} = f (W_{j + 1}^{R} X_{j}^{R} + B_{j + 1}^{R}), X_{j + 1}^{I} = f (W_{j + 1}^{I} X_{j}^{I} + B_{j + 1}^{I}) .

X_{j + 1}^{R} = f (W_{j + 1}^{R} X_{j}^{R} + B_{j + 1}^{R}), X_{j + 1}^{I} = f (W_{j + 1}^{I} X_{j}^{I} + B_{j + 1}^{I}) .

H \cdot X_{N} = E \tilde{X}_{N},

H \cdot X_{N} = E \tilde{X}_{N},

C \equiv 1 - ∣ \tilde{X}_{N}^{†} \cdot X_{N} ∣^{2} .

C \equiv 1 - ∣ \tilde{X}_{N}^{†} \cdot X_{N} ∣^{2} .

G \equiv ∣ X_{N}^{†} \cdot (H X_{N}) ∣^{2} = ∣ E X_{N}^{†} \cdot \tilde{X}_{N} ∣^{2} .

G \equiv ∣ X_{N}^{†} \cdot (H X_{N}) ∣^{2} = ∣ E X_{N}^{†} \cdot \tilde{X}_{N} ∣^{2} .

W_{j} \mathchar 58 = W_{j} + η \frac{\partial G}{\partial W _{j}}, B_{j} \mathchar 58 = B_{j} + η \frac{\partial G}{\partial B _{j}},

W_{j} \mathchar 58 = W_{j} + η \frac{\partial G}{\partial W _{j}}, B_{j} \mathchar 58 = B_{j} + η \frac{\partial G}{\partial B _{j}},

H_{I s in g} =

H_{I s in g} =

H_{A F H} =

\frac{\partial C}{\partial B _{N}} = \frac{\partial C}{\partial X _{N}} \frac{\partial X _{N}}{\partial B _{N}} = \frac{\partial C}{\partial X _{N}} \frac{\partial f ( Z _{N} )}{\partial Z _{N}},

\frac{\partial C}{\partial B _{N}} = \frac{\partial C}{\partial X _{N}} \frac{\partial X _{N}}{\partial B _{N}} = \frac{\partial C}{\partial X _{N}} \frac{\partial f ( Z _{N} )}{\partial Z _{N}},

\frac{\partial C}{\partial B _{j}} = (\frac{\partial C}{\partial X _{N}} \frac{\partial X _{N}}{\partial X _{N - 1}} \dots \frac{\partial X _{j + 1}}{\partial X _{j}}) \frac{\partial f ( Z _{j} )}{\partial Z _{j}} .

\frac{\partial C}{\partial B _{j}} = (\frac{\partial C}{\partial X _{N}} \frac{\partial X _{N}}{\partial X _{N - 1}} \dots \frac{\partial X _{j + 1}}{\partial X _{j}}) \frac{\partial f ( Z _{j} )}{\partial Z _{j}} .

\frac{\partial C}{\partial X _{N}^{R}} = - 2 Re (\tilde{X}_{N} P), \frac{\partial C}{\partial X _{N}^{I}} = - 2 Im (\tilde{X}_{N} P),

\frac{\partial C}{\partial X _{N}^{R}} = - 2 Re (\tilde{X}_{N} P), \frac{\partial C}{\partial X _{N}^{I}} = - 2 Im (\tilde{X}_{N} P),

\frac{\partial C}{\partial B _{N}^{R}} = \frac{\partial C}{\partial X _{N}^{R}} \frac{\partial X _{N}^{R}}{\partial B _{N}^{R}} = - 2 Re (\tilde{X}_{N} P) sech^{2} (Z_{N}^{R})

\frac{\partial C}{\partial B _{N}^{R}} = \frac{\partial C}{\partial X _{N}^{R}} \frac{\partial X _{N}^{R}}{\partial B _{N}^{R}} = - 2 Re (\tilde{X}_{N} P) sech^{2} (Z_{N}^{R})

\frac{\partial C}{\partial B _{N}^{I}} = \frac{\partial C}{\partial X _{N}^{I}} \frac{\partial X _{N}^{I}}{\partial B _{N}^{I}} = - 2 Im (\tilde{X}_{N} P) sech^{2} (Z_{N}^{I}) .

\frac{\partial C}{\partial B _{j}^{R}} =

\frac{\partial C}{\partial B _{j}^{R}} =

\frac{\partial C}{\partial B _{j}^{I}} =

\frac{\partial C}{\partial W _{j}^{R}} = \frac{\partial C}{\partial B _{j}^{R}} (X_{j - 1}^{R})^{T}, \frac{\partial C}{\partial W _{j}^{I}} = \frac{\partial C}{\partial B _{j}^{I}} (X_{j - 1}^{I})^{T} .

\frac{\partial C}{\partial W _{j}^{R}} = \frac{\partial C}{\partial B _{j}^{R}} (X_{j - 1}^{R})^{T}, \frac{\partial C}{\partial W _{j}^{I}} = \frac{\partial C}{\partial B _{j}^{I}} (X_{j - 1}^{I})^{T} .

\frac{\partial G}{\partial X _{N}^{R}} =

\frac{\partial G}{\partial X _{N}^{R}} =

\frac{\partial G}{\partial X _{N}^{I}} =

\frac{\partial G}{\partial B _{N}^{R}} = \frac{\partial G}{\partial X _{N}^{R}} \frac{\partial X _{N}^{R}}{\partial B _{N}^{R}} =

\frac{\partial G}{\partial B _{N}^{R}} = \frac{\partial G}{\partial X _{N}^{R}} \frac{\partial X _{N}^{R}}{\partial B _{N}^{R}} =

\frac{\partial G}{\partial B _{N}^{I}} = \frac{\partial G}{\partial X _{N}^{I}} \frac{\partial X _{N}^{I}}{\partial B _{N}^{I}} =

\frac{\partial G}{\partial W _{j}^{R}} = \frac{\partial G}{\partial B _{j}^{R}} (X_{j - 1}^{R})^{T}, \frac{\partial G}{\partial W _{j}^{I}} = \frac{\partial G}{\partial B _{j}^{I}} (X_{j - 1}^{I})^{T} .

\frac{\partial G}{\partial W _{j}^{R}} = \frac{\partial G}{\partial B _{j}^{R}} (X_{j - 1}^{R})^{T}, \frac{\partial G}{\partial W _{j}^{I}} = \frac{\partial G}{\partial B _{j}^{I}} (X_{j - 1}^{I})^{T} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsQuantum many-body systems · Cold Atom Physics and Bose-Einstein Condensates · Quantum, superfluid, helium dynamics

Full text

Finding Quantum Many-Body Ground States

with Artificial Neural Network

Jiaxin Wu

[email protected]

Wenjuan Zhang

Department of Physics, the Ohio State University, Columbus, Ohio 43210, USA.

Abstract

Solving ground states of quantum many-body systems has been a long-standing problem in condensed matter physics.Here, we propose a new unsupervised machine learning algorithm to find the ground state of a general quantum many-body system utilizing the benefits of artificial neural network. Without assuming the specific forms of the eigenvectors, this algorithm can find the eigenvectors in an unbiased way with well controlled accuracy. As examples, we apply this algorithm to 1D Ising and Heisenberg models, where the results match very well with exact diagonalization.

††preprint: APS/123-QED

I Introduction

Machine learning (ML) has become a popular topic in physics, since its ingenuity and flexibility unprecedentedly allow computers to learn automatically about the underlying physics from the input data. Previous efforts have been carried out in designing various machine learning algorithms to study different physics problems, such as using Boltzmann machine to model thermodynamic observables [1], and using artificial neural network to study two-body scattering with short-range potentials [2]. In addition, in order to identify quantum phase transitions, Wetzel [3] utilize principal component analysis and variational autoencoder, meanwhile Broecker et al. [4] proposed a method using convolutional neural networks.

In condensed matter physics, one of the main hurdles is to find the eigenstates of interacting systems with reasonable system sizes. In the presence of interactions, the dimension of the Hilbert space grows exponentially with the system size, which prevents us from solving the Hamiltonian exactly except for small systems with exact diagonalization (ED). To solve this problem, many numerical methods have been put forward, such as quantum Monte Carlo (QMC) [5, 6, 7], density matrix renormalization group (DMRG) [8, 9, 10, 11], and so on. Everyone of them has its own advantages and constraints. For example, QMC uses stochastic sampling based on a probability that is related to the partition function of the system. It is a powerful algorithm to simulate thermodynamic properties of many-body systems, and the many-body ground states can be deduced from finite-size scaling. However, QMC suffers from the notorious “sign problem” rendering it unfit for some fermionic models. DMRG, on the other hand, is able to obtain accurate results for general large 1D or quasi-1D systems by keeping the most relevant components in local reduced density matrices, yet it performs poorly for higher dimensional systems. Variants of DMRG, Projected Entangled Pair States (PEPS) and multi-scale entanglement renormalization ansatz (MERA), have been proposed as solutions for the dimensional limitation. However DMRG and its variants use power law methods to diagonalize large matricies, redering the algorithm inefficient beyond a model dependent cut off.

With new tools provided by machine learning, we want to ask whether there exist new algorithms in finding the eigenstates of a general Hamiltonian. Recently, Carleo and Troyer [12] proposed to use restricted Boltzmann machine to find a variational ground state and its time evolution for a given Hamiltonian. In this paper, we address the problem by exploring an alternative unsupervised machine learning method with artificial neural network. Our goal is to find the ground-state eigenvectors of a general Hamiltonian without any assumptions of the form the eigenvectors.

The structure of this paper is organized as the following. We describe our machine learning algorithm in Section II. To illustrate how well this method works, we apply it to find the ground states of 1D Ising model and Heisenberg model, and compare the spin-spin correlators and ground-state energy with exact diagonalization in Section III. In the end, We conclude that using this method one is able to find the ground states accurately, and discuss how this method can potentially be applied to large systems beyond ED.

II Neural Network Structure

II.1 Traditional Deep Learning Structure

Let us start by introducing the structure of a deep learning artificial NN. Traditionally, A deep learning NN is constructed by an input layer, multiple hidden layers, and a output layer (See Fig. 1(a).). The input layer $X_{0}$ usually contains information of the input data, such as a vector describing every pixel of a picture. A hidden layer $X_{j}\ (j=1,2,3,\cdots,L-1)$ is a vector with arbitrary dimension, where every element is connected with every other element from the previous and next layer. The $X_{j}$ ’s are connected through parameters $W_{j}$ ’s and $B_{j}$ ’s:

[TABLE]

where $W_{j}$ is a matrix and $B_{j}$ is a vector. $f$ is a non-linear function called activation function which introduces non-linearity in the model. The notation here means that the activation function is acting on every element of the vector $(W_{j+1}X_{j}+B_{j+1})$ . The role of $f$ is important, because it is likely that $X_{L}$ and $X_{0}$ cannot be related by linear operations. Without the non-linearity from the activation function, a multi-layer NN structure has no difference from having only the input and output layers. The common choices of $f$ are sigmoid, tanh, etc. Within the same layer, the elements are independent of each other. An output layer $X_{L}$ can be a number or a vector containing output information from the NN. The task of a NN is to map the input $X_{0}$ to an output $X_{L}$ where $X_{L}$ is as close to the desired output $Y$ as possible.

In order to accomplish the task, one needs to have a well defined cost function, which guides the direction of the learning process. It usually quantifies the difference between $X_{L}$ and $Y$ . As one of the simplest examples, the cost function could be defined as

[TABLE]

In this case, if $X_{L}$ is the same as $Y$ , the cost function $C=0$ . The more $X_{L}$ deviates from $Y$ , the larger $C$ is. Now the goal is to minimize $C$ by updating the parameters in the NN. Because the optimization depends on the knowledge of the desired outputs, which requires one to label the data beforehand, this is a typical supervised machine learning method. The initial parameters $W_{j}$ ’s are usually chosen randomly and the $B_{j}$ ’s are left to be zero vectors. One can optimize the output by repeatedly updating the parameters in the direction where $C$ decays the fastest. To be more precise,

[TABLE]

where the partial derivative of $C$ is operated with respect to every matrix or vector element, and the sign “ $\mathrel{\mathop{\mathchar 58\relax}}=$ ” denotes updating the left-hand side with the value on the right-hand side. $\eta$ is a positive number called the learning rate, and it determines how fast $C$ descends in every update. As an example shown in Fig. 1(b), when $\frac{\partial C}{\partial W_{j}}<0$ at point $W_{j}^{\prime}$ , the new value of $W_{j}$ is updated forward. While $\frac{\partial C}{\partial W_{j}}>0$ at point $W_{j}^{\prime\prime}$ , $W_{j}$ is updated backward. In this way, the parameters are always updated in the direction of minimizing $C$ the fastest.

For a given algorithm, one has the freedom to tune the number of layers $N$ , the number of nodes in every hidden layers, the learning rate $\eta$ . These are called hyper parameters. Changing these parameters can affect the performance of the algorithm. Therefore, one usually needs to have some trial runs to optimize the choice of hyper parameters.

II.2 Modified Neural Network to Find Eigenvectors

Based on the deep learning structure, we propose a modified machine learning algorithm to find the ground state of a general Hamiltonian. Here, the input layer $X_{0}$ is a vector indicating the initial guess of the eigenvector given Hamiltonian $H$ . Without prior knowledge of the eigenvetors, one can choose $X_{0}$ as a random vector. Noticed that the eigenvector for a general Hamiltonian is complex, $X_{0}$ is a complex vector, which can be decomposed as

[TABLE]

where superscript $R$ and $I$ mean the real and imaginary part respectively. To simplify the calculation, we propagate the real and imaginary part of $X$ through the NN separately, i.e.

[TABLE]

Here, all $W^{R/I}_{j}$ ’s and $B^{R/I}_{j}$ ’s are real, and we choose the activation function to be $\tanh$ . To test whether the output vector $X_{N}$ is an eigenvector of $H$ , we first normalize $X_{N}$ such that $X_{N}^{\dagger}\cdot X_{N}=1$ . Then we calculate

[TABLE]

where $\tilde{X}_{N}$ is also properly normalized, and $E$ is a scalar. When $X_{N}$ is an eigenvector, $\tilde{X}_{N}=X_{N}$ and $E$ is the eigenenergy. To quantify how close $X_{N}$ is to be an eigenvector, we define the cost function as

[TABLE]

When $X_{N}$ is close to be an eigenvector, $C\rightarrow 0$ ; whereas when $X_{N}$ is far from being an eigenvector, $C\rightarrow 1$ . The goal of the learning process is to minimize $C$ as close to zero as possible, which can be done with gradient descent (See appendix.A.) according to Eq.(3).

Noticed that for the activation function $f(x)=\tanh(x)$ , its derivative is the largest when $x=0$ , which means the machine learns the most effectively when $W_{j+1}^{R/I}X_{j}^{R/I}+B_{j+1}^{R/I}\rightarrow 0$ for all $j$ ’s. To speed up the learning process, we initialize all elements of $W_{j}$ ’s with normal distributed random numbers multiplying by a small number (usually $10^{-2}\sim 10^{-4}$ ), and zero all $B_{j}$ ’s. For large systems, initializing $W_{j}$ ’s with small matrix elements can greatly improve the learning speed and the accuracy of the output vectors.

The minimization process stops when $C<\epsilon_{c}$ , where $\epsilon_{c}$ is the threshold for the cost function. It is a small positive number with $\epsilon_{c}<1$ . The smaller $\epsilon_{c}$ is, the more accurate the output eigenstates are, and usually the more layers are required in the NN.

The cost function in Eq.(7) has many minimums, and each of them corresponds to an eigenvector of $H$ . However, the eigenvectors with an eigenenergy further away from 0 are more likely to be found. The reason is the following. Supposed that the $\{\Psi_{m}\}$ is the set of orthonormal eigenvectors for $H$ with eigenvalues $\{\mathcal{E}_{m}\}$ , one can then write $X_{N}$ as a superposition where $X_{N}=\sum_{m}A_{m}\Psi_{m}$ . In every iteration, we compute Eq.(6), which could be written as $H\cdot\sum_{m}\Psi_{m}=E\sum_{m}\frac{\mathcal{E}_{m}}{E}\Psi_{m}=E\tilde{X}_{N}$ . Under many iteration, the eigen vector correspond to the largest $\mathcal{E}_{j}$ is going to be more important than the other vectors by some powers of $\mathcal{E}_{j}$ . This idea is similar to the power iteration method. In physics, we are usually most interested in the few lowest energy states. To increase the probability that the output states are one of them, one can shift the energy levels down by subtracting $H$ with a constant such that the highest energy states have eigenenergy close to 0. Therefore, after shifting the energy levels, the form of the cost function in Eq.(7) determines that one will most likely finds the few eigenstates with the lowest eigenenergy. Since the definition of $C$ doesn’t require prior knowledge of the correct eigenvectors, this is an unsupervised machine learning algorithm.

II.3 Gain Function for Convergence to Ground States

The cost function introduced in Section II.2 is good when one is interested to see a few eigenstates with relatively low energy. However, in a lot of cases, we are only interested in the ground state. If $H$ has large dimensions with some low excited states, it could take quite some trials until one finds the ground state. With the cost function in Eq.(7), one way to find it is to project out every output eigenstate in $H$ before running the next trail, and compare all the eigenvalues from these different eigenstates to determine which one corresponds to the ground state. But this projection method requires every output eigenstate to be extremely accurate, otherwise the Hamiltonian will get mixed up. In this subsection, we introduce the gain function, as opposed to the cost function, to help find the ground state more directly.

The gain function is defined as

[TABLE]

After properly shifting the energy levels, $G$ is maximized only when $X_{N}$ is the ground state eigenvector. One can then use gradient ascent (See Appendix.B for more details.)

[TABLE]

to iteratively maximize $G$ . In other words, the gradient ascent process refines $X_{N}$ to be the ground state eigenvector.

The typical behaviors of $G$ and $C$ in the process of gradient ascent is shown in Fig. 2. While maximizing the gain function $G$ , the cost function $C$ is able to jump out of the minimums of some excited states and reach the ground state minimum.

Unlike the cost function having a universal minimum value, the maximum value for the gain function is model dependent, which equals to the ground state energy square. Without prior knowledge, it is sometimes difficult to guess it accurately. In this case, we stop the optimization process when $G$ increase slowly and $C$ is small. To be more precise, we keep the gain function value from the last optimization step and call it $G^{\prime}$ . The process stops when $0<\frac{G-G^{\prime}}{|G^{\prime}|}<\epsilon_{g}$ and $C<\epsilon_{c}$ , where $0<\epsilon_{g}<1$ is a small number acting as a threshold for the gain function increasing rate. The cost function is now used as a “quality control”. It prevents the optimization process from stopping at regions where $G$ is passing by saddle points but the output vector is not an eigenvector. In general, using the gain function can significantly reduce the number of steps needed for the learning process with the same cost function threshold $\epsilon_{c}$ . Moreover, for the same NN structure, using the gain function usually increases the accuracy of the output eigenvectors, i.e. smaller $\epsilon_{c}$ becomes achievable.

However, based on the stopping condition described above, there are still chances where the optimization stops before the ground state is found, namely when an excited state eigenvector also correspond to a local minimum or saddle point in $G$ . This is a legitimate concern, and one should either compare several outputs with different initial condition, or judge based on estimate of the ground state energy, to decide whether the output vector is the ground state. As shown in Fig. 3, however, the probability of getting a ground state eigenvector is rather high when using the gain function in the optimization process.

III Numerical Results

To validate the ML algorithm, we apply it to find the ground states of the 1D Ising model with staggered field and anti-ferromagnetic Heisenberg models, where the two Hamiltonian are given by

[TABLE]

respectively. $S_{j}^{\alpha}=\frac{1}{2}\sigma_{j}^{\alpha}$ is the spin- $\frac{1}{2}$ operator of the $j$ th site with $\sigma_{j}^{\alpha}$ ’s being the Pauli matrices, and $L$ is the total number of sites in each system. In both cases, we use periodic boundary condition and shift the energy levels by adding a constant. For the Ising model, we also add a small staggered field ( $h=0.1$ in our case) to split the ground state degeneracy for more straight forward comparison between the exact ground state and the output ground state. In principle, the machine learning algorithm works even if there is degeneracy, where the output ground state should be in one of the superposition of the degenerate ground states. However, the ground states found with different initial conditions are not necessarily orthogonal to each other. One way to find all the ground states is to project out the ground states from the previous calculation in the Hamiltonian before running the algorithm. However, this procedure requires every ground state is obtained with very high accuracy, otherwise the Hamiltonian will be mixed up. To avoid this complicated situation in our demonstration, we choose models without ground state degeneracy.

In Fig. 3, we compare the neural network’s ability of finding the ground states using the cost function versus the gain function. Under many trials of random initial condition, let $N_{g}$ be the number of ground states found, and $N_{tot}$ be the total number of eigenstates found. We define the acceptance rate of ground states as $\frac{N_{g}}{N_{tot}}$ . When the systems are relatively small, both methods give rather high acceptance rate. As the system size grows, the dimension of the Hilbert space grows exponentially, and it becomes less likely to find the ground states by minimizing the cost function. However, the gain function remains relatively effective as a guidance to find the ground states. Noticed that the acceptance rate also depend on the hyper parameters, so this figure only provide a qualitative ratio.

Next, we show the ground-state spin-spin correlations and energy calculated from maximizing the gain function versus ED in Fig. 4 and Fig. 5. Plotted values for ML are the mean value of 100 output ground states with different random initial condition, and the error bars indicate the standard deviations in this ensemble. Fig. 4 and Fig. 5 show that the results from the ML algorithm match the ones from ED quite well.

IV Conclusion

In conclusion, we introduce a new machine learning algorithm to find the ground state eigenvector of a general Hamiltonian based on artificial neural network. This method does not have any constraints on the form of the Hamiltonian, nor does it require any prior knowledge of the target ground state. Moreover, the results are obtained with a controllable error rate. Therefore, the outputs are unbiased and can be made very accurate. Compare to ED, this algorithm does not involve solving any multivariable equations, but only matrix multiplications. As a result, one potential direction of applying it to large systems is to store the large matrices in hard drives and read in a few at a time for the matrix multiplication. Besides, the dimension of the matrices in the hidden layer can likely be reduced especially in the presence of symmetries in the Hamiltonian. In our discussion, we fix the dimension of the matrices in the hidden layers in order to reduce the number of hyper parameters, but there is no clear reason that this has to be the case. Reducing the dimension of these matrices can not only reduce the cost of memory, but also improve the speed of the algorithm. The minimum dimension may even reveal information about the “order parameter” of the Hamiltonian. Future work is needed to further explore these two possibilities.

Acknowledgements.

We thank Tin-Lun Ho, Niravkumar Patel, James Rowland and Wayne Zheng for illuminating discussion. JW further acknowledges support from MURI Grant FP054294-D, the NASA Grant on Fundamental physics 1541824, and the OSU MRSEC Seed Grant. WZ acknowledges support from the National Science Foundation Grant No. DMR-1629382.

Appendix A Gradient Descent of Cost Function

To update the parameters $W_{j}$ ’s and $B_{j}$ ’s, one needs to find the partial derivatives $\frac{\partial C}{\partial W_{j}}$ and $\frac{\partial C}{\partial B_{j}}$ . A simple way to do it is to first calculate the partial derivatives with respect to the last layer

[TABLE]

where $f$ is the activation function, and $Z_{j}=W_{j}X_{j-1}+B_{j}$ . In our calculation, we choose $f(z)=\tanh(z)$ . Notice that for $\tanh(z)$ , its derivative is the steepest at $z=0$ , which means that the machine learns the fastest when the value of $Z_{j}$ ’s are small. So when one initialize the parameters $W_{j}$ ’s, it is beneficial to choose small random numbers, and let $B_{j}$ ’s to be zero. Using the above result, one can calculate the partial derivative of the second to last layer and so on. For the $j$ th layer,

[TABLE]

Once $\frac{\partial C}{\partial B_{j}}$ is known, $\frac{\partial C}{\partial W_{j}}=\frac{\partial C}{\partial B_{j}}X_{j-1}$ . Repeatedly, one can update the parameters from the last layer to the first layer, and finish one updating process. This procedure is called back propagation.

With the definition of cost function C in Eq.(7),

[TABLE]

where $\mathcal{P}$ is a scalar defined as $\mathcal{P}\equiv{\tilde{X}_{N}^{\dagger}\cdot X_{N}}$ . In our algorithm, the real and imaginary parts are updated separately, i.e.

[TABLE]

where $Z^{R/I}_{j}=W^{R/I}_{j}X^{R/I}_{j-1}+B^{R/I}_{j}$ , and $\operatorname{sech}$ is obtained from the fact that we use $\tanh$ as the activation function. For $j\in[1,N)$ , we calculate $\frac{\partial C}{\partial B^{R/I}_{j}}$ based on the chain rule in Eq.(12), except that the real and imaginary parts are separate, i.e.

[TABLE]

As for $W_{j}$ ’s, we have

[TABLE]

To minimize $C$ to zero, we update the parameters $W_{j}$ ’s and $B_{j}$ ’s using gradient descent based on Eq.(3) until $C$ is smaller than a threshold $\epsilon_{c}$ , where $\epsilon_{c}<<1$ .

Appendix B Gradient Ascent of Gain Function

Similarly to gradient descent, we need to calculate the partial derivatives of $\frac{\partial G}{\partial W_{j}}$ and $\frac{\partial G}{\partial B_{j}}$ for each update for $G$ ’s gradient ascent. We continue to use the idea of back propagation described in Appendix.A.

For the gain function defined in Eq.(8),

[TABLE]

therefore,

[TABLE]

For $1\leq j<N$ , $\frac{\partial G}{\partial B_{j}}$ can be obtained through chain rule similar to Eq.(15) while separating the real and imaginary parts. For $W_{j}$ ’s,

[TABLE]

Since we want to maximize $G$ , we need to update the parameters in a opposite direction compared to $C$ , i.e. following Eq.(9) with a positive learning rate $\eta$ . Furthermore, the upper bound of $G$ depends on the Hamiltonian $H$ , so that we cannot define a general threshold at which the gradient ascent stops. Alternatively, we calculate the gain function from the current step $G_{c}$ and from the last step $G_{l}$ , and the gradient ascent stops when the gain function almost stop increasing at consecutive steps while the cost function is under the threshold, i.e. $(G_{c}-G_{l})/G_{l}<\epsilon_{g}$ with $(G_{c}-G_{l})>0$ and $C<\epsilon_{c}$ . Here, $\epsilon_{g}$ is the threshold for the increased ratio of the gain function and $\epsilon_{g}<<1$ .

Bibliography12

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Torlai and Melko [2016] G. Torlaiand R. G. Melko, Learning thermodynamics with boltzmann machines, Physical Review B 94 , 165134 (2016).
2Wu et al. [2018] Y. Wu, P. Zhang, H. Shen, and H. Zhai, Visualizing a neural network that develops quantum perturbation theory, Physical Review A 98 , 010701 (2018).
3Wetzel [2017] S. J. Wetzel, Unsupervised learning of phase transitions: From principal component analysis to variational autoencoders, Physical Review E 96 , 022140 (2017).
4Broecker et al. [2017] P. Broecker, F. F. Assaad, and S. Trebst, Quantum phase recognition via unsupervised machine learning, ar Xiv preprint ar Xiv:1707.00663 (2017).
5Foulkes et al. [2001] W. Foulkes, L. Mitas, R. Needs, and G. Rajagopal, Quantum monte carlo simulations of solids, Reviews of Modern Physics 73 , 33 (2001).
6Suzuki [1993] M. Suzuki, Quantum Monte Carlo methods in condensed matter physics (World scientific, 1993).
7Ceperley [1995] D. M. Ceperley, Path integrals in the theory of condensed helium, Reviews of Modern Physics 67 , 279 (1995).
8White [1992] S. R. White, Density matrix formulation for quantum renormalization groups, Physical review letters 69 , 2863 (1992).