General-purpose quantum circuit simulator with Projected Entangled-Pair States and the quantum supremacy frontier

Chu Guo; Yong Liu; Min Xiong; Shichuan Xue; Xiang Fu; Anqi Huang; Xiaogang Qiang; Ping Xu; Junhua Liu; Shenggen Zheng; He-Liang Huang; Mingtang Deng; Dario Poletti; Wan-Su Bao; Junjie Wu

arXiv:1905.08394·quant-ph·November 7, 2019

TL;DR

This paper introduces a versatile quantum circuit simulator based on Projected Entangled-Pair States, capable of modeling 2D quantum systems and assessing the quantum supremacy frontier by efficiently computing amplitudes of large qubit lattices.

Contribution

It presents a novel application of PEPS algorithms as a general-purpose quantum circuit simulator for 2D systems, enabling precise resource estimation and amplitude computation.

Findings

01 Computed one amplitude of a $7\times 7$ qubit lattice at depth $(1+40+1)$ in 31 minutes.

02 Used less than 93 TB of memory on the Tianhe-2 supercomputer.

03 Demonstrated the simulator's capability to explore quantum supremacy boundaries.

Abstract

Recent advances in quantum computing hardware have pushed quantum computing to the verge of quantum supremacy. Random quantum circuits are outstanding candidates to demonstrate quantum supremacy, which could be implemented on a quantum device that supports nearest-neighbour gate operations on a two-dimensional configuration. Here we show that using the Projected Entangled-Pair States algorithm, a tool to study two-dimensional strongly interacting many-body quantum systems, we can realize an effective general-purpose simulator of quantum algorithms. This technique allows us to quantify precisely the memory usage and the time requirements of random quantum circuits, thus showing the frontier of quantum supremacy. With this approach we can compute the full wave-function of the system, from which single amplitudes can be sampled with unit fidelity. Applying this general quantum circuit…

Tables

Table 1: Large-scale simulation with the PEPS-based circuit simulator. The column denoted by “Node usage” indicates the number of nodes used divided by the total available on Tianhe-2, and the corresponding percentage. “Qubits” and “Depth” describe the circuit analyzed, while “Elapsed time” shows the time required to compute one amplitude.

Node usage	Qubits	Depth	Elapsed time
4096/17920, 22%	$7 \times 7$	$(1+39+1)$	9 min
4096/17920, 22%	$7 \times 7$	$(1+40+1)$	31 min
4096/17920, 22%	$8 \times 8$	$(1+37+1)$	68 min
2048/17920, 11%	$9 \times 9$	$(1+31+1)$	22 min
1024/17920, 5%	$10 \times 10$	$(1+26+1)$	9 min

Equations

$$|\psi\rangle=\sum_{\sigma_{1},\dots,\sigma_{N}}\mathcal{F}\left({\mathbf{A}}_{1}^{\sigma_{1}}{\mathbf{A}}_{2}^{\sigma_{2}}\dots{\mathbf{A}}_{N}^{\sigma_{N}}\right)|\sigma_{1},\sigma_{2},\dots,\sigma_{N}\rangle,$$

$$\chi=\max\{\dim(l),\dim(r),\dim(u),\dim(d)\},$$

$$\left[{\mathbf{A}}_{n}^{\prime\,\tau_{n}}\right]_{l,r,u,d}=\sum_{\sigma_{n}}U_{\sigma_{n}}^{\tau_{n}}\left[{\mathbf{A}}_{n}^{\sigma_{n}}\right]_{l,r,u,d}.$$

$$\mathrm{SVD}\left(O_{\sigma_{n},\sigma_{m}}^{\tau_{n},\tau_{m}}\right)=\sum_{s}U_{\sigma_{n},s}^{\tau_{n}}V_{s,\sigma_{m}}^{\tau_{m}},$$

$$\left[{\mathbf{A}}_{n}^{\prime\,\tau_{n}}\right]_{l,r^{\prime},u,d}$$

$$\left[{\mathbf{A}}_{m}^{\prime\,\tau_{m}}\right]_{l^{\prime},r,u,d}$$

$$\langle\vec{\tau}|\psi\rangle=\mathcal{F}\left({\mathbf{E}}_{1}{\mathbf{E}}_{2}\dots{\mathbf{E}}_{N}\right),$$

$$\chi\leq 2^{\lceil d/8\rceil},$$

$$C^{s}(L_{v}\times L_{h}\times d)$$

$$C^{t}(L_{v}\times L_{h}\times d)$$

$$F_{1}^{d_{(1,1)},d_{(1,2)},\dots,d_{(1,L_{h})}}=\mathcal{F}\left(\left[{\mathbf{E}}_{(1,1)}\right]_{d_{(1,1)}}\dots\left[{\mathbf{E}}_{(1,L_{h})}\right]_{d_{(1,L_{h})}}\right),$$

$$G_{1}^{r_{(2,1)},d_{(2,1)},d_{(1,2)},\dots,d_{(1,L_{h})}}=\sum_{d_{(1,1)}}F_{1}^{d_{(1,1)},\dots,d_{(1,L_{h})}}\times\left[{\mathbf{E}}_{(2,1)}\right]_{r_{(2,1)}d_{(1,1)}d_{(2,1)}},$$

$$G_{2}^{r_{(2,2)},d_{(2,1)},d_{(2,2)},\dots,d_{(1,L_{h})}}=\sum_{r_{(2,1)},d_{(1,2)}}G_{1}^{r_{(2,1)},d_{(2,1)},d_{(1,2)},\dots,d_{(1,L_{h})}}\left[{\mathbf{E}}_{(2,2)}\right]_{r_{(2,1)}r_{(2,2)}d_{(1,2)}d_{(2,2)}},$$

$$F_{2}^{d_{(2,1)},d_{(2,2)},\dots,d_{(2,L_{h})}}=G_{L_{h}}^{r_{(2,L_{h})},d_{(2,1)},d_{(2,2)},\dots,d_{(2,L_{h})}}$$

$$\langle\vec{\tau}|\psi\rangle=F_{L}.$$

Full text

These authors contributed equally to this work.

General-purpose quantum circuit simulator with Projected Entangled-Pair States and the quantum supremacy frontier

Chu Guo

Henan Key Laboratory of Quantum Information and Cryptography, SSF IEU, Zhengzhou 450001, China

Yong Liu

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Min Xiong

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Shichuan Xue

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Xiang Fu

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Anqi Huang

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Xiaogang Qiang

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Ping Xu

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Junhua Liu

Information Systems Technology and Design, Singapore University of Technology and Design, 8 Somapah Road, 487372 Singapore

Quantum Intelligence Lab (QI-Lab), Supremacy Future Technologies (SFT), Guangzhou 511340, China

Shenggen Zheng

Center for Quantum Computing, Peng Cheng Laboratory, Shenzhen 518055, China

He-Liang Huang

Henan Key Laboratory of Quantum Information and Cryptography, SSF IEU, Zhengzhou 450001, China

Hefei National Laboratory for Physical Sciences at Microscale and Department of Modern Physics,

University of Science and Technology of China, Hefei, Anhui 230026, China

CAS Centre for Excellence and Synergetic Innovation Centre in Quantum Information and Quantum Physics,

University of Science and Technology of China, Hefei, Anhui 230026, China

Mingtang Deng

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Dario Poletti

[email protected]

Science and Math Cluster and EPD Pillar, Singapore University of Technology and Design, 8 Somapah Road, 487372 Singapore

Wan-Su Bao

[email protected]

Henan Key Laboratory of Quantum Information and Cryptography, SSF IEU, Zhengzhou 450001, China

CAS Centre for Excellence and Synergetic Innovation Centre in Quantum Information and Quantum Physics,

University of Science and Technology of China, Hefei, Anhui 230026, China

Junjie Wu

[email protected]

Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

Abstract

Recent advances in quantum computing hardware have pushed quantum computing to the verge of quantum supremacy. Here we bring together many-body quantum physics and quantum computing by using a method for strongly interacting two-dimensional systems, the Projected Entangled-Pair States, to realize an effective general-purpose simulator of quantum algorithms. The classical computing complexity of this simulator is directly related to the entanglement generation of the underlying quantum circuit rather than the number of qubits or gate operations. We apply our method to study random quantum circuits, which allows us to quantify precisely the memory usage and the time requirements of simulating them. We demonstrate our method by computing one amplitude for a $7\times 7$ lattice of qubits with depth $(1+40+1)$ on the Tianhe-2 supercomputer.

Quantum computers offer the promise of efficiently solving certain problems that are intractable for classical computers, most famously factorizing large numbers Feynman (1982); Shor (1994); Boixo et al. (2018). With the rapid progress of various quantum systems towards Noisy Intermediate-Scale Quantum computing devices Lund et al. (2017); Huang et al. (2017); Zhang et al. (2017); Huang et al. (2018); Wright et al. (2019); Kelly et al. (2019); Gong et al. (2019); Wang et al. (2018), we are now on the verge of quantum supremacy Preskill (2012), i.e. demonstrating that a quantum computer can perform a computation that no classical computer can tackle, an important milestone in the field of computer science. Various candidates have been suggested to demonstrate quantum supremacy, such as BosonSampling Aaronson and Arkhipov (2011); Wu et al. (2018), the instantaneous quantum polynomial protocol Shepherd and Bremner (2009); Bremner et al. (2010) and random quantum circuits (RQCs) Boixo et al. (2018); Bouland et al. (2018), which demand fewer physical resources and are easier to implement than, for instance, factorization.

A central aspect of all these near-term supremacy proof-of-principle computations is to produce a quantum state using as few qubits and quantum gate operations as possible, which would nevertheless be highly entangled and hence difficult to obtain and/or characterize by a classical computer, for instance by sampling from it in the computational basis. Meanwhile, it is important to find effective ways to accurately simulate quantum algorithms on classical computers, which could be used as a benchmarking baseline and to validate near-term quantum devices. In the field of quantum many-body physics, tensor network states are often used to efficiently represent quantum states with a sizeable amount of entanglement Eisert (2013); Orús (2014). The storage required by these tensor network states is closely related to the amount of entanglement of the quantum state. Recently, matrix product states (which are one-dimensional tensor networks) have been applied to simulate quantum circuits McCaskey et al. (2018). However, matrix product states are much less effective if the underlying quantum system is essentially two-dimensional. In this work, we present an efficient and generic quantum circuit simulator based on the Projected Entangled-Pair States (PEPS) Verstraete and Cirac (2004); Verstraete et al. (2006); Murg et al. (2007); Jordan et al. (2008); Gu et al. (2008); Jiang et al. (2008); Xie et al. (2009); Murg et al. (2009); Orús (2014), a type of tensor-network representation of quantum states designed for two-dimensional lattices. Our PEPS-based simulator is a general-purpose quantum circuit simulator for arbitrary quantum circuits: it stores the full quantum state and it can be readily used to compute single amplitudes, observables, and also perform sequences of quantum measurements.

While the quantum circuit simulator we present can tackle generic circuits, in the following we focus on RQCs. They consist of a series of single and two-qubit gates which are applied to different qubits in a particular order. A group of commuting gates, which can be applied simultaneously, constitutes one layer of the circuit, and the more groups of operations that do not commute, the deeper the circuit is. More precisely, for the depth of a circuit we will use the notation $(1+d+1)$, where the $1$s indicate the Hadamard gates applied to each site at the beginning and at the end of the calculations, while $d$ is the number of non-commuting layers including controlled-Z (CZ) gates and single-qubit gates applied to different sites. RQCs are the standard benchmark for quantum supremacy as put forth by Boixo et al. (2018). The general complexity of quantum supremacy experiments is studied in Aaronson and Chen (2016). For RQCs, it was previously shown in Boixo et al. (2017) that the complexity scales exponentially with $\min(O(dL_{h}),O(N))$.

RQCs have thus stimulated the search for efficient classical algorithms which would show where exactly the limits of classical simulations are Häner and Steiger (2017); Boixo et al. (2017); Chen et al. (2018); Bouland et al. (2018); Li et al. (2018); Pednault et al. (2017); Chen et al. (2019); Markov et al. (2018); Villalonga et al. (2018, 2019). State-of-the-art algorithms can be mainly divided into two categories. i) The state-vector approach, which stores the quantum state as a vector and evolves it directly. For example, in Häner and Steiger (2017) a 45-qubit simulation is reported based on this approach. However, this approach is limited by the number of qubits due to the exponential growth of the Hilbert space. ii) The tensor-based approach, which represents the quantum states as tensors and specifies the input and output states as rank-1 Kronecker projectors. This approach is less sensitive to the number of qubits and has been pursued more actively. For instance, a full amplitude simulation of a $7\times 7$ circuit to depth $(1+39+1)$ was implemented in $4.2$ hours on the Sunway TaihuLight supercomputer Li et al. (2018), which however exploits a weakness in the original design of RQCs in Boixo et al. (2018). Recently, it was proposed to trade circuit fidelity for computational efficiency so as to match the fidelity of a given quantum computer Markov et al. (2018); Villalonga et al. (2018), and practically compute around $1$ million amplitudes of a $7\times 7$ circuit to depth $(1+40+1)$ with $0.5\%$ circuit fidelity in $2.44$ hours on the Summit supercomputer Villalonga et al. (2019). Our approach differs from the above approaches in that we use PEPS as the data structure to represent the quantum states. Quantum gate operations as well as quantum projections are adapted accordingly to this new data structure.

Quantum Circuit Simulator Based on PEPS. In the following we consider a two-dimensional rectangular lattice of size $L_{v}\times L_{h}$ , where $L_{v}$ and $L_{h}$ are, respectively, the sizes in the vertical and horizontal directions. We use $N=L_{v}L_{h}$ to denote the total number of qubits. The quantum state on such a lattice can be represented as a PEPS Verstraete and Cirac (2004); Murg et al. (2007); Jordan et al. (2008)

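$$|\psi\rangle=\sum_{\sigma_{1},\dots,\sigma_{N}}\mathcal{F}\left({\mathbf{A}}_{1}^{\sigma_{1}}{\mathbf{A}}_{2}^{\sigma_{2}}\dots{\mathbf{A}}_{N}^{\sigma_{N}}\right)|\sigma_{1},\sigma_{2},\dots,\sigma_{N}\rangle, \tag{1}$$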

where ${\mathbf{A}}_{n}^{\sigma_{n}}$ is a rank- $5$ tensor with elements $\left[{\mathbf{A}}_{n}^{\sigma_{n}}\right]_{l,r,u,d}$ at site $n$ , with $\sigma=0,1$ corresponding to the physical dimension, and $l,r,u,d$ corresponding to the left, right, up and down auxiliary dimensions, see Fig. 1(a). The function $\mathcal{F}$ in Eq. (1) indicates the sum over the common auxiliary indices. The bond dimension $\chi$ is defined as the maximum size of the four auxiliary dimensions,

$$\chi=\max\{\dim(l),\dim(r),\dim(u),\dim(d)\}, \tag{2}$$

and it characterizes the size of the PEPS.
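The data structure described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the dictionary layout, the index order `(sigma, l, r, u, d)`, and the helper names `product_peps` and `bond_dimension` are our own assumptions, not taken from the authors' implementation.

```python
import numpy as np

def product_peps(Lv, Lh):
    """Return a PEPS for |0...0> as a dict {(row, col): tensor}.

    Each site tensor has index order (sigma, l, r, u, d); this ordering is an
    illustrative choice. All auxiliary dimensions start at 1.
    """
    peps = {}
    for i in range(Lv):
        for j in range(Lh):
            A = np.zeros((2, 1, 1, 1, 1), dtype=complex)
            A[0, 0, 0, 0, 0] = 1.0  # qubit in |0>, trivial bonds
            peps[(i, j)] = A
    return peps

def bond_dimension(peps):
    """Largest auxiliary dimension (l, r, u, d) found on any site tensor."""
    return max(max(A.shape[1:]) for A in peps.values())

peps = product_peps(3, 4)
chi = bond_dimension(peps)                      # 1 for a product state
n_params = sum(A.size for A in peps.values())   # 2 numbers per site here
```

For a product state the storage is linear in the number of qubits; it only grows once entangling gates enlarge the auxiliary dimensions.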

In the language of PEPS, a single-qubit gate operation $U^{\tau_{n}}_{\sigma_{n}}$ on site $n$ only operates locally on the $n$ -th tensor ${\mathbf{A}}_{n}^{\sigma_{n}}$ (shown in Fig. 1(b)), which can be written as

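$$\left[{\mathbf{A}}_{n}^{\prime\,\tau_{n}}\right]_{l,r,u,d}=\sum_{\sigma_{n}}U_{\sigma_{n}}^{\tau_{n}}\left[{\mathbf{A}}_{n}^{\sigma_{n}}\right]_{l,r,u,d}. \tag{3}$$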

As we can see from Eq. (3), the size of the local tensor is not affected by a single-qubit gate operation. For a two-qubit gate acting on a horizontally nearest-neighbour pair of qubits $(n,m)$ (shown in Fig. 1(c)), denoted as $O_{\sigma_{n},\sigma_{m}}^{\tau_{n},\tau_{m}}$, we first use a singular value decomposition (SVD) to factorize it into a product of two local tensors

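$$\mathrm{SVD}\left(O_{\sigma_{n},\sigma_{m}}^{\tau_{n},\tau_{m}}\right)=\sum_{s}U_{\sigma_{n},s}^{\tau_{n}}V_{s,\sigma_{m}}^{\tau_{m}}, \tag{4}$$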

where the singular values have been absorbed into $U$ . The size of the auxiliary dimension $s$ is denoted as $\chi_{o}$ , which, for any two-qubit controlled gate, is $\chi_{o}=2$ . The two local tensors $U$ and $V$ are then applied on the two qubits $n$ and $m$ separately, as single-qubit gate operations

[TABLE]

Here we have used the indices $r^{\prime}=(r,s)$, $l^{\prime}=(s,l)$, each of which bundles two tensor dimensions into one. As a result, $\chi$ increases by a factor of $\chi_{o}$. To keep $\chi$ at an affordable size, one would usually use a subsequent singular value decomposition to compress the resulting tensors by throwing away singular values below a suitably chosen threshold. However, we point out that for RQCs we cannot perform such a compression because the distribution of the singular values after the two-qubit gate operation is almost flat, making compression impossible (this is also an indication that this problem has large entanglement across the whole circuit). Calculating a single amplitude of the final state $|\psi\rangle$ is done by projecting $|\psi\rangle$ onto a separable PEPS which encodes one spin configuration $|\vec{\tau}\rangle$, and then contracting the resulting tensor network, which can be written as

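$$\langle\vec{\tau}|\psi\rangle=\mathcal{F}\left({\mathbf{E}}_{1}{\mathbf{E}}_{2}\dots{\mathbf{E}}_{N}\right), \tag{7}$$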

where the rank-$4$ tensor $\left[{\mathbf{E}}_{n}\right]_{l,r,u,d}=\left[{\mathbf{A}}_{n}^{\sigma_{n}=\tau_{n}}\right]_{l,r,u,d}$. This calculation is depicted in Fig. 1(d). We also note that with our method it is straightforward to simulate sequences of quantum measurements. Concretely, to measure an $N$-qubit system, we can first compute the probability that a qubit is in state $|0\rangle$ or $|1\rangle$. Then, we use another copy of the wavefunction (which is stored as a PEPS), project the measured qubit onto the relevant state, measure another qubit, and so forth. In between different measurements more gates can be applied too, all seamlessly because we can effectively and efficiently compute and store the wavefunction of the system.
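The gate rules and the projection step described in this section can be illustrated on the smallest possible case, a horizontal pair of qubits. The sketch below is an assumption-laden toy, not the authors' code: the function names, the index order `(sigma, l, r, u, d)`, and the choice to append the new index $s$ as the fastest-varying part of both bundled bonds (so that the two sides stay aligned) are ours.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
CZ = np.diag([1, 1, 1, -1]).astype(complex)

def apply_single(A, U):
    """Single-qubit gate: A'[t,l,r,u,d] = sum_s U[t,s] A[s,l,r,u,d]."""
    return np.einsum('ts,slrud->tlrud', U, A)

def split_two_qubit(O, tol=1e-12):
    """Factorize a 4x4 gate O[(tn,tm),(sn,sm)] as sum_k U[tn,sn,k] V[k,tm,sm]."""
    # Regroup so rows carry qubit n's indices and columns carry qubit m's.
    M = O.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(4, 4)
    W, S, Vh = np.linalg.svd(M)
    k = int(np.sum(S > tol))                  # chi_o: non-zero singular values
    U = (W[:, :k] * S[:k]).reshape(2, 2, k)   # singular values absorbed into U
    V = Vh[:k, :].reshape(k, 2, 2)
    return U, V

def apply_horizontal(An, Am, O):
    """Apply O to a horizontal pair; the shared bond grows by a factor chi_o."""
    U, V = split_two_qubit(O)
    k = U.shape[2]
    Bn = np.einsum('tsk,slrud->tlrkud', U, An)
    Bn = Bn.reshape(2, An.shape[1], An.shape[2] * k, An.shape[3], An.shape[4])
    Bm = np.einsum('kts,slrud->tlkrud', V, Am)
    Bm = Bm.reshape(2, Am.shape[1] * k, Am.shape[2], Am.shape[3], Am.shape[4])
    return Bn, Bm

def amplitude_pair(An, Am, tn, tm):
    """Project onto |tn, tm> and contract the single shared bond."""
    En, Em = An[tn], Am[tm]                   # rank-4 tensors (l, r, u, d)
    return np.einsum('b,b->', En[0, :, 0, 0], Em[:, 0, 0, 0])

A0 = np.zeros((2, 1, 1, 1, 1), dtype=complex)
A0[0, 0, 0, 0, 0] = 1.0
An, Am = apply_single(A0, H), apply_single(A0, H)
An, Am = apply_horizontal(An, Am, CZ)
amps = {(a, b): amplitude_pair(An, Am, a, b) for a in (0, 1) for b in (0, 1)}
chi_after_one_cz = An.shape[2]

An2, Am2 = apply_horizontal(An, Am, CZ)       # no compression is attempted
chi_after_two_cz = An2.shape[2]
```

Applying CZ a second time returns the state to $|{+}{+}\rangle$, yet without a compression step the bond still doubles again, which mirrors the bookkeeping used in the complexity analysis.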

Application to random quantum circuits and complexity analysis. In the following, we apply our PEPS simulator to study the two-dimensional RQCs of Git ; sup . The simulation of this circuit is divided into two parts: (i) circuit evolution and (ii) computing the overlap with randomly selected spin configurations, namely calculating the amplitudes. To quantify the size of the bond dimension required by the tensors, we realize that a single-qubit operation does not affect the size of the tensor it operates on, while a nearest-neighbour two-qubit controlled operation increases the sizes of the two tensors it operates on by a factor of $2$ as shown previously iSW . This results in

$$\chi\leq 2^{\lceil d/8\rceil}, \tag{8}$$

where $\lceil\dots\rceil$ is the ceiling function. The equality in Eq. (8) is reached if the depth $d$ is divisible by $8$ (each nearest-neighbour pair of sites is acted on by a CZ gate once every $8$ layers). As can be seen from Eqs. (3, 5, 6), the cost of each gate operation on PEPS scales as $O(\chi^{4})$, which is relatively cheap. As a result, circuit evolution can be performed very efficiently. In fact, we can simulate the exact evolution of a $12\times 12$ lattice to a depth $(1+40+1)$ within minutes on a personal laptop.
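As a quick numerical reading of the bound $\chi\leq 2^{\lceil d/8\rceil}$, the following sketch evaluates it for the depths that appear in this paper. The function names are hypothetical, and `gate_cost_scaling` only reports the $\chi^{4}$ scaling with the prefactor set to one.

```python
import math

def max_bond_dimension(d):
    """Upper bound on chi after d layers: 2 ** ceil(d / 8)."""
    return 2 ** math.ceil(d / 8)

def gate_cost_scaling(chi):
    """Per-gate cost up to a constant factor, using the O(chi^4) scaling."""
    return chi ** 4

bounds = {d: max_bond_dimension(d) for d in (26, 32, 40)}
cost_at_depth_40 = gate_cost_scaling(bounds[40])
```

Depths 26 and 32 give the same bound because both round up to four CZ applications per bond, whereas depth 40 adds a fifth.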

In contrast, a well-known result about PEPS is that exactly computing the overlap as in Eq. (S1) is an exponentially hard problem Schuch et al. (2007). While there exist approximate algorithms to evaluate Eq. (S1) which scale polynomially with $\chi$ Verstraete et al. (2006); Jiang et al. (2008); Gu et al. (2008), they are inadequate for RQCs due to the large entanglement of the states produced. In the following we ignore both the space and time complexity of circuit evolution and only focus on calculating one amplitude, since the cost of the former stage is negligible compared to the latter.

We have developed different strategies to evaluate Eq. (S1) efficiently, depending on the shape and size of the lattice. A generic strategy which works for any rectangular lattice has space and time complexities (assuming $L_{v}\geq L_{h}$ ) given by

[TABLE]

For square lattices, specialized tensor contraction strategies can be used to further reduce the complexity or for better parallelization (see sup for details of these strategies). We highlight here that Eqs. (S7,S8) are more accurate estimates for space and time complexities compared to the results of Boixo et al. (2017), and the exact value will depend on the details of the particular implementation on the hardware. However, these numbers can work as a theoretical approximate benchmarking baseline for achieving quantum supremacy.

To give more precise numbers, using Eqs. (S7,S8) we can evaluate that simulating an $8\times 8$ lattice to a depth $(1+40+1)$ (same space complexity as a $10\times 10$ circuit to a depth $(1+32+1)$) would require $32$ TB of memory, while simulating an $8\times l$ (with $l>8$) lattice to a depth $(1+40+1)$ would require about $0.5$ PB of memory. However, simulating a $9\times 9$ lattice to a depth $(1+40+1)$ would require $16$ PB (petabytes) of memory and simulating a $12\times 12$ lattice to a depth $(1+32+1)$ would require $8$ PB of memory, which are currently out of reach. Our circuit simulator can straightforwardly be extended to other types of two-dimensional lattices, including the Google Bristlecone QPU architecture. By applying a complexity analysis to this architecture, we find that it requires less than $0.6$ PB of memory, a manageable amount, to simulate an RQC with $72$ qubits at depth $(1+32+1)$ (for details of this analysis see sup).
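Two of the figures quoted above can be cross-checked with back-of-envelope arithmetic. The sketch below rests on assumptions that are ours rather than stated in the text: every bond saturates $\chi=2^{\lceil d/8\rceil}$, the dominant object is the single rank-$(L+1)$ tensor of the row-by-row strategy described in the supplementary section, each entry is a double-precision complex number (16 bytes), and 1 PB is read as $2^{50}$ bytes. It is not a substitute for the paper's complexity formulas.

```python
import math

BYTES_PER_ENTRY = 16   # assumption: complex128 entries
PIB = 2 ** 50          # assumption: 1 PB read as 2**50 bytes

def largest_tensor_bytes(rank, d):
    """Bytes for one rank-`rank` tensor whose legs all have the maximal chi."""
    chi = 2 ** math.ceil(d / 8)
    return BYTES_PER_ENTRY * chi ** rank

# 8 x l lattice (l > 8), depth (1+40+1): largest tensor has rank 8 + 1.
mem_8xl = largest_tensor_bytes(rank=8 + 1, d=40) / PIB

# 9 x 9 lattice, depth (1+40+1): largest tensor has rank 9 + 1.
mem_9x9 = largest_tensor_bytes(rank=9 + 1, d=40) / PIB
```

Under these assumptions the estimates come out at 0.5 PB and 16 PB, in line with the quoted values; the square-lattice figures depend on the specialized partitioning strategies and are not reproduced here.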

To demonstrate the performance of our method, we have implemented small-scale simulations on a personal computer, which takes less than $1$ hour to compute one amplitude of an $8\times 8$ circuit to a depth $(1+25+1)$ on a machine with $2$ cores of $2.8$ GHz frequency and $16$ GB memory. We computed $10000$ amplitudes and then plotted the frequency with which each configuration probability appears. This is represented in Fig. 2 by blue circles, while the red continuous line shows the Porter-Thomas distribution, which is what is expected theoretically.
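For readers unfamiliar with the Porter-Thomas distribution, the check can be illustrated without any circuit simulation. The sketch below replaces the RQC output by a normalized complex Gaussian random vector, which is our stand-in and not the paper's data. For such a state the rescaled probabilities $x=Np$ are approximately exponentially distributed, so the fraction with $x>1$ should be close to $e^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_qubits = 12
N = 2 ** n_qubits

# Stand-in for a highly entangled output state: a normalized Gaussian vector.
psi = rng.normal(size=N) + 1j * rng.normal(size=N)
psi /= np.linalg.norm(psi)

x = N * np.abs(psi) ** 2          # rescaled probabilities, x = N * p
mean_x = x.mean()                 # equals 1 by normalization
tail_frac = np.mean(x > 1.0)      # Porter-Thomas predicts about exp(-1)

# Empirical density on a few bins next to the prediction exp(-x).
hist, edges = np.histogram(x, bins=np.linspace(0, 4, 9), density=False)
empirical_density = hist / (N * np.diff(edges))
centers = 0.5 * (edges[:-1] + edges[1:])
predicted_density = np.exp(-centers)
```

Comparing `empirical_density` with `predicted_density` gives a coarse version of the comparison between simulated frequencies and the theoretical curve.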

Our PEPS-based method can be readily scaled up onto a massively parallel computing platform. We implemented the large-scale tensor contractions based on the open-source software package Cyclops Tensor Framework Solomonik et al. (2014). The massively parallel benchmarking was executed on the Tianhe-2 supercomputer Liao et al. (2014). We have simulated a $7\times 7$ circuit with depth $(1+40+1)$ and a $10\times 10$ circuit with depth $(1+26+1)$. The simulation of the $7\times 7\times(1+40+1)$ circuit was done on 4096 nodes (22%) of Tianhe-2, taking 31 minutes and 92.51 TB memory in total sup. Our large-scale simulation results are listed in Table 1.

Conclusions. In this work we have adapted the Projected Entangled-Pair States representation of quantum states from many-body quantum physics to build a general-purpose quantum circuit simulator. This simulator can be used to store effectively highly entangled wavefunctions, and it is readily adaptable to compute expectation values or simulate sequential quantum measurements. With this circuit simulator, we have computed an accurate estimate for the space and time complexity analysis of a standard random quantum circuit Git . Based on this analysis, we point out that simulating an $8\times l$ circuit to a depth $(1+40+1)$ or a Bristlecone- $72$ circuit to a depth $(1+32+1)$ are within reach of current supercomputing platforms.

We have implemented numerical experiments on a personal computer with an $8\times 8$ circuit to a depth $(1+25+1)$, and on the Tianhe-2 supercomputer with a $10\times 10$ circuit to a depth $(1+26+1)$, as well as a $7\times 7$ circuit to a depth $(1+40+1)$. Currently we compute the amplitudes exactly; however, we could also investigate the trade-off between fidelity and speed, so as to be able to sample many trajectories. For instance, we could reduce the memory requirement of our method by using the ‘cut’ technique in Villalonga et al. (2019), namely mapping a large tensor contraction into summations over many smaller tensor contractions by unraveling several for-loops. More importantly, PEPS-based techniques which are currently used in quantum many-body physics can be transferred to the study of quantum circuits, for example for contractions and the evaluation of expectation values Lubasch et al. (2014). These investigations, which could be particularly useful for circuits in which the wavefunction can be effectively compressed, are left for future work, together with the plan to include the effects of noise or errors in order to characterize more closely the actual behavior of a noisy intermediate-scale quantum computer.

Acknowledgements.

We gratefully acknowledge the help from China Greatwall Technology and National Supercomputing Center in Guangzhou. We thank Sergio Boixo and Giacomo Nannicini for helpful discussions. C. G. acknowledges support from National Natural Science Foundation of China under Grants No. 11504430 and No. 11805279. H.-L. H. acknowledges support from the Open Research Fund from State Key Laboratory of High Performance Computing of China (Grant No. 201901-01), National Natural Science Foundation of China under Grants No. 11905294, and China Postdoctoral Science Foundation. D.P. acknowledges support from the Singapore Ministry of Education, Singapore Academic Research Fund Tier-II (project MOE2016-T2-1-065). J.W. acknowledges support from National Natural Science Foundation of China under Grants No. 61632021.

I Introduction to Random Quantum Circuits

For an $L_{v}\times L_{h}$ qubit lattice, the Random Quantum Circuit (RQC) defined by Boixo et al. (2018) is described as follows:

1. Apply a Hadamard gate to each qubit to initialize the qubits to a symmetric superposition.

2. Apply controlled-phase (CZ) gates alternating between eight configurations similar to Fig. S1 to entangle neighbouring qubits.

3. Apply a randomly chosen gate (T, $\text{X}^{1/2}$ or $\text{Y}^{1/2}$) to each qubit on which a CZ gate has not just been applied, according to the rules in Boixo et al. (2018).

4. Repeat steps 2 and 3 to add layers of depth to the circuit.

5. Apply a final Hadamard gate to each qubit.

It has been proven that this random quantum circuit satisfies both average-case hardness and anti-concentration condition Bouland et al. (2018), and hence it cannot be efficiently simulated on a classical computer.

II Algorithm for Exact Computation of the Overlap

Depending on the shape of the lattice, we have developed three different strategies to evaluate the contraction of the tensor network, which are shown in Fig. S2.

In the following we first show a generic way to evaluate the overlap in the main text exactly, namely the equation

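$$\langle\vec{\tau}|\psi\rangle=\mathcal{F}\left({\mathbf{E}}_{1}{\mathbf{E}}_{2}\dots{\mathbf{E}}_{N}\right). \tag{S1}$$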

Assuming $L_{h}\leq L_{v}$ , we first contract all the tensors on the first row to get a rank- $L_{h}$ tensor

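$$F_{1}^{d_{(1,1)},d_{(1,2)},\dots,d_{(1,L_{h})}}=\mathcal{F}\left(\left[{\mathbf{E}}_{(1,1)}\right]_{d_{(1,1)}}\dots\left[{\mathbf{E}}_{(1,L_{h})}\right]_{d_{(1,L_{h})}}\right), \tag{S2}$$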

where the bottom legs $d_{(1,n)}$ of ${\mathbf{E}}_{(1,n)}$ ($1\leq n\leq L_{h}$) are written explicitly to indicate that they are not contracted in this step. Note also the notation for the position with two numbers instead of one, i.e. $(n,m)$ indicates the qubit on the $n$-th row and $m$-th column. Next we contract $F_{1}$ with the first tensor in the second row ${\mathbf{E}}_{(2,1)}$ and get

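$$G_{1}^{r_{(2,1)},d_{(2,1)},d_{(1,2)},\dots,d_{(1,L_{h})}}=\sum_{d_{(1,1)}}F_{1}^{d_{(1,1)},\dots,d_{(1,L_{h})}}\times\left[{\mathbf{E}}_{(2,1)}\right]_{r_{(2,1)}d_{(1,1)}d_{(2,1)}}, \tag{S3}$$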

where we have used the fact that for ${\mathbf{E}}_{(2,1)}$ one has the size $\dim(l_{(2,1)})=1$ and $u_{(2,1)}=d_{(1,1)}$. The resulting tensor $G_{1}$ is a rank-$(L_{h}+1)$ tensor. Then we contract $G_{1}$ with the second tensor in the second row ${\mathbf{E}}_{(2,2)}$ and get

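$$G_{2}^{r_{(2,2)},d_{(2,1)},d_{(2,2)},\dots,d_{(1,L_{h})}}=\sum_{r_{(2,1)},d_{(1,2)}}G_{1}^{r_{(2,1)},d_{(2,1)},d_{(1,2)},\dots,d_{(1,L_{h})}}\left[{\mathbf{E}}_{(2,2)}\right]_{r_{(2,1)}r_{(2,2)}d_{(1,2)}d_{(2,2)}}, \tag{S4}$$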

where we have used the fact that for ${\mathbf{E}}_{(2,2)}$ one has $l_{(2,2)}=r_{(2,1)}$ and $u_{(2,2)}=d_{(1,2)}$ , and the resulting tensor $G_{2}$ is again a rank- $(L_{h}+1)$ tensor. We can repeat this procedure and move on to the right until we have contracted all the tensors on the second row and get

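$$F_{2}^{d_{(2,1)},d_{(2,2)},\dots,d_{(2,L_{h})}}=G_{L_{h}}^{r_{(2,L_{h})},d_{(2,1)},d_{(2,2)},\dots,d_{(2,L_{h})}} \tag{S5}$$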

where we have used the fact that $\dim(r_{(2,L_{h})})=1$ and redefined $G_{L_{h}}$ as $F_{2}$. Since $F_{2}$ has the same structure as $F_{1}$, we repeat this procedure until we have reached the last row and get $F_{L}$, which is a scalar since $\dim(d_{(L_{v},n)})=1$ for all $1\leq n\leq L_{h}$. Thus we get

$$\langle\vec{\tau}|\psi\rangle=F_{L}. \tag{S6}$$

From this analysis it appears that the largest tensor involved in this procedure is rank- $(L_{h}+1)$ . Moreover, for $L_{h}>L_{v}$ , instead of moving from top down, it is straightforward to slightly modify the algorithm to move from left to right, and the largest tensor involved would become rank- $(L_{v}+1)$ . Therefore the memory required scales exponentially with the exponent $\min(L_{h}+1,L_{v}+1)$ .
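The row-by-row procedure above can be sketched with NumPy on random projected tensors. This is an illustrative toy under our own conventions (index order `(l, r, u, d)`, helper names `random_network` and `contract_rows`), not the authors' parallel implementation. The running tensor always carries one horizontal leg plus $L_{h}$ vertical legs, matching the rank-$(L_{h}+1)$ statement.

```python
import numpy as np

def random_network(Lv, Lh, chi, rng):
    """Random rank-4 tensors E[(i, j)] with legs (l, r, u, d); boundary legs have size 1."""
    E = {}
    for i in range(Lv):
        for j in range(Lh):
            shape = (1 if j == 0 else chi, 1 if j == Lh - 1 else chi,
                     1 if i == 0 else chi, 1 if i == Lv - 1 else chi)
            E[(i, j)] = rng.normal(size=shape) + 1j * rng.normal(size=shape)
    return E

def contract_rows(E, Lv, Lh):
    """Contract the projected network from top to bottom, one site at a time."""
    F = np.ones((1,) * Lh, dtype=complex)      # trivial boundary above row 0
    for i in range(Lv):
        G = F.reshape((1,) + F.shape)          # prepend the horizontal leg
        for j in range(Lh):
            # Contract horizontal leg with l, and column j's leg with u.
            G = np.tensordot(G, E[(i, j)], axes=([0, j + 1], [0, 2]))
            # New legs (r, d) are last; put r first and d back at column j.
            G = np.moveaxis(G, [-2, -1], [0, j + 1])
        F = G.reshape(G.shape[1:])             # last r leg has size 1
    return F.item()

rng = np.random.default_rng(1)

# Check 1: 2 x 2 lattice against an explicit sum over the four bonds.
E = random_network(2, 2, 3, rng)
brute = np.einsum('xy,xz,wy,wz->', E[(0, 0)][0, :, 0, :], E[(0, 1)][:, 0, 0, :],
                  E[(1, 0)][0, :, :, 0], E[(1, 1)][:, 0, :, 0])
amp_2x2 = contract_rows(E, 2, 2)

# Check 2: 3 x 4 lattice; contracting the transposed network must agree.
E34 = random_network(3, 4, 2, rng)
ET = {(j, i): T.transpose(2, 3, 0, 1) for (i, j), T in E34.items()}
amp_rows = contract_rows(E34, 3, 4)
amp_cols = contract_rows(ET, 4, 3)
```

Contracting the transposed network corresponds to sweeping left to right instead of top to bottom, which is the variant mentioned for $L_{h}>L_{v}$.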

This generic strategy is shown in Fig. S2(a), where the tensor network is contracted row by row (ideal for a thin lattice where, for instance, $L_{v}>L_{h}$). Mathematically, this scheme corresponds to Eqs.(S2-S6). The largest tensor involved in this process is rank-$(L+1)$, where we have defined $L=\min(L_{h},L_{v})$. Assuming a memory-efficient implementation of tensor contraction, one would only require a single tensor of such size since the operand tensor could be overwritten. Meanwhile, the most time-consuming step is Eq.(S4), in which one contracts two legs of a rank-$(L+1)$ tensor with two legs of a rank-$4$ tensor, a process which is repeated $(L_{h}-2)(L_{v}-2)$ times. Thus with the contraction scheme in Fig. S2(a), the space and time complexity are

[TABLE]

Note that these are very accurate evaluations with a clear prefactor and not just order-of-magnitude estimates, although the complexities can be reduced by using advanced matrix-matrix multiplication schemes and by parallelizing the operation.

For the special case of a square lattice with $L_{h}=L_{v}=\sqrt{N}$, it is possible to improve the performance via a particular partitioning of the sum, as shown in Fig. S2. The partitioning strategies for networks with even or odd side lengths are different, as shown in Fig. S2(b,c) respectively. For a network with even side lengths, the tensors are first divided into four parts, as Fig. S2(b) illustrates. We start the contraction of the tensors from the upper-left partition, obtaining a rank-$\sqrt{N}$ tensor which we refer to as $F_{ul}$. Similarly, the other three partitions produce another three rank-$\sqrt{N}$ tensors, denoted as $F_{ur}$ (upper-right), $F_{bl}$ (bottom-left) and $F_{br}$ (bottom-right). Then, we contract $F_{ul}$ with $F_{ur}$, and $F_{bl}$ with $F_{br}$. Finally, by contracting the remaining two tensors together, we get the amplitude value. As a result, the complexity of this strategy is

[TABLE]

The algorithm for a network with odd side lengths ( $L_{h}=L_{v}=2m+1$ ) is somewhat more complicated. The tensors are partitioned into 4 groups, as shown in Fig. S2(c). The contraction starts from the upper-left $(m+1)\times m$ partition, producing a rank- $\sqrt{N}$ tensor denoted as $F_{ul}$ . Then we move to the other three parts and contract them into $F_{ur},F_{bl},F_{br}$ in the same way as $F_{ul}$ . The contraction of $F_{ur}$ can again be divided into 4 sub-procedures, which are indicated in Fig. S2(c) by the gray dashed lines that break the lattice into 3 small groups. The sub-procedures are: (1) Contracting the right $(m+1)\times m$ tensors into a rank- $\sqrt{N}$ tensor; (2) Contracting the first $m$ tensors at the $m+1$ -th column into a rank- $\sqrt{N}$ tensor; (3) Contracting the two rank- $\sqrt{N}$ tensors from procedures (1) and (2) into a rank- $(\sqrt{N}+1)$ tensor; (4) Contracting the obtained rank- $(\sqrt{N}+1)$ tensor with the rank-4 tensor located in the center of the lattice (which is also the bottom-left corner of $F_{ur}$ ), resulting in a rank- $(\sqrt{N}+1)$ tensor. Then, by contracting the four parts together, we get the probability amplitude. As a result, the complexity of this strategy is

[TABLE]

In Fig. S3(a) we show the space and time complexities for $8\times l$ circuits with depth $d=1+40+1$ (or $10\times l$ circuits with depth $d=1+32+1$ ), showing that they are within reach of state-of-the-art supercomputers. This shows clearly where the frontier for quantum supremacy stands for this random quantum circuit and for our method. In Fig. S3(b) we show the space and time complexities computed from Eqs.(S9-S12). We also note that our algorithm can be straightforwardly combined with the fast sampling method in Villalonga et al. (2019); Liu et al. (2019) to measure a large number of amplitudes. Following the partitioning strategy, one can sample in one partition with negligible additional cost, since the results of the other regions can be reused.

III Complexity Analysis of Google Bristlecone QPU

To simulate the Google Bristlecone QPU with PEPS, both the representation of the quantum state and the gate operations are implemented in exactly the same way as for the rectangular lattice case. The only difference is that during the measurement stage, the tensor network that needs to be contracted is rotated by $45$ degrees compared to a rectangular lattice. In Fig. S4 we show a contraction strategy for the simulation of a Google Bristlecone QPU. From Fig. S4 we can see that the number of legs of a tensor is at most $11$ , and hence the space cost for simulating this circuit to a depth $(1+32+1)$ with our circuit simulator scales as $2^{32/8\times 11+1}=2^{45}$ , which corresponds to less than $0.6$ PB memory.
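The arithmetic behind this estimate can be checked with a few lines of Python. This is only a sketch: it assumes that $2^{45}$ counts tensor elements and that each element is a double-precision complex number of 16 bytes, an assumption that is consistent with the quoted figure of less than $0.6$ PB. The names `bristlecone_elements` and `BYTES_PER_AMPLITUDE` are illustrative.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: one double-precision complex number


def bristlecone_elements(inner_depth=32, max_legs=11):
    """Element count 2^(inner_depth/8 * max_legs + 1) from the estimate above."""
    log2_bond = inner_depth // 8  # each leg carries dimension 2^(d/8)
    return 2 ** (log2_bond * max_legs + 1)


elements = bristlecone_elements()
memory_bytes = elements * BYTES_PER_AMPLITUDE
memory_pb = memory_bytes / 1e15  # decimal petabytes

print(f"elements = 2^{elements.bit_length() - 1}")
print(f"memory   = {memory_pb:.3f} PB")
```

Under these assumptions the largest tensor occupies $2^{49}$ bytes, roughly $0.56$ PB.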

IV Massive Parallel Benchmarking on Supercomputer

We have implemented our large-scale tensor contraction algorithms based on the open-source software package Cyclops Tensor Framework Solomonik et al. (2014), with MPI and OpenMP as the parallel interfaces. The massively parallel benchmarking was then executed on the Tianhe-2 supercomputer. According to the features of the supercomputer platform and the results of the scaling test, we chose to use one MPI process with 24 OpenMP threads on each node. Each normal node contains two 12-core CPUs and is equipped with 64 GB of memory (128 GB on each fat node). The maximum number of nodes used is 4,096 (98,304 compute cores in total), which is less than 1/4 of the whole system, and since we only use CPUs, the peak performance we use is $\sim 1.73$ PFlops. All our calculations are done with double-precision numbers. Our results are listed in Table 1 of the main manuscript.

The numerical simulation with the largest number of qubits is a $10\times 10$ circuit with $d=(1+26+1)$ , which was run on 1,024 normal nodes and takes 6 minutes to measure one amplitude, using the partitioning strategy in Fig. S2(c). The numerical simulation with the largest depth is a $7\times 7$ circuit with $d=(1+40+1)$ , which was run on $4,096$ fat nodes and takes 31 minutes. On each fat node 23.13 GB of memory is used, and thus this simulation takes 92.51 TB of memory in total (detailed data can be found in the supplementary information). To improve efficiency, parts of the data are duplicated on several computing nodes to reduce the cost of data communication, leading to a larger memory usage than the theoretical prediction of $16$ TB. Here we note that recently in Villalonga et al. (2019) the authors compute, with $0.5\%$ fidelity, $10^{6}$ amplitudes for a $7\times 7\times(1+40+1)$ random quantum circuit with single-precision numbers on Summit in 2.4 hours, using 2.67 PB memory and ${\text{R}}_{\text{Node-peak}}=200.8$ PFlops. Their optimized implementation, when mapped to unit fidelity, is currently faster than our proof-of-principle calculation.
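The memory figures quoted above can be cross-checked with a short Python sketch. It rests on assumptions that are not stated explicitly here: binary units (1 TB = 1024 GB), 16 bytes per double-precision complex amplitude, and a largest tensor of rank $L+1$ whose legs each have dimension $2^{d/8}$ , which is the pattern used in the Bristlecone estimate. The helper `largest_tensor_bytes` is an illustrative name.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: one double-precision complex number

nodes = 4096
cores_per_node = 24          # two 12-core CPUs per node
measured_gb_per_node = 23.13

total_cores = nodes * cores_per_node
measured_total_tb = nodes * measured_gb_per_node / 1024  # binary units assumed


def largest_tensor_bytes(side, inner_depth):
    """Bytes of one rank-(side + 1) tensor with legs of dimension 2^(d/8)."""
    log2_bond = inner_depth // 8
    return 2 ** ((side + 1) * log2_bond) * BYTES_PER_AMPLITUDE


predicted_tb = largest_tensor_bytes(side=7, inner_depth=40) / 2**40

print(f"cores          = {total_cores}")
print(f"measured total = {measured_total_tb:.2f} TB")
print(f"predicted      = {predicted_tb:.0f} TB")
```

Under these assumptions the prediction for the $7\times 7$ circuit is $2^{44}$ bytes, i.e. $16$ TB, and the per-node figure multiplies out to about $92.5$ TB, in line with the numbers in the text up to rounding of the per-node value.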

References

Feynman (1982)

R. P. Feynman, International journal of theoretical physics 21, 467 (1982).

Shor (1994)

P. W. Shor, in Proceedings 35th annual symposium on foundations of computer science (IEEE, 1994), pp. 124–134.

Boixo et al. (2018)

S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, M. J. Bremner, J. M. Martinis, and H. Neven, Nature Physics 14, 595 (2018).

Lund et al. (2017)

A. Lund, M. J. Bremner, and T. Ralph, npj Quantum Information 3, 15 (2017).

Huang et al. (2017)

H.-L. Huang, Q. Zhao, X. Ma, C. Liu, Z.-E. Su, X.-L. Wang, L. Li, N.-L. Liu, B. C. Sanders, C.-Y. Lu, et al., Phys. Rev. Lett. 119, 050503 (2017).

Zhang et al. (2017)

J. Zhang, G. Pagano, P. W. Hess, A. Kyprianidis, P. Becker, H. Kaplan, A. V. Gorshkov, Z.-X. Gong, and C. Monroe, Nature 551, 601 (2017).

Huang et al. (2018)

H.-L. Huang, X.-L. Wang, P. P. Rohde, Y.-H. Luo, Y.-W. Zhao, C. Liu, L. Li, N.-L. Liu, C.-Y. Lu, and J.-W. Pan, Optica 5, 193 (2018).

Wright et al. (2019)

K. Wright, K. Beck, S. Debnath, J. Amini, Y. Nam, N. Grzesiak, J.-S. Chen, N. Pisenti, M. Chmielewski, C. Collins, et al., arXiv:1903.08181 (2019).

Kelly et al. (2019)

J. Kelly, Z. Chen, B. Chiaro, B. Foxen, J. Martinis, and Q. H. T. Team, in APS Meeting Abstracts (2019).

Gong et al. (2019)

M. Gong, M.-C. Chen, Y. Zheng, S. Wang, C. Zha, H. Deng, Z. Yan, H. Rong, Y. Wu, S. Li, et al., Phys. Rev. Lett. 122, 110501 (2019).

Wang et al. (2018)

X.-L. Wang, Y.-H. Luo, H.-L. Huang, M.-C. Chen, Z.-E. Su, C. Liu, C. Chen, W. Li, Y.-Q. Fang, X. Jiang, et al., Phys. Rev. Lett. 120, 260502 (2018).

Preskill (2012)

J. Preskill, arXiv:1203.5813 (2012).

Aaronson and Arkhipov (2011)

S. Aaronson and A. Arkhipov, in Proceedings of the forty-third annual ACM symposium on Theory of computing (ACM, 2011), pp. 333–342.

Wu et al. (2018)

J. Wu, Y. Liu, B. Zhang, X. Jin, Y. Wang, H. Wang, and X. Yang, National Science Review 5, 715 (2018).

Shepherd and Bremner (2009)

D. Shepherd and M. J. Bremner, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 465, 1413 (2009).

Bremner et al. (2010)

M. J. Bremner, R. Jozsa, and D. J. Shepherd, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 467, 459 (2010).

Bouland et al. (2018)

A. Bouland, B. Fefferman, C. Nirkhe, and U. Vazirani, arXiv:1803.04402 (2018).

Eisert (2013)

J. Eisert, Modeling and Simulation 3, 520 (2013).

Orús (2014)

R. Orús, Annals of Physics 349, 117 (2014).

McCaskey et al. (2018)

A. McCaskey, E. Dumitrescu, M. Chen, D. Lyakh, and T. Humble, PLoS ONE 13, 12 (2018).

Verstraete and Cirac (2004)

F. Verstraete and J. I. Cirac, arXiv:cond-mat/0407066 (2004).

Verstraete et al. (2006)

F. Verstraete, M. M. Wolf, D. Perez-Garcia, and J. I. Cirac, Phys. Rev. Lett. 96, 220601 (2006).

Murg et al. (2007)

V. Murg, F. Verstraete, and J. I. Cirac, Phys. Rev. A 75, 033605 (2007).

Jordan et al. (2008)

J. Jordan, R. Orús, G. Vidal, F. Verstraete, and J. I. Cirac, Phys. Rev. Lett. 101, 250602 (2008).

Gu et al. (2008)

Z.-C. Gu, M. Levin, and X.-G. Wen, Phys. Rev. B 78, 205116 (2008).

Jiang et al. (2008)

H.-C. Jiang, Z.-Y. Weng, and T. Xiang, Phys. Rev. Lett. 101, 090603 (2008).

Xie et al. (2009)

Z.-Y. Xie, H.-C. Jiang, Q. N. Chen, Z.-Y. Weng, and T. Xiang, Phys. Rev. Lett. 103, 160601 (2009).

Murg et al. (2009)

V. Murg, F. Verstraete, and J. I. Cirac, Phys. Rev. B 79, 195119 (2009).

Aaronson and Chen (2016)

S. Aaronson and L. Chen, arXiv:1612.05903 (2016).

Boixo et al. (2017)

S. Boixo, S. V. Isakov, V. N. Smelyanskiy, and H. Neven, arXiv:1712.05384 (2017).

Häner and Steiger (2017)

T. Häner and D. S. Steiger, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (ACM, 2017), p. 33.

Chen et al. (2018)

Z.-Y. Chen, Q. Zhou, C. Xue, X. Yang, G.-C. Guo, and G.-P. Guo, Science Bulletin 63, 964 (2018).

Li et al. (2018)

R. Li, B. Wu, M. Ying, X. Sun, and G. Yang, arXiv:1804.04797 (2018).

Pednault et al. (2017)

E. Pednault, J. A. Gunnels, G. Nannicini, L. Horesh, T. Magerlein, E. Solomonik, and R. Wisnieff, arXiv:1710.05867 (2017).

Chen et al. (2019)

M.-C. Chen, R. Li, L. Gan, X. Zhu, G. Yang, C.-Y. Lu, and J.-W. Pan, arXiv:1901.05003 (2019).

Markov et al. (2018)

I. L. Markov, A. Fatima, S. V. Isakov, and S. Boixo, arXiv:1807.10749 (2018).

Villalonga et al. (2018)

B. Villalonga, S. Boixo, B. Nelson, C. Henze, E. Rieffel, R. Biswas, and S. Mandrà, arXiv:1811.09599 (2018).

Villalonga et al. (2019)

B. Villalonga, D. Lyakh, S. Boixo, H. Neven, T. S. Humble, R. Biswas, E. G. Rieffel, A. Ho, and S. Mandrà, arXiv:1905.00444 (2019).

(39)

Available on GitHub at https://github.com/sboixo/GRCS.

(40)

See supplementary material for the details on the description of RQCs, the algorithms for computing an amplitude, and the massive parallel benchmarking on Tianhe-2 supercomputer.

(41)

Schuch et al. (2007)

N. Schuch, M. M. Wolf, F. Verstraete, and J. I. Cirac, Phys. Rev. Lett. 98, 140506 (2007).

Solomonik et al. (2014)

E. Solomonik, D. Matthews, J. R. Hammond, J. F. Stanton, and J. Demmel, Journal of Parallel and Distributed Computing 74, 3176 (2014).

Liao et al. (2014)

X. Liao, L. Xiao, C. Yang, and Y. Lu, Frontiers of Computer Science 8, 345 (2014).

Lubasch et al. (2014)

M. Lubasch, J. I. Cirac, and M.-C. Bañuls, New Journal of Physics 16, 033014 (2014).

Liu et al. (2019)

Y. Liu, M. Xiong, C. Wu, D. Wang, Y. Liu, J. Ding, A. Huang, X. Fu, X. Qiang, P. Xu, et al., arXiv:1907.08077 (2019).
