Numerical Assessment for Accuracy and GPU Acceleration of TD-DMRG Time Evolution Schemes
Weitang Li, Jiajun Ren, Zhigang Shuai

TL;DR
This paper evaluates the accuracy of three TD-DMRG time evolution schemes and demonstrates significant GPU acceleration, enhancing the efficiency of quantum dynamics simulations for complex systems.
Contribution
It provides a comparative analysis of three TD-DMRG schemes and introduces GPU acceleration to significantly speed up tensor contractions in these methods.
Findings
TDVP-PS allows larger time steps than TDVP-MU.
TDVP-MU and TDVP-PS are more accurate than P&C-RK4.
GPU acceleration speeds up TDVP schemes by up to 73 times.
Abstract
Time dependent density matrix renormalization group (TD-DMRG) has become one of the cutting edge methods of quantum dynamics for complex systems. In this paper, we comparatively study the accuracy of three time evolution schemes in TD-DMRG, the global propagation and compression method with Runge-Kutta algorithm (P&C-RK), the time dependent variational principle based methods with matrix unfolding algorithm (TDVPMU) and with projector-splitting algorithm (TDVP-PS), by performing benchmarks on the exciton dynamics of Fenna-Matthews-Olson (FMO) complex. We show that TDVP-MU and TDVP-PS yield the same result when the time step size is converged and they are more accurate than P&C-RK4, while TDVP-PS tolerates a larger time step size than TDVP-MU. We further adopt the graphical processing units (GPU) to accelerate the heavy tensor contractions in TD-DMRG and it is able to speed up the…
| Sub-step 111See the text in Section II for the definitions and labels of the sub-steps. | P&C-RK4 | TDVP-MU | TDVP-PS |
| — | — | ||
| QR | — | ||
| MatMul-QR | — | ||
| SVD | — | ||
| MatMul-SVD | — | ||
| Get Env | — | ||
| Deriv | — | (forward) (backward) | |
| Overall | (VMF) (CMF) |
| error () | ||
| 32 | ||
| 64 | ||
| 128 | ||
| 192 | ||
| error () | ||
| 320 | ||
| 160 | ||
| 80 | ||
| 40 | ||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Numerical Assessment for Accuracy and GPU Acceleration of TD-DMRG Time Evolution Schemes
Weitang Li
MOE Key Laboratory of Organic OptoElectronics and Molecular Engineering, Department of Chemistry, Tsinghua University, Beijing 100084, People’s Republic of China
Jiajun Ren
MOE Key Laboratory of Organic OptoElectronics and Molecular Engineering, Department of Chemistry, Tsinghua University, Beijing 100084, People’s Republic of China
Zhigang Shuai
MOE Key Laboratory of Organic OptoElectronics and Molecular Engineering, Department of Chemistry, Tsinghua University, Beijing 100084, People’s Republic of China
Abstract
Time dependent density matrix renormalization group (TD-DMRG) has become one of the cutting edge methods of quantum dynamics for complex systems. In this paper, we comparatively study the accuracy of three time evolution schemes in TD-DMRG, the global propagation and compression method with Runge-Kutta algorithm (P&C-RK), the time dependent variational principle based methods with matrix unfolding algorithm (TDVP-MU) and with projector-splitting algorithm (TDVP-PS), by performing benchmarks on the exciton dynamics of Fenna-Matthews-Olson (FMO) complex. We show that TDVP-MU and TDVP-PS yield the same result when the time step size is converged and they are more accurate than P&C-RK4, while TDVP-PS tolerates a larger time step size than TDVP-MU. We further adopt the graphical processing units (GPU) to accelerate the heavy tensor contractions in TD-DMRG and it is able to speed up the TDVP-MU and TDVP-PS schemes by up to 73 times.
††preprint: AIP/123-QED
I Introduction
Time dependent density matrix renormalization group (TD-DMRG) has emerged as a powerful tool to deal with many-body chemical and physical problems Ma, Luo, and Yao (2018); Paeckel et al. (2019). Although DMRG is initially designed to solve the ground state of one dimensional strongly correlated systems White (1992, 1993), its applications are successfully extended to dynamical properties both in the time and frequency domains, such as linear and nonlinear optical response of polyenes Shuai et al. (1998), polaron formation and diffusion Zhang, Jeckelmann, and White (1999); Yao (2017); Barford and Mannouch (2018), interconversion dynamics of pyrazine Greene and Batista (2017); Baiardi and Reiher (2019); Xie et al. (2019), ab initio electron dynamics in hydrogen chain Ronca et al. (2017), exciton dissociationYao et al. (2018), spectra of molecular aggregates Ren, Shuai, and Kin-Lic Chan (2018) and many other topics Chin et al. (2013); Schröder et al. (2019); Borrelli and Gelin (2017); Kloss and Lev (2019); Kloss, Reichman, and Tempelaar (2019); Frahm and Pfannkuche (2019). One of the key components in TD-DMRG is the time evolution scheme, which is essential to the numerical accuracy and efficiency. The available schemes could be roughly classified into three groups. The first group is based on globally approximating the formal propagator or , including time-evolving block decimation (TEBD) Vidal (2004); White and Feiguin (2004); Daley et al. (2004), WI,II method Zaletel et al. (2015), Runge-Kutta García-Ripoll (2006); Ren, Shuai, and Kin-Lic Chan (2018), Chebyshev expansion Halimeh, Kolley, and McCulloch (2015), Krylov subspace García-Ripoll (2006); Wall and Carr (2012) methods and split operator method on the grid basis Greene and Batista (2017). The same feature shared in these schemes is that in each time step the wavefunction is firstly propagated as a whole globally and then compressed. The second group is more inspired by the original DMRG, which is formulated in the local renormalized space and the basis is adapted by the averaged reduced density matrix. The representatives are time step targeting method (TST) Feiguin and White (2005) and some related variants Dutta and Ramasesha (2010); Ronca et al. (2017). The third group is based on the time dependent variational principle (TDVP) Dirac (1930). Depending on the different ways to derive the equations of motion (EOMs), this group includes original method with fixed gauge freedom Haegeman et al. (2011) and the more recent projector-splitting method (PS) from a tangent space view Haegeman et al. (2016). Among the above evolution schemes, all schemes are suited to long-range interactions except TEBD. In addition, the global evolution scheme is the most straightforward one when the modern framework of matrix product state / matrix product operator (MPS/MPO) is investigated, while the PS scheme seems to have become the most popular choice as it has been widely employed in the recent articles Borrelli and Gelin (2017); Schröder et al. (2019); Baiardi and Reiher (2019); Kloss and Lev (2019); Kloss, Reichman, and Tempelaar (2019); Kurashige (2018); Xie et al. (2019) and implemented in a number of TD-DMRG packages Hauschild and Pollmann (2018); Mendl (2018). Although many evolution schemes have been applied extensively to the simulation of quantum dynamics in both chemistry and physics, a pragmatic analysis of their relative accuracy and efficiency has not been reported yet.
High performance computing for DMRG algorithms has attracted much interest in recent years, including parallelization strategies with message passing interface (MPI) Chan (2004), open multi-processing (OpenMP) and hybrid MPI/OpenMPKurashige and Yanai (2009) on multiple central processing units (CPUs). Meanwhile, the application of graphical processing units (GPUs) in computational chemistry has also drawn much attention in the past decade, due to the tardy improvement of CPUs, and the more and more vibrant GPU software ecosystem, including electronic structure calculation Song and Martínez (2017); Kaliman and Krylov (2017); Shee et al. (2018); Liu et al. (2019), classical/ab initio molecular dynamics Penfold (2017); Lee et al. (2018), and open system quantum dynamics Kreisbeck et al. (2011); Tsuchimoto and Tanimura (2015). To the best of our knowledge, the only attempt to employ GPUs in DMRG was made by Nemes et al. in 2014 Nemes et al. (2014). They came up with a smart implementation exploiting both CPU and GPU which speeds up the Davidson diagonalization part in DMRG algorithm by 2 to 5 times. Despite their efforts, how GPUs can accelerate TD-DMRG algorithms and which evolution scheme is more suitable remain unclear.
To benchmark the performance of different TD-DMRG time evolution schemes, a proper model should be chosen. In this work, we focus on the vibronic coupling system represented by the Fenna-Matthews-Olson (FMO) complex from green sulfur bacteria. It is an extremely popular system for both experimental and theoretical studies of energy transfer in photosynthesis Engel et al. (2007); Cheng and Fleming (2009). The FMO model has become a “guinea pig” for comparing different computational methods Borrelli and Gelin (2017) and its exciton dynamics has been studied by many numerically exact methods, including quasi-adiabatic propagator path integral (QUAPI) Nalbach, Braun, and Thorwart (2011), hierarchical equations of motion (HEOM) Ishizaki and Fleming (2009); Shi et al. (2018), multi-layer multi-configuration time-dependent Hartree (ML-MCTDH) Schulze and Kuhn (2015); Schulze et al. (2016), TD-DMRG Chin et al. (2013); Borrelli and Gelin (2017), etc.
In this paper, we select three time evolution schemes in TD-DMRG to simulate the exciton dynamics of 7-site FMO model, including global propagation and compression scheme with classical 4th order Runge-Kutta algorithm (P&C-RK4), TDVP with advanced matrix unfolding regularization scheme (TDVP-MU) and TDVP with projector-splitting algorithm (TDVP-PS). We firstly study the relative accuracy of the three schemes on the FMO model and then set out to explore the CPU-GPU heterogeneous computing to accelerate the TD-DMRG algorithms.
II Methodological Approaches
II.1 MPS Representation and TD-DMRG Algorithms
In the language of matrix product states (MPS) Schollwöck (2011), a quantum state under certain basis , where is the basis for each degree of freedom (DOF) and is assumed to be orthonormal, can be represented as the product of a matrix chain, known as an MPS:
[TABLE]
are matrices in the chain connected by indices . in the summation represents the contraction of the respective connected indices, and is the total number of DOFs in the system. The dimension of is called (virtual) bond dimension denoted as or , while the dimension of is called physical bond dimension denoted as . The graphical representation of an MPS is shown in Fig. 1(a). The many-body renormalized basis is defined as:
[TABLE]
where and represent that the renormalized bases are defined from the left and right side respectively. Hence, could be regarded as the coefficient matrix on the space spanned by . However, in general, the renormalized basis is not necessary to be orthonormal, and the overlap matrix between them is defined as:
[TABLE]
The matrix product representation for a wavefunction is not unique in that inserting an identity matrix into the neighboring matrices will obtain the same wavefunction but with different local matrices . Therefore, gauge conditions could be applied to eliminate the parameterization redundancy of an MPS. Among them, “mixed/left/right-canonical” gauge condition is usually adopted for convenience. A “mixed-canonical” MPS with gauge center at site is written as:
[TABLE]
where and satisfy:
[TABLE]
here represents the conjugate of .
With Eq. 7 and 8, the renormalized basis fulfill the orthonormal relation () and (). When the gauge center is at the right(left)-most site, the MPS is called left(right)-canonical state. With the ‘mixed/left/right-canonical” condition, the only redundancy left is that G is unitary. If we further decompose by QR decomposition
[TABLE]
fulfills the relation in Eq. 7 and is also a coefficient matrix defined between site and on space . Afterwards, is combined with to obtain and apparently the gauge center is moved one site to the right. The reverse process to move the gauge center to the left could be carried out in a similar way as Eq. 9 while QR is replaced with RQ decomposition. More generally, a canonical MPS with gauge center at as Eq. 6 could be prepared by preforming QR decomposition from site 1 to sequentially and RQ decomposition from site to sequentially on any MPS, which is called canonicalisation. In the following paper, QR and RQ are not distinguished for simplicity.
If the QR decomposition in Eq. 9 is replaced with singular value decomposition (SVD),
[TABLE]
is a real diagonal matrix with elements ordered from large to small ( ) and if is normalized. fulfills the relation in Eq. 7. If we retain all () and move the gauge center to the right with , the wavefunction is not changed and SVD plays the same role as QR in canonicalisation but with a little bit higher cost. However, if we only retain the largest terms () to obtain , could be a good approximation to with a smaller bond dimension. With this algorithm, a left(right)-canonical MPS could be “compressed” to by successive approximate SVD decomposition from site to 1 (site 1 to ). At each local step of the compression, two truncation criteria are commonly used
(i) a fixed pre-defined ;
(ii) adaptive with all larger than pre-defined retained.
Apart from canonicalisation and compression, several other operations are usually met within the MPS algorithms, including and . Similar to MPS, a common quantum operator could be expressed as a matrix product operator (MPO) Schollwöck (2011); Chan et al. (2016):
[TABLE]
The graphical representation of an MPO is shown in Fig. 1(b). With the local matrix representation of wavefunction and operator in Eq. 1 and Eq. 11, (MPOMPS) could be calculated by contracting locally:
[TABLE]
Suppose the bond dimensions of the original MPS and the MPO are and respectively, then the new state has bond dimension . (MPS + MPS) is constructed by merging the local matrices block-diagonally,
[TABLE]
Suppose the bond dimensions of the original two MPSs are and respectively, then the new state has bond dimension .
In the next subsections, we present the ideas and the algorithms of three different TD-DMRG time evolution schemes adopted in our benchmark calculations afterwards: P&C-RK4, TDVP-MU and TDVP-PS.
II.1.1 Schemes based on Global Propagation and Compression
The propagation and compression (P&C) scheme is a global time evolution scheme benefiting from the modern MPS/MPO representation of DMRG wavefunctions and operators. If and the time derivative are explicitly known (according to Schrödinger equation, the time derivative is in atomic units), any ordinary differential equation (ODE) integrator for the initial value problem (IVP) such as the classical 4th order Runge-Kutta algorithm (RK4) we used in our former work Ren, Shuai, and Kin-Lic Chan (2018), could be applied to obtain the MPS of the next time step . The equations of RK4 are:
[TABLE]
Each is represented by an MPS and is represented by an MPO. Therefore, only two types of operations including MPOMPS (Eq. 13) and MPSMPS (Eq. 14) exist in Eq. 16. After each operation, the new MPS such as will have a larger virtual bond dimension, which should be compressed using algorithm based on Eq. 10 for the further operations. It is worth mentioning that as the new MPS is not canonical, canonicalisation should be carried out before the actual compression.
The procedure of P&C-RK4 described above involves (“MPO MPS”) scaling at for each site (labeled as “”), “MPS + MPS” which does not require intensive computation and MPS compression. The compression consists of four kinds of operations (two for canonicalisation in Eq. 9 and two for compression in Eq. 10), namely QR decomposition of tensors on each site (labeled as “QR”), matrix multiplication to absorb the decomposed coefficients (labeled as “MatMul-QR”) and the SVD counterparts (labeled as “SVD” and “MatMul-SVD”). “QR” and “MatMul-QR” both scale at while their SVD counterparts scale at for “SVD” and for “MatMul-SVD”, because a compression reduces bond dimension from to during the sweep. Thus, the bottleneck of P&C-RK4 is the canonicalisation and the scaling of the method in a single time step is .
Compared to the local evolution scheme such as TST Feiguin and White (2005), P&C could exactly calculate the operation of high order Hamiltonian such as and it is natural to increase the bond dimension with entanglement during the time evolution. However, though P&C is in principle a general evolution scheme for TD-DMRG, it prefers Hamiltonian whose MPO is straightforward to construct and has small bond dimension , such as Frenkel-Holstein type model whose is in a linear relationship to the number of electronic states Ren, Shuai, and Kin-Lic Chan (2018). With regard to the time step size, the error of P&C does not have a monotonic relationship with time step and there is an optimal time step depending on the system as discussed in the previous studies Ronca et al. (2017); Ren, Shuai, and Kin-Lic Chan (2018). A smaller time step will improve the accuracy of RK4 integrator but lead to more MPS compressions and then deteriorate the whole accuracy.
II.1.2 Schemes based on Time Dependent Variational Principle
The Rayleigh-Ritz variational principle is widely used in finding an approximate ground state in time independent Schrödinger equation. Similarly, time dependent variational principle (TDVP) also provides a strong tool to find an optimal time dependent wavefunction if the wavefunction ansatz and the initial state are known. The Dirac-Frenkel TDVP is Dirac (1930); Gatti et al. (2017)
[TABLE]
In a geometric fashion, TDVP could be understood as an orthogonal projection of onto the tangent space of at the current time:
[TABLE]
where is the projector constructed by the orthonormal vectors in the tangent space. For a general MPS in Eq.1, the tangent space projector could be defined as:
[TABLE]
Where
[TABLE]
The inversion of the overlap matrix accounts for the non-orthogonality of the renormalized basis and the “-” terms are to eliminate the parameterization redundancy Haegeman et al. (2011); Wouters et al. (2013). The graphical representation of the projector is shown in Fig. 2. In the literature, there are two different time evolution schemes based on TDVP. They differ in choosing the specific gauge condition of the MPS and in solving Eq. 18, which will be discussed in detail in the following.
TDVP-MU
In the first TDVP evolution scheme, the gauge freedom of MPS is fixed. For convenience, the projector in Eq. 19 could be transformed to Eq. 24 by combining the neighbouring “+” term and “-” term together except one “+” term with :
[TABLE]
where:
[TABLE]
This type of projector or the corresponding tangent space vectors were firstly proposed by Haegeman et al. Haegeman et al. (2011) in TD-DMRG and restated by Wouters et al. Wouters et al. (2013) to derive several post-DMRG methods for ground state and excited states.
Eq. 25 and 26 could be further simplified by adopting a specific gauge condition, and then some overlap matrices turn to identity. Assuming that the MPS is left-canonical with gauge center at site , is reduced to and it is most convenient to set in Eq. 24. Inserting the simplified projector into Eq. 18 yields:
[TABLE]
[TABLE]
where:
[TABLE]
Similar equations can be derived for right/mixed-canonical MPS. With Eq. 28 and Eq. 33, it is straightforward to prove and then:
[TABLE]
which ensures that the left-canonical condition preserves during the time evolution formally. Though in practical numerical calculation with a finite time step, the relation in Eq. 34 would not be rigorously fulfilled, this problem is not severe with a proper time step in our experience. Otherwise, the more general equations in Eq. 24 which have already considered the non-orthogonality of the left and right renormalized basis should be used. And it could be proved that and .
Eq. 27 and Eq. 28 together form a set of coupled nonlinear equations that are very similar to the standard EOMs of (ML-)MCTDH Meyer, Manthe, and Cederbaum (1990); Beck et al. (2000); Wang and Thoss (2003). To integrate these equations, we borrow ideas called variable mean field (VMF) and constant mean field (CMF) from the MCTDH community Beck and Meyer (1997); Beck et al. (2000). VMF employs an all-purpose integrator to directly solve the coupled equations. While in CMF, it is assumed that and generally change much slower in time than the local matrices and . As a result, during the integration of Eq. 27 and Eq. 28 one may hold the “mean field” and constant for and evolve only the local matrix with time step smaller than . Hence, CMF is more efficient yet less accurate than VMF. In this work, we use a second order CMF with a midpoint scheme in which and are integrated to and with “mean field” constructed by and . In VMF, adaptive Dormand-Prince’s 5/4 Runge-Kutta method (RK45) is adopted to integrate Eq. 27 and Eq. 28, while in CMF, since Eq. 27 is linear, it is integrated by Krylov subspace method instead.
Another aspect should be concerned is that the inversion of would be unstable numerically if some eigenvalues of are very small. This problem will be severe when the state is weakly correlated (such as a Hartree product state which is usually an initial state) and is much larger than what is required. To some extent, this instability problem makes this evolution scheme paradoxical in that large should in principle push the result to a numerically exact limit but in fact deteriorates it. The same problem also arises in (ML)-MCTDH, where in order to make the EOMs more “well-behaved”, is usually replaced with a regularized overlap matrix Beck et al. (2000):
[TABLE]
Here is a small scalar commonly from to . More recently, an improved regularization scheme based on the matrix unfolding (MU) of the coefficient matrix by SVD in (ML-)MCTDH is proposed by Meyer and Wang, which has been proved to make the time integration more accurate and robust Meyer and Wang (2018); Wang and Meyer (2018). The same idea is adopted here to integrate Eq. 28, giving the name of the scheme “TDVP-MU”. When calculating the overlap matrix , the gauge center is moved to the th site and the matrix at this site is further decomposed by SVD:
[TABLE]
where equals . Thus, the overlap matrix and its inversion could be expressed as and respectively. The Hamiltonian matrix in Eq. 29 is also reconstructed. For site from to the matrix in Eq. 32 is replaced with matrix in Eq. 36 and then Eq. 28 with matrix unfolding algorithm becomes:
[TABLE]
The expression inside “” is . The key point of this new regularization scheme is that the underlined part could be contracted first, which is . Thus, only the singular matrix instead of should be regularized:
[TABLE]
The power here is for consistency with the original regularization scheme (Eq. 35) and in Eq. 37 is untouched in order to be minimally invasive as stated in Ref. Meyer and Wang, 2018; Wang and Meyer, 2018. In this work we choose unless otherwise stated.
The complexity analysis of the TDVP-MU scheme with VMF and CMF is as follows. Note that in TD-DMRG it is necessary to contract small matrices one by one instead of explicitly constructing the large tensors Chan (2004); Schollwöck (2011). VMF and CMF both require the calculation of environment matrices and scaling at (labeled as “Get Env”), SVD decomposition of the gauge center with the subsequent matrix multiplication scaling at in Eq. 36 (labeled as “SVD” and “MatMul-SVD”), and the calculation of time derivative in Eq. 27 and Eq. 37 (labeled as “Deriv”). Evaluation of the time derivatives contains a number of matrix multiplications and the overall scaling is , where the first two terms and the last term correspond to the contraction of and respectively. Thus the total scaling in VMF is , while in CMF the total scaling for a single step is where is the number of steps required for the propagation of each local site within . Although at first glance CMF is more time-consuming than VMF, much larger step size is possible for CMF which reduces the time cost spent on decomposing the matrix and constructing the environment, compensating the effect of the factor . This time saving in CMF is important in MCTDH in that constructing the environment involves a contraction of the high order coefficient matrix Beck and Meyer (1997); Beck et al. (2000); Gatti et al. (2017). However, in MPS context such advantage is not prominent because all matrices including the coefficient matrix are of the same size. As a result, constructing the environment has a similar complexity as calculating the derivatives.
TDVP-PS
The second evolution scheme based on TDVP is called projector-splitting (PS). The idea of PS is that the tangent space projector in Eq. 19 is invariant under different gauge conditions. More specifically, after canonicalisation of a general MPS in Eq. 1 from site to , becomes the right-hand orthonormal renormalized basis, which is related to by:
[TABLE]
The matrix is an upper triangular matrix in RQ decomposition after the canonicalisation described in Eq. 9. Therefore, the overlap matrix equals and the projector in Eq.21 defined for a general non-canonical MPS is transformed to:
[TABLE]
Similar result can be obtained for :
[TABLE]
On the one hand, this definition of the tangent space projector does not contain any inversion operations of the overlap matrix, which seems to be a remarkable improvement over the first definition in Eq. 20 and Eq. 21. On the other hand, since the gauge is not fixed in different terms of this projector, the integration algorithm described in section TDVP-MU could not be directly applied. Lubich and Haegeman et al. proposed to use a symmetric second order Trotter decomposition to split the formal propagator into the individual terms Lubich, Oseledets, and Vandereycken (2015); Haegeman et al. (2016):
[TABLE]
Based on the propagator in Eq. 42, a single step of time evolution consists of a left-to-right sweep and a subsequent right-to-left sweep each with step size . Taking left-to-right sweep as an example, the matrix at the gauge center is firstly evolved forward in time by applying the projector :
[TABLE]
where and the ingredients , , all have the same definitions as in Eq. 29 30, 31, 32 except that the in Eq. 32 is replaced with or accordingly. Then, the evolved matrix is decomposed by QR as Eq. 9 to obtain the left-canonical matrix and the coefficient matrix . is evolved backward in time by applying the projector :
[TABLE]
Afterwards, the gauge center is moved to site by contracting the evolved and together to obtain . Following the procedure above, the sweep continues until all the individual projectors in Eq. 42 are applied. Inspired by the original two-site DMRG algorithm, it is also possible to formulate TDVP-PS into a two-site algorithm so that the bond dimension could grow up adaptively Haegeman et al. (2016). However, considering the two-site algorithm is much more expensive than the one-site algorithm both in the tensor contraction and QR decomposition, we use the one-site algorithm described above in this paper.
In principle, solving Eq. 43 and Eq. 44 could be accomplished by any ODE integrator such as the RK45 algorithm we use in TDVP-MU. However, since they are linear equations, the Krylov subspace method (Lanczos algorithm for Hermitian operator here) is preferred as it is unitary and is considered to be better than the explicit time-stepping integrators for matrix exponential operator Hochbruck and Lubich (1997); Haegeman et al. (2016). In our calculation, the dimension of the Krylov subspace is adaptive and the Lanczos iteration continues until converges.
TDVP-PS has a similar computational complexity as TDVP-MU (CMF) except for the calculation of derivatives. With simpler EOMs in Eq. 43 and Eq. 44, the calculation of the derivative for forward evolution scales at (labeled as “Deriv-Forward”) and for backward evolution it scales at (labeled as “Deriv-Backward”). So the total scaling for a single step is , where represents the number of Krylov subspace vectors. Besides calculating the derivatives, Krylov subspace method also includes other matrix operations of small size such as calculating the and matrix elements with the Lanczos three-term recurrence relation Arnoldi (1951) and diagonalizing the tridiagonal Hamiltonian in the subspace (labeled as “Krylov”). Other labels for TDVP-PS is the same with that of TDVP-MU.
In summary, since the TDVP-MU scheme and the TDVP-PS scheme are both based on the time dependent variational principle, should be the same if not considering the numerical error. Additionally, the two schemes both require to define a fixed a priori, and additional renormalized basis should be constructed smartly to complement the empty MPS space if the initial state is weakly correlated. Numerically, once is fixed, the error of these two TDVP-based schemes has a monotonic relationship with the time step size. The main difference lies in that TDVP-MU would introduce a minor artificial regularization, while TDVP-PS is inherently free of it. We also notice that TDVP-PS has already been implemented in MCTDH recently Lubich (2015); Kloss, Burghardt, and Lubich (2017); Bonfanti and Burghardt (2018).
II.2 CPU-GPU Heterogeneous Computing
In the TD-DMRG algorithms described above, the hotspots are usually the tensor contractions and the matrix decompositions. GPUs are well suitable for the former not the latter. Therefore, we adopt a CPU-GPU heterogeneous computing strategy. We store the matrices in the GPU memory instead of the host memory, and in most cases call cuBLAS to manipulate them. When doing matrix decomposition, we first transfer the matrix from the GPU memory to the host memory and then use a single CPU core to complete the decomposition, and in the end copy the matrix back. The overhead caused by data transfer is not negligible which we believe could be avoided by a smarter implementation because GPUs are able to run computation and data transfer at the same time.
For comparison, multi-core CPU calculation is also benchmarked. Here, we only consider the simplest strategy of CPU parallelism, which is breaking up the dense matrix computations into subblocks performed automatically by the standard linear algebra libraries such as Intel® Math Kernel Library in our case.
II.3 Computational Details of the FMO Model
The widely used Frenkel-Holstein Hamiltonian Holstein (1959); Schröter et al. (2015) to describe the 7-site FMO model is that:
[TABLE]
where is the exciton creation (annihilation) operator on the th site whose local excitation energy is , is the Coulomb interaction between the th and th site, is the phonon creation (annihilation) operator of vibration mode of th site with vibration frequency and dimensionless electron-phonon coupling strength . Fig. 3 shows a diagram of the model.
We obtain and from previous literature Moix et al. (2011); Schulze et al. (2016) and each site has the same environment with and calculated by equally spaced discretization of the bath spectral density de Vega, Schollwöck, and Wolf (2015) from experiments Wendling et al. (2000); Schulze et al. (2016). For each site, 35 vibration modes are discretized, giving 252 DOFs in total (245 vibrational DOFs + 7 electronic DOFs). The dimension of basis for each mode varies with frequency and most modes have 8 or 4 phonon occupation levels. To minimize the entanglement, the electronic sites in the MPS chain are arranged in the order of [7, 5, 3, 1, 2, 4, 6]. All the local vibrational DOFs are arranged next to the local electronic site as our previous work Ren, Shuai, and Kin-Lic Chan (2018).
We simulate the exciton dynamics at zero temperature. At , an excitation at electronic site 1 is prepared with all vibrations at their lowest energy level, which composes a Hartree product state. Therefore, for simulation with , TDVP-based methods (TDVP-MU and TDVP-PS) should somehow find renormalized basis to complement the empty matrix elements in the initial MPS. To solve this, in all the simulations, we use P&C-RK4 to propagate the state with and time step size in the first 10 steps to . Thereafter, the state is compressed to the pre-defined bond dimension and then evolves with different evolution schemes. Throughout this paper we use atomic unit (a.u.) as the unit of time unless otherwise stated and omit the unit of and for simplicity.
III Results and Discussions
III.1 Accuracy and Time Step Size
To quantitatively evaluate the relative accuracy in our calculations, the mean cumulative deviation of exciton populations at time is used:
[TABLE]
where represents the exciton population at the th site of the 7-site FMO model. is the reference result obtained from TD-DMRG calculation with an appropriate MPS bond dimension and evolution time step size . The standard for choosing and is that the exciton populations have converged on both of them. Such convergence on for the TDVP-PS method is illustrated in Fig. 4 where all curves have already converged on . The population obtained with has already been semi-quantitatively correct although the deviation is visible after . The curves for and almost overlap with each other. To be more discreet, we choose with as the reference for better accuracy. A more comprehensive and quantitative demonstration for the validity of our reference is included in Appendix A. Note that using reference calculated by TDVP-MU (VMF) with sufficiently small regularization parameter rather than TDVP-PS has little effect on results presented in this paper.
The mean cumulative deviation of the three schemes as a function of is plotted in Fig. 5 with . For P&C-RK4 the time step size is 160 which strikes a balance between the RK4 integrator error and the MPS wavefunction compression error. The step size in TDVP-based schemes are chosen such that the error due to the integrator is negligible compared to the restricted . For TDVP-MU (VMF) the step size controlled by the adaptive RK45 integrator (relative tolerance and absolute tolerance are and respectively) spans from 5 to 30. For TDVP-MU (CMF) and TDVP-PS we choose which is a relatively small value to reduce the error introduced by the CMF approximation or Trotter decomposition. We find that at long time scale the TDVP based methods exhibit almost the same error, as they share the same starting point during their derivation, while P&C-RK4 guided by a different philosophy shows larger error than TDVP based methods.
Although TDVP-MU and TDVP-PS are similar in long time limit, subtle differences between the two schemes when are observed, which is illustrated in Fig. 6. TDVP-MU (CMF) is not shown in Fig 6 as it produces the same result with the VMF approach at small step size limit. Fig. 6 shows that TDVP-MU (VMF) is not as accurate as TDVP-PS at short time scale, which is probably caused by the artificial regularization in Eq. 38. Using TDVP-MU (VMF) with sufficiently small regularization parameter as the reference will give the same tendency shown here. When the regularization parameter is decreased from to , which reduces the invasion to the original EOMs, the accuracy of TDVP-MU (VMF) is improved but a smaller time step size is required. However, it should be noted that the error caused by the regularization at the short time regime is several orders of magnitude smaller than the error at the long time regime controlled by the MPS approximation of the wavefunction with a restricted as illustrated in Fig 5. Thus in this case setting to be is more reasonable in terms of practical computation because this setup costs less simulation time than setting to be or without significant precision deterioration. We also compare the new MU regularization scheme with the original regularization scheme and find that the new MU scheme is more accurate with the same as the former studies in (ML-)MCTDH Meyer and Wang (2018); Wang and Meyer (2018).
Despite the fact that TDVP based methods will obtain the same results in the small step size limit and small regularization parameter limit, they behave differently to the adjustment of because their EOMs and integration algorithms differ from each other. Since TDVP-MU (CMF) and TDVP-PS are both second order methods, we investigate how large the time step can be used in these two schemes with . The upper panel of Fig. 7 illustrates the error at which is at the early stage of the evolution. The error of TDVP-MU (CMF) shows a linear relationship with under logarithmic scale and its slope of linear fitting is 2.03, which is in accordance with the second order approximation of TDVP-MU (CMF) . It also indicates that for CMF the integrator error is dominant in this time scale. To the contrary, the error of TDVP-PS is insensitive to within the range of shown in Fig. 7, which means the error due to second order Trotter decomposition is smaller than the error due to MPS ansatz with . In the lower panel, the time to measure the error is at and we observe generally the same trend as in the upper panel, except that when the error of TDVP-MU (CMF) and TDVP-PS are close to each other, indicating that the time step has already converged at this time point and the error is controlled by the MPS ansatz. However, as the upper panel, TDVP-PS allows a time step at least 16 times larger than TDVP-MU (CMF) for converged result. Therefore, though both TDVP-PS and TDVP-MU (CMF) implemented here are second order methods, the prefactor of the error term of TDVP-PS is much smaller than that of TDVP-MU (CMF).
III.2 GPU Acceleration
According to the analysis of computational complexity of the intensive steps in the three time evolution schemes in Section II (a summary is shown in Table 1), we explore how these steps can be accelerated by multi-core CPU and particularly CPU-GPU heterogeneous computing. Although the time cost of each algorithm should vary with different implementations and there certainly are rooms for optimizations in our codes, we believe the data presented in this section reflects the correct tendency. Our benchmark platforms are Intel® Xeon® CPU E5-2680 v4 @ 2.40GHz for CPU-only calculations and Intel® Xeon® Gold 5115 CPU @ 2.40GHz with NVIDIA® Tesla® V100-PCIE-32GB for CPU-GPU heterogeneous calculations. Only one GPU is used in the benchmark.
For the three schemes discussed in this paper with , the wall time cost of a single evolution step () as well as the composed sub-steps are presented in Fig. 8. For TDVP-MU (VMF), there are several steps within . We firstly focus on the total time cost of each scheme in Fig. 8 from which we can learn that for all schemes the parallelization implemented in the standard linear algebra library on CPU is not optimal for TD-DMRG. Using 4 cores can merely double the computational speed, and further improvement is not significant up to using all 28 cores of the CPU. A more inspiring fact is that the CPU-GPU heterogeneous computing is able to boost the speed of time evolution up to 40 times for TDVP based algorithms, although for P&C-RK4 the effect of the GPU is comparable with 4 cores of CPU. We note that when doing calculations on GPU, the GPU usage is not always 100%, so more drastic speedup is expected with even larger bond dimension. Indeed, when the time costs of TDVP-MU (CMF) on single core CPU and the heterogeneous CPU-GPU are 4572 s and 63 s respectively, indicating a 73-fold acceleration.
A closer inspection of the sub-steps shown in Fig. 8 reveals more details about the acceleration. For P&C-RK4 , the bottlenecks of the algorithm are “QR” and “MatMul-QR” as discussed in Section II.1.1. Although “MatMul-QR” can be efficiently accelerated by multi-core CPU and CPU-GPU heterogeneous computing, the same is not true for “QR” and “SVD”, limiting the overall acceleration efficiency. When the calculation runs on the GPU, the matrix multiplication virtually costs no time while “QR” and “SVD” become the bottlenecks. When the GPU is incoperated in TDVP-MU (VMF) , the SVD decomposition for regularization becomes prominent even though it only scales at in a single step. Thus, it is suggested to return back to the original equation in Eq. 28 for higher efficiency once the regularization does not take effect anymore, owing to the fact that calculating the inverse of directly only costs . Even better acceleration efficiency can be achieved in TDVP-MU (CMF) because the SVD decomposition is carried out less frequently. The “Get Env” part also takes less time in CMF than VMF due to the same reason. The time costs of bottleneck sub-steps of TDVP-PS are similar with that of TDVP-MU (CMF) except that the integrator seems to take a large fraction of time especially when running on the GPU. This is considered to be a demonstration of the GPU latency which makes the total time cost of lots of operations on small matrices larger than the total time cost of a few operations on large matrices. When is increased to 256 the relative time cost of the “Krylov” part is diminished. The exact abnormality with GPU acceleration is also manifested to a lesser extent in TDVP-MU (CMF) where the time-consuming “other” part is primarily composed of operations in the Runge-Kutta integrator. Hence, if a smaller bond dimension is employed, devoting more resource on the computation will not gain much benefit. In fact, for TDVP based schemes the absolute wall time with GPU are roughly the same between and . For example, the time costs for a single step of evolution in TDVP-PS with and are 10.8 s and 7.9 s respectively. This phenomenon implies that when the latency of GPU calculation is of the same order with the actual computational time cost. Therefore, with the CPU-GPU heterogeneous computation, TD-DMRG simulation with for a large-scale system would become a routine task in the future.
IV Conclusion
To summarize, in this paper, we carry out numerical benchmarks on three different TD-DMRG time evolution schemes, which are P&C-RK4, TDVP-MU and TDVP-PS, in terms of accuracy based on the vibronic coupling model represented by the FMO complex. After defining the tangent space projector for a general non-canonical MPS, the EOMs of TDVP-MU and TDVP-PS are re-derived from the same starting point but with different gauge conditions. The numerical results demonstrate that TDVP-MU and TDVP-PS could indeed obtain similar accuracy with a converged time step size , while P&C-RK4 is not as accurate as them. However, surprisingly, the converged time step size of TDVP-PS is at least 16 times larger than that of TDVP-MU (CMF), although both of them implemented in this work are second order methods. Regarding computational efficiency, we firstly analyze the complexity of each evolution scheme. To further accelerate the intensive tensor computations in TD-DMRG, we adopt a CPU-GPU heterogeneous computing strategy, in which CPU and GPU are respectively responsible for the tensor decomposition and contraction. We find that for the TDVP-based schemes where the tensor contraction is the main bottleneck, this heterogeneous computing approach is able to speed up these two schemes by up to 73 times (). Finally, we conclude that with the accurate and efficient TDVP-based evolution schemes (especially TDVP-PS) and CPU-GPU heterogeneous algorithm/hardware, TD-DMRG has been a promising method for the large-scale quantum dynamics simulation of real chemical and physical problems.
Acknowledgements.
This work is supported by the National Natural Science Foundation of China through the project “Science CEnter for Luminescence from Molecular Aggregates (SCELMA)” Grant Number 21788102, as well as by the Ministry of Science and Technology of China through the National Key R&D Plan Grant Number 2017YFA0204501. J.R. is also supported by the Shuimu Tsinghua Scholar Program. The authors are indebted to Prof. Garnet Chan for the stimulating discussions. The authors also gratefully thank Mr. Hongde Yu for support on crystal structure of the FMO complex.
Appendix A The validity of the reference state
Here we present the data for the numerical convergence of the reference obtained by TDVP-PS with and . The mean cumulative deviation with various and at and is shown in Table 2, while the same data for and various is shown in Table 3.
From Table 2 and Table 3 we can see that the reference is accurate to a level of approximately for short time scale () and for long time scale (), which is considered to be accurate enough for the error scales discussed in the main text.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ma, Luo, and Yao (2018) H. Ma, Z. Luo, and Y. Yao, “The time-dependent density matrix renormalisation group method,” Mol. Phys. 116 , 854–868 (2018).
- 2Paeckel et al. (2019) S. Paeckel, T. Köhler, A. Swoboda, S. R. Manmana, U. Schollwöck, and C. Hubig, “Time-evolution methods for matrix-product states,” ar Xiv preprint ar Xiv:1901.05824 (2019).
- 3White (1992) S. R. White, “Density matrix formulation for quantum renormalization groups,” Phys. Rev. Lett. 69 , 2863 (1992).
- 4White (1993) S. R. White, “Density-matrix algorithms for quantum renormalization groups,” Phys. Rev. B 48 , 10345 (1993).
- 5Shuai et al. (1998) Z. Shuai, J. Brédas, A. Saxena, and A. Bishop, “Linear and nonlinear optical response of polyenes: A density matrix renormalization group study,” J. Chem. Phys. 109 , 2549–2555 (1998).
- 6Zhang, Jeckelmann, and White (1999) C. Zhang, E. Jeckelmann, and S. R. White, “Dynamical properties of the one-dimensional holstein model,” Phys. Rev. B 60 , 14092 (1999).
- 7Yao (2017) Y. Yao, “Polaronic quantum diffusion in dynamic localization regime,” New J. Phys. 19 , 043015 (2017).
- 8Barford and Mannouch (2018) W. Barford and J. R. Mannouch, “Torsionally induced exciton localization and decoherence in π 𝜋 \pi -conjugated polymers,” J. Chem. Phys. 149 , 214107 (2018).
