Enhanced parallelization of the incremental 4D-Var data assimilation algorithm using the Randomized Incremental Optimal Technique (RIOT)
Nicolas Bousserez, Jonathan J. Guerrette, Daven K. Henze

TL;DR
This paper introduces RIOT, a parallelized version of incremental 4D-Var using randomized SVD, significantly reducing computation time in atmospheric data assimilation while increasing energy use, with promising results in models and real-world problems.
Contribution
The paper presents RIOT, a novel parallelization method for incremental 4D-Var that replaces CG with RSVD, improving efficiency and enabling faster data assimilation in weather prediction.
Findings
RIOT reduces wall-time by 2-3 times in black carbon source inversion.
Performance gains increase with non-linearities in the Lorenz-96 model.
Analysis covariance computation is 40 times faster with RIOT.
Abstract
Incremental 4D-Var is a data assimilation algorithm used routinely at operational numerical weather predictions centers worldwide.This paper implements a new method for parallelizing incremental 4D-Var, the Randomized Incremental Optimal Technique (RIOT), which replaces the traditional sequential conjugate gradient (CG) iterations in the inner-loop of the minimization with fully parallel randomized singular value decomposition (RSVD) of the preconditioned Hessian of the cost function. RIOT is tested using the standard Lorenz-96 model (L-96) as well as two realistic high-dimensional atmospheric source inversion problems based on aircraft observations of black carbon concentrations. A new outer-loop preconditioning technique tailored to RSVD was introduced to improve convergence stability and performance. Results obtained with the L-96 system show that the performance improvement from…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Enhanced parallelization of the incremental 4D-Var data assimilation algorithm using the Randomized Incremental Optimal Technique (RIOT)
Nicolas Bousserez
University of Colorado
Department of Mechanical Engineering
Boulder, CO, USA
[email protected] &Jonathan J. Guerrette
University of Colorado
Department of Mechanical Engineering
Boulder, CO, USA
NOAA, Chemical Sciences Division
Boulder, CO, USA
&Daven K. Henze
University of Colorado
Department of Mechanical Engineering
Boulder, CO, USA Now at ECMWF, Reading, UK Now at NCAR, Mesoscale and Microscale Meteorology Laboratory, Boulder, CO, USA
Abstract
Incremental 4D-Var is a data assimilation algorithm used routinely at operational numerical weather predictions centers worldwide. The algorithm solves a series of quadratic minimization problems (inner-loops) obtained from linear approximations of the forward model around non-linear trajectories (outer-loops). Since most of the computational burden is associated with the inner-loops, many studies have focused on developing computationally efficient algorithms to solve the least-square quadratic minimization problem, in particular through time parallelization. This paper implements a new method for parallelizing incremental 4D-Var, the Randomized Incremental Optimal Technique (RIOT), which replaces the traditional sequential conjugate gradient (CG) iterations in the inner-loop of the minimization with fully parallel randomized singular value decomposition (RSVD) of the preconditioned Hessian of the cost function. RIOT is tested using the standard Lorenz-96 model (L-96) as well as two realistic high-dimensional atmospheric source inversion problems based on aircraft observations of black carbon concentrations. A new outer-loop preconditioning technique tailored to RSVD was introduced to improve convergence stability and performance. Results obtained with the L-96 system show that the performance improvement from RIOT compared to standard CG algorithms increases significantly with non-linearities. Overall, in the realistic black carbon source inversion experiments, RIOT reduces the wall-time of the 4D-Var minimization by a factor 2-3, at the cost of a factor 4-10 increase in energy cost due to the large number of parallel cores used. Furthermore, RIOT enables reduction of the wall-time computation of the analysis error covariance matrix by a factor 40 compared to a standard iterative Lanczos approach. Finally, as evidenced in this study, implementation of RIOT in an operational numerical weather prediction system will require a better understanding of its convergence properties as a function of the Hessian characteristics and, in particular, the degree of freedom for signal (DOFs) of the inverse problem.
1 Introduction
Inverse problems and data assimilation (DA) in geophysical sciences are often challenging owing to the large size of the parameter and observation spaces. In the context of non-linear Bayesian inference, an analytical solution is often not computationally tractable. Techniques thus have to be designed to efficiently approximate the solution to the inverse problem, which leads to a proliferation of approaches as described in overviews of methods applied throughout geophysics (Carrassi et al., 2018; Asch et al., 2016; Tarantola, 2005), or those specifically related to numerical weather prediction (Kalnay, 2003), remote sensing (Rodgers, 2000), atmospheric chemistry (Bocquet et al., 2015) and biogeochemistry(Rayner et al., 2018). The subset of techniques known as variational (Var) approaches seek to solve for the maximum a posteriori solution (MAP) of the Bayesian problem using iterative minimization procedures. In particular, one such widely used DA technique is the incremental 4D variational approach (Courtier et al., 1994). This method entails a series of linearizations of the system to formulate quadratic optimization problems, whose solutions (the increments) are often sought using conjugate gradient (CG) methods. Each quadratic minimization, called an “inner-loop”, is followed by an “outer loop” iteration wherein the full non-linear model is integrated to propagate the updated increment. The incremental 4D-Var approach is computationally efficient in that the inner-loop minimizations may be performed using simplified, dimension-reduced, and/or coarser resolution tangent-linear (TL) and adjoint (AD) models (Lawless et al., 2008). It has thus been used at operational numerical weather prediction (NWP) centers such as the UK Met Office (Rawlins et al., 2007), the Meteorological Service of Canada (Laroche et al., 2005), Météo-France (Gauthier and Thépaut, 2001), the Japan Meteorological Agency (Honda et al., 2005), and the European Centre for Medium-Range Weather Forecasts (Rabier et al., 2000).
Despite the widespread use of the incremental 4D-Var approach, a drawback (especially when compared to ensemble-based DA techniques (e.g., Evensen, 2009; Houtemaker and Zhang, 2016), is that this method does not easily lend itself to parallelization (i.e., beyond using parallel computing to speed up the evaluation of any single model evaluation). Spatial parallelization of variational DA techniques has long been considered (e.g., Elbern and Schmidt, 1999; Rantakokko, 1997; Trémolet and Dimet, 1996). Moreover, methods have been developed to exploit time parallelism in 4D-Var through sub-division of the assimilation time window (Rao and Sandu, 2016) or through the saddle-point formulation in the weak-constraint algorithm (Fisher and Gürol, 2017), although the convergence of the latter cannot always be guaranteed (Gratton et al., 2018). Most recently, Mercier et al. (2018, 2019) have used block-krylov methods to accelerate an ensemble of variational DA in the context of 3D-Var, where all observations are assimilated at the same time (i.e., the 4D model dynamic is not considered in the assimilation). In this approach, Krylov subspaces computed in parallel for different perturbed 3D-Var ensemble members are concatenated to form a larger Krylov subspace which is used to minimize simultaneously all members, resulting in a speedup of the optimization of 13-45%. However, this framework has not been applied to an ensemble of 4D-Var minimizations, where larger non-linearities are expected that could limit the efficiency of the approach due to significant variation of the Hessian across ensemble members.
In this study, we evaluate a different and novel approach to parallelizing strong-constrained incremental 4D-Var algorithms, which was recently proposed by Bousserez and Henze (2018) – the Randomized Incremental Optimal Technique (RIOT). The key component of the algorithm is the use of randomized singular vector decomposition (RSVD) methods (Halko et al., 2011), which enables parallel model integrations in place of the traditional sequential minimization steps of the inner loops. RIOT thus introduces a new level of parallelization for variational DA that is advantageous when the available computing resources exceed that which can be efficiently used by any single forward model evaluation (e.g., through parallel computation with MPI communications).
Here we evaluate the performance of RIOT against standard incremental 4D-Var techniques that perform the inner-loop minimizations using iterative CG algorithms. Our numerical experiments are based on a Lorenz-96 (hereafter L-96) system, as well as a high-dimensional source inversion problem (recovering emissions of black carbon aerosol based on ambient concentration measurements). In Section 2.1, we present the theoretical background of the incremental 4D-Var method and introduce the different algorithms evaluated in this study. In Section 3, we present and discuss results from numerical DA experiments and assess the potential of RIOT by analyzing its computational and convergence characteristics compared with those of traditional iterative minimization methods. Finally, a summary of our results and a discussion on the limitations and potential of RIOT for use in operational 4D-Var DA systems is provided in the conclusion.
2 Theory and Methods
2.1 Incremental 4D-Var
In data assimilation, we seek the analysis, , that is the Maximum a Posteriori (MAP) solution of a Bayesian problem, whose posterior distribution is proportional to the product of the prior and likelihood probability distribution functions (pdfs). It is common to assume that the prior and likelihood pdfs are Gaussian. More specifically, we assume that is the vector of parameters with prior distribution associated with the background and covariance error matrix , and that is a vector of observations distributed in time and space (4D) with likelihood function associated with a forward model and an error covariance matrix . Here we use the term parameters in a general sense that includes both model state prognostic variables and non-state variables (e.g., boundary conditions). The analysis can be calculated by minimizing the following cost function:
[TABLE]
When the parameter and observation vectors are both high-dimensional and the integration of is computationally expensive, the minimum of (2.1) can be obtained using an iterative algorithm, which is the aforementioned 4D-Var approach. Two popular gradient-based optimization algorithms for large-scale non-quadratic problems are the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method (Liu and Nocedal, 1989; Byrd et al., 1995) and the truncated Gauss-Newton method.
Incremental 4D-Var (Courtier et al., 1994) is a special case of the latter (Lawless et al., 2005; Gratton et al., 2007) that approximately (Lawless et al., 2008) minimizes a sequence of quadratic cost functions to solve the non-quadratic problem. In short, each quadratic minimization (inner-loop) solution provides a so-called increment that is added to the current parameter vector, from which another non-linear integration is performed (outer-loop) to update the model trajectory. A linearization of the model around that new model trajectory leads to the next quadratic cost function minimization. The process continues until a convergence criteria is reached. The quadratic minimization in incremental 4D-Var consists of solving the following problem:
[TABLE]
where and are the increment and innovation vectors, respectively, and denotes the Jacobian evaluated at . The minimization problem (2) is usually solved using an iterative CG algorithm (Chao and Chang, 1992), for which only the product of the Jacobian or its transpose with a vector is required. Therefore, only tangent linear (TL) and adjoint (AD) model integrations are necessary to solve the minimization problem (2).
2.2 Principle of the Methods
Approximating the Inverse Hessian
It has been shown that the CG method is strongly connected to the Lanczos algorithm (Golub and Loan, 1989). In particular, the solution of (2) obtained from iterations of a CG minimization can be written as follow:
[TABLE]
where is the prior-preconditioned Hessian , is an orthonormal basis of , + denotes the pseudo-inverse and .
Furthermore, the pseudo-inverse of a matrix can be expressed as a function of its singular decomposition, i.e.:
[TABLE]
where and represent the singular vectors and singular values of , respectively. Therefore, one sees from (6) and (7) that iterations of a CG minimization involves an implicit approximated eigenvalue decomposition of .
In Bousserez and Henze (2018), following earlier work by Bui-Thanh et al. (2012), an alternative to using the CG method for minimizing (2) was proposed, in which the Krylov subspace approximating the range of is replaced by a randomized range finder , where the are random vectors drawn from a normal distribution (Halko et al., 2011). The advantage of this approach is that it presents a new dimension for parallelization of the algorithm, since each element of can be computed independently. Another feature of RIOT is that it uses an adaptive increment at each inner-loop based on the approximated spectra of that ensures statistical optimality. The interested reader can refer to Bousserez and Henze (2018) for more details on the optimal adaptive approach.
In addition to these purely sequential and parallel methods, others such as the randomized power iteration (e.g., Halko et al. (2011)) and block-Krylov approaches use both iteration and matrix-block multiplication. Musco and Musco (2015) show that their version of a block-Krylov approach is more efficient for matrix decomposition than the randomized power iteration, which is why we explore the block-Lanczos algorithm as an alternative to CG or RSVD. Block-Lanczos (Golub and Loan, 1989) is designed to solve a linear-least-squares problem where there is statistical uncertainty in the right-hand-side of the equation. In the sequence of quadratic minimization problems of incremental 4D-Var, the right-hand-side is the cost function gradient, the uncertainty of which can be approximated by realizations of the observation () and prior (). In that case the range approximation of becomes , where and is an independent realization of the gradient (see Alg. 2). This approach thus combines aspects of both RSVD (parallelization of the prior-preconditioned Hessian-vector products) and CG methods (construction of Krylov subspaces). Its advantages are that it usually requires a smaller number of iterations than CG and fewer ensemble members than RSVD, therefore enabling the algorithm to adjust to available computational resources.
Preconditioning
The quadratic minimization (2) uses the change of variable . This approach allows one to: 1) implicitly invert the background covariance matrix in (2.1), since this large matrix cannot be represented explicitly; 2) improve the convergence rate of the minimization by reducing the condition number of the Hessian of the cost function. Reducing the condition number of the Hessian through a change of variable is called preconditioning the problem. An additional level of preconditioning is possible by exploiting the approximations of the Hessian obtained as a by-product in each inner-loop minimization (Tshimanga et al., 2008). More specifically, the preconditioned problem consists of solving:
[TABLE]
where represents the condition number of a matrix, i.e., the ratio between its largest and smallest eigenvalues. Ideally, , in which case the CG minimization converges in one iteration. This is because by construction the CG algorithm has converged when the Krylov subspace becomes invariant under multiplication by . Therefore, a good strategy is to choose a preconditioner such that: . A standard approach is to use the following spectral preconditioner:
[TABLE]
where the and approximate the eigenvectors and eigenvalues of , respectively (Tshimanga et al., 2008). The efficiency of the spectral preconditioner has been verified, e.g., in the ECMWF 4D-Var operational DA system, where it decreases the wall-time of the minimization by about 20% compared to the non-preconditioned case (personal communication).
On the other hand, the RIOT algorithm relies on an approximation of the inverse Hessian (see Eq. 6) using random samples of the range of . As a consequence, the subspace (defined by ) in which the cost function is minimized is agnostic to the right-hand term of (8) and will thus be suboptimal (i.e., for a given rank the residual error is not minimized). Furthermore, assuming weak non-linearities, the inverse Hessian approximations computed from RIOT in (6) will tend to be similar across outer iterations, although the modes resolved in previous outer iterations will have only small contributions to the current inner loop solution. This may prevent the algorithm from further reducing the cost function beyond the first outer iteration and thus, convergence of the minimization.
An additional concern is related to the eigenvalue spectra. The spectral preconditioner (8) transformation tends to cluster around 1 the eigenvalues associated with the modes resolved in previous outer iterations, which can be used to filter them out in the RSVD Hessian approximation of subsequent iterations to more efficiently (using fewer ensemble members) extract the remaining unresolved modes with eigenvalues greater than 1. However, the accuracy of the RSVD method itself is somewhat at odds with this approach. While RSVD best approximates the dominant eigenmodes of the targeted matrix (Halko et al., 2011), the accuracy of the estimation is degraded for flat eigenvalue spectra, i.e., when the rate of decrease of the eigenvalues is small. As a consequence, the decrease in the Hessian condition number across outer iterations from the spectral preconditioner can result in poor minimization performances in RIOT. One solution to mitigate the interference of previously resolved modes in the RSVD approximation is to restrict the sampling of the Hessian matrix to the subspace orthogonal to those modes. This improvement will be referred to as the rotation method in the following. The rotation technique is presented in Alg. 3.
The Adaptive Approach
The RIOT algorithm computes a low-rank RSVD approximation of the (prior-preconditioned) Hessian, from which an approximated posterior increment and posterior error covariance matrix are derived. Based on the approximated eigendecomposition of the Hessian, two updates are possible for the posterior error covariance matrix, as follows (Bousserez and Henze, 2018):
[TABLE]
where the are the eigenvectors and values, respectively, of the prior-preconditioned Hessian , in decreasing eigenvalue order, and where defines the truncation number for the eigenvalue decomposition. Here the updates LRU and LRA stand for low-rank update and low-rank approximation, respectively, since (11) consists in an update to the prior covariance matrix and is therefore full-rank, while (12) is a rank- approximation to the posterior covariance matrix. The corresponding LRU and LRA posterior increment approximations are obtained by using (11) and (12), respectively, in the increment formula:
[TABLE]
Assuming an exact truncated eigenvalue decomposition (TSVD) of the Hessian is used in (11) and (12), statistical optimality results can be used to design an adaptive approach to select either the LRU or LRA updates based on the approximated eigenvalue spectra. Thus, the following selection strategy is used in the RIOT algorithm:
[TABLE]
The reader can refer to Bousserez and Henze (2018) for more information on the theory.
2.3 Algorithms
We recall below the RIOT algorithm, which we shall evaluate in this study. A variant of RIOT for the dual formulation of 4D-Var (RIOT-PSAS) is provided in the Appendix. The standard incremental CG algorithm (VarCG), as well as the Block-Lanczos method (VarBL) are also reformulated below so as to facilitate direct comparisons between methods. VarCG is presented using the equivalent Lanczos recurrence in the inner loop for simple comparison to the block variant and to recall the derivation of the approximate Hessian eigen modes. The need for re-orthogonalization is well-documented for both sequential and block Krylov methods (Golub and Loan, 1989), and these are applied for both VarCG and VarBL. For the sake of readability, the spectral preconditioning steps are not included for the VarCG, VarBL and RIOT-PSAS methods, but only presented in the RIOT algorithms. The VarCG minimization algorithm can be written as follows:
As explained previously, the VarBL method realizes a similar algorithmic parallelization as RIOT, but also requires iterative constructions of Krylov subspaces similar to VarCG. The inner-loop minimization follows the Block Lanczos algorithm given by Golub and Loan (1989), where the starting block is comprised of realizations of the cost function gradient. VarBL can be written as follows:
Finally, the RIOT algorithm, which includes preconditioning and sampling improvement (rotation) options, is as follows:
3 Numerical Experiments
3.1 Lorenz-96 System
3.1.1 Description
The nonlinear dynamical Lorenz-96 system is frequently used to test new data assimilation methods, and is defined as follows (Lorenz, 1996):
[TABLE]
where represents the state of the system, , , and is a forcing term.
The dimension, , and forcing, , are free parameters of the Lorenz-96 system. Often is set to and to ensure chaotic dynamics. In our test , but in order to increase the potential degrees of freedom (DOFs). The initial true state, , was obtained from a spin-up simulation that was initialized with and perturbation of at , then integrated forward in time for 144,000 time steps. In this experiment was used as the truth in a series of synthetic observation inversions where the pseudo-observations are obtained by integrating the model from , i.e., . In this case 100 observations are distributed randomly in space and time. Two assimilation windows are considered in our experiment, a 6-hour, 48-hour, 72-hour, and 96-hour windows with total integrations of . All four scenarios use a time-step of . The prior error variances are defined as and the background error correlations as Gaussian functions with a length-scale of 1.5 grid cells. Observation error variances are defined as . Observation errors are assumed to be uncorrelated. The observations used in the assimilation are obtained by perturbing the pseudo-observations according to their error statistics: , where . Similarly, the prior used in the inversion is obtained by perturbing the truth: where .
3.1.2 Results and Discussion
Figures 1, 2, 3 and 4 show the non-quadratic cost function values at the beginning of each outer iteration for VarCG, VarBL, and for RIOT, for the 6-hour, 48-hour, 72-hour, and 96-hour L-96 problem, respectively. Also included is a reference solution computed using Gauss-Newton and the exact Hessian. For the 6-hour experiment, the VarCG minimization converges in the first outer iteration (i.e., after only 7 inner iterations ), which reflects the strongly linear behavior of the system. The performance of the RIOT algorithm is sensitive to both the number of samples used in the RSVD () and the preconditioning. For example, the top row of Fig. 1 shows that when using 35 samples in the RSVD the minimum of the cost function is not reached after 10 outer iterations, whereas it is reached when 75 samples are used.
Using the spectral preconditioning in RIOT significantly improves the convergence rate of the minimization in the 6-hour case, especially when using 35 or 50 samples in the RSVD. For instance, the spectral preconditioning technique enables RIOT with 35 samples to convergence in about 6 outer iterations, similar to the non-preconditioned RIOT case with 75 samples. Preconditioning RIOT thus mitigates the need to increase the ensemble size in the RSVD. For the 6-hour window problem, the rotation approach does not improve the results compared to those obtained with spectral preconditioning alone.
The same is not true for the more non-linear cases (48-hour to 96-hour windows), where the spectral preconditioner produces spurious increases in the cost function, in some cases resulting in the divergence of the RIOT algorithm (72-hour and 96-hour cases). As explained in section 2.2, the preconditioner reduces the condition number of the Hessian, which can degrade the performance of the RSVD in approximating the dominant eigenvectors by increasing the interference of weaker modes. Introducing the rotation step allows one to efficiently filter out part of that noise by reorienting the RSVD input samples in the direction of the dominant Hessian modes. The preconditioning and rotation approaches appear to be beneficial to VarBL as well in strongly non-linear regimes (see 96-hour window results).
In all L-96 experiments, combining the spectral preconditioner with the rotation method in the RSVD enables RIOT to converge to the Gauss-Newton reference solution either notably more rapidly (for the weakly non-linear case) or as rapidly (highly non-linear case) as the non-preconditioned case. Conversely, the spectral preconditioner applied to VarCG does not improve the performance of the minimization. As noted by Tshimanga et al. (2008), using a spectral preconditioner that implicitly assumes exact eigenvectors computation in place of the more consistent Ritz preconditioner can result in poorer performances than obtained from a non-preconditioned VarCG minimization. The impact of using the Ritz preconditioner instead of the spectral preconditioner is beyond the scope of this paper. Note that tests conducted with the rotation modification only in RIOT (i.e., without spectral preconditioning) did not show any improvement of the convergence of the minimization except for the almost linear case corresponding to the 6-hour window (results are not shown here).
In conclusion, although for a given rank of the Hessian approximation (i.e., number of samples in VarBL and RIOT, or number of iterations in VarCG) VarCG provides better minimization performances than either RIOT or VarBL, the preconditioning techniques applied to the latter methods ensure convergence in both linear and non-linear regimes, even using a moderate number of samples for the RSVD Hessian approximations. A noteworthy characteristic of the spectral preconditioning with rotation used in RIOT or VarBL is its stability compared to all other approaches. This much desirable property of our preconditioning technique is clearly observed in the strongly non-linear cases (Fig. 3 and 4), where both RIOT and VarBL exhibit erratic behavior of the minimization with sudden increases in the cost function value when no preconditioning is applied. Lastly, as discussed in section 2.2, both VarBL and RIOT enable highly parallel implementations of incremental 4D-Var, the computational wall-time of the minimization has the potential to be drastically reduced compared to the sequential VarCG method, provided enough nodes are available to run many instance of the TL and AD models in parallel. In the next section we illustrate this aspect of the performance of RIOT in a more practical context using two large-scale atmospheric source inversion problems.
3.2 Application to an Atmospheric Source Inversion Problem
In this section we further analyze the properties of the RIOT algorithm in a more realistic framework by considering two large-scale atmospheric source inversion problems. Both experiments consist of optimizing regional black carbon emissions using aircraft measurements from the ARCTAS-CARB (Arctic Research of the Composition of the Troposphere from Aircraft and Satellites in conjunction with the California Air Resources Board) campaign.
3.2.1 Performance Evaluation for a low DOFs problem
Description
The small black carbon source inversion problem (referred to as small BC thereafter) uses real data from an SP2 device measuring black carbon on a single flight from ARCTAS-CARB on 22 June, 2008 [ref]. The experiment is identical to FINN, as described in Guerrette and Henze (2017). The prior inventories are derived from FINN biomass burning (BB) emissions and NEI 2005 anthropogenic (ANT) emissions. Scaling factors for the BC emissions resolved hourly and across grid cells are optimized assuming a prior value of 1. A factor of 3.8 uncertainty for BB and factor of 2 for ANT are prescribed, both with a 36 km horizontal correlation length scale. The results here are slightly different from those described in Guerrette and Henze (2017), because since that paper was published errors in the trajectories associated with the solar radiation treatment have been diagnosed and fixed. Improved symmetry between TL and AD used for this study give more accurate results than those in Guerrette and Henze (2017). The previously existing asymetry was severe enough to prevent RSVD from converging effectively, but was less detrimental to VarCG. The total number of observations considered for the small BC problem is 241. While the total dimension of the control vector was 299568, the number of grid columns with active emission values was 121912 and the number of grid cells with significant emission values is far fewer. As a result, the small BC problem has a low DOFs of 17.9 (see Fig. 7). For a problem of those dimensions, the exact computation of both the posterior mean emissions and the posterior error covariance matrices is possible. Therefore the approximations obtained from the different minimization algorithms can be evaluated against the exact solutions. For the small BC problem, the VarBL algorithm was not tested, therefore only results obtained with the VarCG and RIOT methods are presented and discussed. As described in Guerrette and Henze (2017), a lognormal transformation is applied to the emission control vector in order to enforce positiveness of the solution. Therefore, although the transport model is a linear solver, non-linearities are introduced in the resulting forward model that requires the use of non-quadratic minimization algorithm such as incremental 4D-Var.
Results and Discussion
In this section we first analyze the behavior of VarCG and RIOT for the inner loop minimization. In particular, we assess the impact of the adaptive approach used in RIOT in order to guarantee statistical optimality of the approximated posterior increment and posterior covariance matrices.
Figure 5 shows the squared Euclidian norm of the error in the posterior increment as a function of the rank of the Hessian approximation for the first outer iteration of the 4D-Var minimization. The rank of the Hessian approximation is equal to the number of iterations in the case of VarCG, and to the number of approximated eigenvectors retained in the RSVD Hessian approximation in the case of RIOT. Figure 5 also presents the performance of RIOT for different values of the oversampling parameter , whose value defines the total number of samples used in the RSVD approximation.
Results from a Hessian approximation with exact truncated eigendecomposition (TSVD), where only the first eigenvectors are retained are also shown for each rank . The RIOT increments show very noisy features in the case where no oversampling is used in the RSVD (). It appears that a small overampling parameter of is enough to obtain an increment similar to the adaptive approach applied to the exact truncated SVD approximation, while yields almost perfect agreement. This is consistent with Halko et al. (2011), in which oversampling parameters of were sufficient to ensure good convergence properties in most cases. As expected from their statistical properties (see Bousserez and Henze (2018)), TSVD-LRA is associated with smaller increment errors than TSVD-LRU from rank 1 to 12, which correspond to prior-preconditioned Hessian eigenvalues (not shown), while TSVD-LRU outperforms TSVD-LRA for higher ranks. One observes that the adaptive approach applied to TSVD provides a solution that is close to the optimal (minimum increment error) one. Nevertheless, the adaptive selection of the LRU update occurred 3 ranks too late (at rank 15) compared to the optimal transition point. This can be explained by both the statistical nature of the criteria, that is, the optimality is valid in average, not necessarily for a particular realization of the prior and observations, and the fact that the optimal transition depends (even in average) on the full spectra of the Hessian. More specifically, while in average TSVD-LRU will always correspond to smaller errors than TSVD-LRA for truncation indices associated with eigenvalues , the reverse does not necessarily hold for . As a result, in general the rank associated with the optimal transition from TSVD-LRA to TSVD-LRU can be smaller than the one inferred from the RIOT adaptive criteria. Those results suggest another, deterministic method for the adaptive approach. Assuming a maximum of samples can be afforded for each inner iteration in RIOT, once the rank- RSVD approximation of the Hessian is available it is possible to perform two parallel integrations of the non-linear forward model, one using the TSVD-LRA and the other the TSVD-LRU increment update, in order to evaluate the non-quadratic cost function J. The increment update corresponding to the smallest J would then be selected, and the same procedure applied to the next inner loop. In practice this would require an additional (possibly parallel) integration of the non-linear model compared to the current RIOT algorithm.
For a given rank , the VarCG method significantly outperforms both RIOT and the TSVD approach, reaching convergence after only 10 inner iterations. This can be explained by the fact that the CG algorithm ensures minimum error residuals at each iteration (Golub and Loan, 1989). In particular, a key difference between the RSVD-based or TSVD Hessian approximations and the implicit Hessian approximation obtained from the CG minimization is that the latter method exploits the gradient information (i.e., in the right-hand side of (6).) While RIOT utilizes approximations of the modes of the Hessian in order of decreasing eigenvalue, CG tends to approximate Hessian modes that provide significant contribution to the reduction of the error residual. In this experiment, since only about samples were needed for RIOT to converge (the convergence criteria being defined as ), all randomized SVD integrations were run in parallel. This yielded about an order of magnitude decrease in wall-time compared to the 10 iterations required for the VarCG algorithm.
Figure 6 summarizes the performance of different VarCG and RIOT configurations in terms of cost function reduction, number of cores used per model integration and wall-time. Two metrics are considered, a local one based on computational performances per outer iteration, and a global one associated with performances for the entire minimization. A total of 15 outer iterations was used in those experiments, after which a minimization is considered to have converged if its final cost function value is within % of the minimum cost function across all methods. The selected threshold corresponds to the upper bound of cost functions values among all configurations that plateaued after 15 iterations. In terms of number of outer iterations required to reach convergence, RIOT with 40 RSVD samples () yields similar performance as the VarCG algorithm using 10 inner iterations (). While VarCG with uses 96 cores per model integration, RIOT with uses about 10 times as many cores due to the 40 parallel model integrations running on 24 cores each (Fig. 6 d)). However, RIOT with enables a threefold decrease in wall-time per outer iteration compared to VarCG with (Fig. 6 d)). It should be noted that the wall-time performance of the MPI parallelization is intrinsically limited by the increase in communications relative to useful computations as more cores are used. In the case of the WRF model integrations, the robustness of those communications breaks down for more than 96 cores on the small regional model domain considered. Therefore, RIOT provides an additional level of parallelism that allows one to increase the wall-time performance beyond that obtained from a model-level MPI parallelization (Fig. 6 b)).
The effect of preconditioning in RIOT is also shown in Fig. 6. A significant increase in the rate of convergence of the minimization is obtained when using the spectral preconditioner with rotation (see Sec. 2.2) for RIOT with , while the preconditioning technique slightly degrades the performance of the minimization compared to no preconditioning when . For (not shown), the degradation obtained from preconditioning RIOT was even larger. These results illustrate two competing effects of the spectral preconditioning: reduction of the condition number that degrades the performance of RSVD methods and filtering of the Hessian modes resolved in previous outer iterations that enables the RSVD to focus on the remaining eigenmodes. For , the rate of decrease of the preconditioned Hessian eigenspectra remains fast enough across the first outer iterations to obtain an accurate RSVD approximation of the unresolved modes. As the rank of the Hessian approximation is increased the eigenspectrum of the preconditioned Hessian tends to be flatter, which can generate noisy RSVD approximations. The degradation in the accuracy of the RSVD Hessian estimate can then offset the benefits obtained from filtering out the previously resolved modes using the spectral preconditioning. Therefore, as a general rule, the use of a spectral preconditioner in RIOT will tend to be beneficial for a Hessian with fast decreasing eigenspectrum and when the rank of the Hessian approximation is small (compared to the number of eigenvalues with significant gaps). Figure 7, which shows the eigenspectrum of the prior-preconditioned Hessian in the first outer iteration of the minimization, further supports this interpretation. Indeed, when the rank of the Hessian approximation is very small (e.g., ) compared to the maximum eigenvalue indice associated with fast decreasing spectrum (), one obtains a clear benefit from using the preconditioner in RIOT. Conversely, when the rank of the Hessian approximation corresponds to eigenvalue indices located in the flat regime of the spectrum (e.g., ), a non-preconditioned RIOT will outperform a preconditioned one.
Figure 8 summarizes the performances of the different minimization configurations in terms of both the wall-time and a metric directly related to energy consumption. The energy-related metric chosen here is the product of the number of cores simultaneously used and the wall-time, either per outer iteration (Fig. 8 a)) or per converged minimization (Fig. 8 b)). In term of per outer iteration cost, the RIOT configurations with 2 samples () performed best in term of both walltime and energy cost, followed by VarCG with (i.e., two inner-loop iterations) and cores per model integration. However, when considering the whole minimization (Fig. 8 b)), VarCG configurations with more inner iterations () and at least 48 cores performed better than VarCG with iterations. Although it required more than 4 times as much energy demand as the fastest VarCG configuration (), RIOT with samples allowed to reduce the walltime by more than a factor of 2 compared to the latter, making this algorithm an attractive approach when energy cost is not a significant limitation of the system considered.
In addition to the performance of the 4D-Var minimization, it is useful to compare the computational efficiency of RIOT and VarCG for estimating the posterior error covariance matrix, or, equivalently, the inverse Hessian of the cost function at its minimum. As described in Sec. 2.2, another useful application of Hessian approximations is the preconditioning of the inner loop minimizations in incremental 4D-Var.
Figure 9 shows the Frobenius norm of the relative error approximation in the posterior error covariance matrix (Bousserez and Henze, 2018) as a function of the rank of the Hessian approximation for different methods. The rank of the approximation refers to either the number of iterations in the case of CG, the number of RSVD samples in the case of RIOT, or the number of eigenmodes retained in the case of TSVD. The posterior covariance approximation error for TSVD is shown for both the LRA and LRU updates, as well as for the adaptive method that is used in the RIOT algorithm (see Sec. 2.2). The TSVD results clearly illustrate the importance of using the adaptive approach to define the posterior covariance approximation, with a fast decreasing error for LRU reaching a magnitude about 15 times smaller than that of the LRA estimates for ranks greater than 13 (where the adaptive transition to the LRU approximation occurs). However, it appears that the LRA estimate yields larger approximation errors than LRU even before the rank-13 approximation (that is, for ranks greater than 5). As explained in Bousserez and Henze (2018), the superiority of the truncated LRU approximation over LRA is guaranteed only when the last mode used in the approximation is associated with an eigenvalue smaller than one. In the case where the smallest eigenvalue in the approximation is greater than one, using the LRA approximation may lead to a suboptimal choice, as we find here. The efficiency of the adaptive method, that is, its ability to select the optimal rank for the transition to the LRU approximation depends on the characteristic of the Hessian spectra itself. In particular, the adaptive approximation is more likely to generate more optimal posterior covariance estimates for fast decreasing eigenspectra than for relatively flat eigenspectra. In practice, for a given inverse problem (e.g., data assimilation for NWP, atmospheric source inversion), one might adapt the LRA to LRU transition criteria to maximize the performance of the approximations based on the characteristics of the Hessian eigenspectra.
Figure 9 shows that both the RIOT and VarCG methods provide approximations to the posterior error covariance that are close to optimality for ranks greater than 20. We note that the VarCG approximation is associated with smaller errors than TSVD adaptive or RIOT from rank 8 to 12, which is due to an earlier transition to the LRU approximation in the case of VarCG (not shown here). This improvement is therefore an artifact of the underestimation of the true eigenvalues for indices 8 to 12 by the Lanczos-CG algorithm. The fact that the Lanczos method tends to approximate the extrema of the eigenspectrum (i.e., maximum and minimum eigenvalues) is well-known (Golub and Loan, 1989). The underestimation of the smallest approximated eigenvalues for a given rank that we obtain here is thus consistent with the theory. For indices greater than 12, the exact eigenvalues quickly reach a minimum plateau (not shown here), which greatly mitigates the above-mentioned effect and explains the fast convergence of the VarCG posterior error covariance approximations to the TSVD approximations.
Finally, it is worthwhile to note that for all methods convergence to the exact posterior error covariance matrix is obtained for a rank 40 only. While it requires 40 iterations in the case of VarCG, it requires 40 parallel TL and AD integrations in the case of RIOT (i.e., a wall-time equivalent to one VarCG iteration). Thus, provided enough cores are available to perform all parallel RSVD integrations at once, RIOT allows one to reduce the wall-time cost of the approximation by a factor 40. This should be contrasted with the posterior increment approximation, for which the VarCG algorithm is more competitive due to its ability to optimally exploit gradient information and to enforce minimum residual error at each iteration through the conjugacy property.
3.2.2 Scalability for Problems With Larger DOFs
Description
The large black carbon source inversion problem (referred to as large BC thereafter) is designed to test the scalability of the RIOT algorithm for problems involving larger DOFs. To this aim, a large number of pseudo-observations are generated by fictitiously increasing the number of aircraft measurement. For this scenario, the domain covers California between 00:00Z and 23:59:59Z on 22 June 2008. Four aircraft trajectories from the ARCTAS-CARB campaign are spread out over California as described in Guerrette and Henze (2017). A hypothetical plane is launched on each of those trajectories at approximately the beginning of each hour between 02Z and 15Z (9PM PST on 21 Jun 2008 to 10AM PST on 22 Jun 2008 PST) for a total of 14*4 = 56 flights. This setup provides a total of 12,796 grid-scale observations of black carbon. As a result, the large BC problem has a higher DOFs of about 219 (see Fig. 10).
These observations were simulated in a base case that uses FINN biomass burning (BB) emissions and NEI 2005 anthropogenic (ANT) emissions. In our observation system simulation experiment (OSSE), the prior was defined by perturbing the inventories using a factor of 3.8 uncertainty for BB and factor of 2 for ANT, both with a 36 km horizontal error correlation length scale. The BC observations were perturbed according to their uncertainties with no correlation. Two independent additive error sources are assumed for the observations: a uniform 12% relative uncertainty and an absolute uncertainty of 0.02 g/m3. Other than the observations and OSSE setup, the experiment follows the settings for FINN described in Guerrette and Henze (2017). As explained in Section 3.2.1, the lognormal transformation applied to the emission control vector makes the cost function to minimize non-quadratic, which justifies the use of the incremental 4D-Var approach.
Results and Discussion
Figure 11 shows the convergence performance for the non-quadratic cost function for different VarCG and RIOT configurations, as a function of the elapsed walltime of the minimization, the number of outer loop iterations or the number of cores. Similarly to previous results for the small BC experiment, the threshold for convergence was defined based on the maximum cost function reduction across all methods, such that all cost functions that are within % of the minimum cost function after 15 iterations are considered as converged.
Both VarCG with inner iterations and RIOT with samples show the fastest convergence in 4 outer iterations, followed by VarCG with inner iterations in 3 outer iterations. The improvement from the preconditioning technique in RIOT is evident especially when samples are used in the RSVD. In the case with samples, the preconditioning allows further decrease of the cost function beyond outer iteration 5, in contrast with the non-preconditioned case for which the cost function plateaus. Preconditioning RIOT when samples are used in the RSVD degrades the performance of the minimization (not shown here), which is consistent with the results obtained for the small BC problem when the rank of the Hessian approximation was increased (see Sec. 3.2.1). Again, this shows that the threshold to determine if the preconditioning should be used in RIOT or not for a given number of samples will depend on the specific inverse problem at hand. Figure 10 shows the eigenspectrum of the prior-preconditioned Hessian in the first outer iteration of the minimization. Similar to what was found in the case of the small BC problem (see Sec. 3.2.1), the Hessian eigenspectrum can help illustrate the impact of the preconditioner in RIOT and can be used as a diagnostic tool to adapt the preconditioning technique to specific inverse problems. In this example, the preconditioner is beneficial for ranks of the Hessian approximation corresponding to eigenvalue indices situated in the fast decreasing regime of the spectrum (e.g., ), while for higher ranks () the non-preconditioned minimization performs better.
Figure 11 also shows the impact of exploiting the model-level parallelization in VarCG. As the number of cores per model integration is increased from 24 to 96 (i.e., by a factor 4) in VarCG with inner iterations, the wall-time to reach convergence is reduced by more than a factor of 2 (see pannel b)). As discussed in Section 3.2.1, the MPI parallelization limit for WRF-DA being defined by 96 cores, these results show that RIOT can be used to further reduce the wall-time cost by a factor 2.5 at the expense of mobilizing about 37 times more cores simultaneously during the RSVD parallel integrations.
Similar to the results shown in Sec. 3.2.1 for the small BC problem, Fig. 12 summarizes the performances of the different minimization configurations in term of both the wall-time and a metric related to energy consumption (i.e., the product of the number of cores and the elapsed walltime). Although RIOT with samples allows to drastically reduce the wall-time per outer iteration compared to VarCG without increasing the energy cost, a significant increase in the number of outer iterations is required to reach convergence when less than samples are used. As a consequence, at least samples are necessary to take advantage of the RIOT algorithm, in which case the wall-time can be reduced by a factor of about 2.5 compared to the best VarCG configuration ( with cores), at the expense of about an order of magnitude increase in energy cost.
Note that in our experiment the cost of the TL and non-linear forward model integrations are approximately the same. It is worth mentioning that in an operational context (e.g., ECMWF’s system) the TL and AD models in incremental 4D-Var usually correspond to a simplified version of the non-linear forward model and are run on a much coarser grid resolution to make the minimization computationally tractable. Therefore, similar numerical experiments conducted with such systems could lead to different conclusions.
4 Conclusion
In this study a new technique for incremental 4D-Var systems, the Randomized Incremental Optimal Technique (RIOT), is introduced and evaluated using both a Lorenz-96 (L-96) system and a realistic black carbon source inversion problem based on WRF-DA. This method allows one to replace the standard iterative Conjugate-Gradient (VarCG) method in each quadratic inner loop minimization by a fully parallel randomized singular vector decomposition (RSVD) algorithm. RIOT exploits the optimality results established by Bousserez and Henze (2018) and Spantini et al. (2015) to compute the dominant modes of the prior-preconditioned Hessian using an RSVD method. Those eigenmodes are used to build approximations of both the posterior increments and posterior error covariance matrices, which are adaptively constructed based on the characteristics of the Hessian spectrum.
For the L-96 problem, quasi-linear to very non-linear regimes were considered. Provided enough samples (i.e., parallel tangent-linear and adjoint integrations) are used in the RSVD, RIOT shows non-quadratic optimization performances comparable to the VarCG method. Our results demonstrate that the computational efficiency of RIOT relative to VarCG increases with the non-linearity of the L-96 system. The need for an outer loop Hessian preconditioning technique in RIOT has also been highlighted as well as the importance of adapting the traditional spectral preconditioner to the RSVD approach. In our experiments, the Hessian spectral preconditioner is combined with a so-called rotation technique in order to improve the RSVD sampling across outer iterations. For the L-96 problem, this preconditioning technique significantly improved the performance of the optimization and proved to be much more stable than either the spectral preconditioner alone or a non-preconditioned minimization, especially when the non-linearity of the system is increased.
For the black carbon source inversion problems, our analysis shows that for a given rank of the approximation (i.e., number of inner iterations in VarCG and number of RSVD samples in RIOT) the VarCG method generates posterior increments with systematically smaller errors than those obtained from RIOT, which stems from the minimum error residual property enforced by construction in VarCG. However, RIOT is shown to produce equivalent solutions using a Hessian approximation of rank about three times as large as the number of VarCG inner iterations. Assuming a large number of RSVD samples can be run in parallel (i.e., a large amount of cores can be used simultaneously), the overall wall-time cost of the minimization in the black carbon inversion problems can be reduced by a factor 2 to 3 when using RIOT instead of an iterative VarCG minimization, at the expense of increasing the energy cost by a factor 4 to 10. Increasing the number of observations in the black carbon pseudo-inversion experiment so that the DOFs is at least an order of magnitude higher doubled the energy cost of the minimization. Our results also showed that the use of the spectral preconditioner with rotation can be detrimental to the minimization compared to a non-preconditioned approach when the rank of the Hessian approximation becomes large and the Hessian has a fast-decreasing spectra. Therefore, for a given inverse problem, a more tailored preconditioning approach based on the characteristics of its Hessian eigenspectra will be necessary to ensure efficient RIOT implementations. In the case of the posterior covariance errors estimates, RIOT is shown to be even more competitive with respect to the VarCG method, with about a factor of 40 improvement in wall-time performance for similar quality of the approximation.
In order to assess the potential of RIOT for operational NWP DA applications, further numerical experiments are in progress based on ECMWF’s incremental 4D-Var system. One limitation of RIOT is that the performance of the minimization will strongly depend on the characteristic of the Hessian spectra. Although inverse problems with fast decreasing Hessian spectra are associated with good performances of the RSVD approach, DA problems in NWP are known to generate flatter Hessian spectra, in particular due to the availability of large amount of data with similar weights in the assimilation. In such cases a compromise between Krylov subspace iterative algorithms and randomized parallel methods, such as the use of randomized power iterations (Halko et al., 2011) or the Block-Lanczos algorithm (Golub and Loan, 1989) to approximate the Hessian, is a promising approach to mitigate the dependence of the performance of the RIOT algorithm on the Hessian spectra. Preliminary results obtained from a Block-Lanczos approach and presented in this study in the context of the L-96 problem demonstrate the strong potential of such hybrid iterative-randomized methods and should be further investigated in the context of more realistic large-scale inverse problems. Another useful application of the RSVD approach in the incremental 4D-Var context would be to use randomized Hessian decompositions to precondition the outer loop. Indeed, the possibility to generate Hessian approximations of much higher rank using randomized decomposition techniques compared to iterative Krylov methods (inherently limited by their sequential nature) has a strong potential. Moreover, posterior square-root sampling methods used to allow for flow-dependency in cycling 4D-Var DA and that currently rely on Hessian approximations based on iterative minimization algorithms (e.g., BFGS or VarCG) are limited by the low-rank nature of the estimates, as suggested by Auligné et al. (2016). The performance of the RSVD to approximate the posterior error covariance matrix presented in this study clearly demonstrates the potential of this approach to address those limitations.
Acknowledgments
The authors acknowledge support from the NOAA grant NA16OAR4310113. The authors acknowledge the NOAA Research and Development High Performance Computing Program for providing computing and storage resources that have contributed to the research results reported within this paper (url:https://rdhpcs.noaa.gov). This research was performed while J. Guerrette held an NRC Research Associateship award at NOAA/ESRL. Nicolas Bousserez performed part of the research while working on the CHE project at ECMWF. The CHE project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 776186.
Appendix
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Asch et al. (2016) Asch M, Bocquet M, Nodet M. 2016. Data assimilation: methods, algorithms, and applications. SIAM 11 .
- 2Auligné et al. (2016) Auligné T, Ménétrier B, Lorenc AC, Buehner M. 2016. Ensemble-variational integrated localized data assimilation 144 (10): 3677–3696.
- 3Bocquet et al. (2015) Bocquet M, Elbern H, Eskes H, Hirtl M, Žabkar R, Carmichael GR, Flemming J, Inness A, Pagowski M, Pérez Camaño JL, Saide PE, San Jose R, Sofiev M, Vira J, Baklanov A, Carnevale C, Grell G, Seigneur C. 2015. Data assimilation in atmospheric chemistry models: current status and future prospects for coupled chemistry meteorology models. Atmospheric Chemistry and Physics 15 (10): 5325–5358, doi: 10.5194/acp-15-5325-2015 .
- 4Bousserez and Henze (2018) Bousserez N, Henze D. 2018. Optimal and scalable methods to approximate the solutions of large-scale Bayesian problems: theory and application to atmospheric inversion and data assimilation. Quarterly Journal of the Royal Meteorological Society 144 (711): 365–390.
- 5Bui-Thanh et al. (2012) Bui-Thanh T, Burstedde C, Ghattas O, Martin J, Stadler G, Wilcox LC. 2012. Extreme-scale UQ for Bayesian inverse problems governed by pdes : 1–11doi: 10.1109/SC.2012.56 .
- 6Byrd et al. (1995) Byrd RH, Lu P, Nocedal J, Zhu C. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16 (5): 1190–1208.
- 7Carrassi et al. (2018) Carrassi A, Bocquet M, Bertino L, Evensen G. 2018. Data assimilation in the geosciences: An overview of methods, issues, and perspectives. Wiley Interdisciplinary Reviews: Climate Change 9 (5).
- 8Chao and Chang (1992) Chao WC, Chang LP. 1992. Development of a four-dimensional variational analysis system using the adjoint method at GLA. part i: Dynamics. Mon. Weather Rev. 120 : 1661–1673.
