The Generalized Cross Validation Filter
Giulio Bottegal, Gianluigi Pillonetto

TL;DR
This paper introduces a novel GCV filter that extends Kalman filtering to enable efficient, real-time updating of the GCV score, facilitating online parameter estimation in inverse problems and regularization.
Contribution
The paper develops a new GCV filter that reduces update complexity to O(1), allowing for online application of GCV in state estimation and system identification.
Findings
GCV update cost is reduced to O(1) per time step.
The GCV filter extends Kalman filter equations for online use.
Applications demonstrated in state estimation and system identification.
Abstract
Generalized cross validation (GCV) is one of the most important approaches used to estimate parameters in the context of inverse problems and regularization techniques. A notable example is the determination of the smoothness parameter in splines. When the data are generated by a state space model, like in the spline case, efficient algorithms are available to evaluate the GCV score with complexity that scales linearly in the data set size. However, these methods are not amenable to on-line applications since they rely on forward and backward recursions. Hence, if the objective has been evaluated at time and new data arrive at time t, then O(t) operations are needed to update the GCV score. In this paper we instead show that the update cost is , thus paving the way to the on-line use of GCV. This result is obtained by deriving the novel GCV filter which extends the classical…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
††thanks: This research has been partially supported by the MIUR FIRB project RBFR12M3AC-Learning meets time: a new computational approach to learning in dynamic systems and by the Progetto di Ateneo CPDA147754/14-New statistical learning approach for multi-agents adaptive estimation and coverage control. This paper was not presented at any IFAC meeting. Corresponding author Gianluigi Pillonetto Ph. +390498277607.
and
The Generalized Cross Validation Filter
Giulio Bottegal
Gianluigi Pillonetto
Department of Electrical Engineering, TU Eindhoven, Eindhoven, The Netherlands (e-mail: [email protected])
Department of Information Engineering, University of Padova, Padova, Italy (e-mail: [email protected])
Abstract
Generalized cross validation (GCV) is one of the most important approaches used to estimate parameters in the context of inverse problems and regularization techniques. A notable example is the determination of the smoothness parameter in splines. When the data are generated by a state space model, like in the spline case, efficient algorithms are available to evaluate the GCV score with complexity that scales linearly in the data set size. However, these methods are not amenable to on-line applications since they rely on forward and backward recursions. Hence, if the objective has been evaluated at time and new data arrive at time , then operations are needed to update the GCV score. In this paper we instead show that the update cost is , thus paving the way to the on-line use of GCV. This result is obtained by deriving the novel GCV filter which extends the classical Kalman filter equations to efficiently propagate the GCV score over time. We also illustrate applications of the new filter in the context of state estimation and on-line regularized linear system identification.
keywords:
Kalman filtering; generalized cross-validation; on-line system identification; inverse problems; regularization; smoothness parameter; splines
1 Introduction
Linear state space models assume the form
[TABLE]
where is the state at instant , is the output, while and are random noises. The matrices and regulate the state transition and the observation model at instant . This kind of models plays a central role in the analysis and design of discrete-time systems [17]. Applications abound and include tracking, navigation and biomedicine.
In on-line state estimation, the problem is the reconstruction of the values of from measurements of collected over time. When the matrices and and the noises covariances are known, the optimal linear estimates are efficiently returned by the classical Kalman filter [1]. However, in many circumstances there can be unknown model parameters that also need to be inferred from data in an on-line manner, e.g. variance components or entries of the transition/observation matrices. One can interpret such parameters as additional states. Then, the extended Kalman filter [16] or more sophysticated stochastic techniques, such as particle filters and Markov chain Monte Carlo [10, 22, 3, 9], can be used to track the filtered posterior. Another technique consists of propagating the marginal likelihood of the unknown parameters via a bank of filters [1, Ch. 10]. In this paper, we will show that another viable alternative is the use of an approach known in the literature as generalized cross validation (GCV) [12].
In the literature of statistics and inverse problems, GCV is widely used in off-line contexts to estimate unknown parameters entering regularized estimators [5, 37, 40]. This approach was initially used to tune the smoothness parameter in ridge regression and smoothing splines [14, 12, 33]. GCV is now also popular in machine learning, used to improve the generalization capability of regularized kernel-based approaches [34, 8], such as regularization networks, which contain spline regression as special case [31, 11].
To introduce GCV in our state space context, we first recall that smoothing splines are closely linked to state space models of -fold integrated Wiener processes [25]; then it appears natural to extend GCV to general state space models. To this end, assume that measurements have been collected up to instant and stacked in the vector . Denote with the vector containing the optimal linear output estimate111The components of are thus given by , where the smoothed state can be obtained for any with operations by a fixed-interval Kalman smoothing filter [32, 19]. and use to denote the influence matrix satisfying
[TABLE]
Then, the parameter estimates achieved by GCV minimize
[TABLE]
where is the sum of squared residuals, i.e.
[TABLE]
and are the degrees of freedom given by the trace of , i.e.
[TABLE]
In the objective (2), the term accounts for the goodness of fit while assumes values on and measures model complexity. In fact, in nonparametric regularized estimation, the degrees of freedom can be seen as the counterpart of the number of parameters entering a parametric model [20, 13, 26].
GCV is supported by important asymptotic results. Also, for finite data set size it turns often out a good approximation of the output mean squared error [7]. It is worth stressing that such properties have been derived without postulating the correctness of the prior models describing the output data [38, 39]. In control, this means that GCV can compensate for possible modeling mismatch affecting the state space description.
Despite these nice features, the use of GCV within the control community appears limited, in particular in on-line contexts. One important reason is the following one. For state space models, there exist efficient algorithms which, for a given parameter vector, return its GCV score with operations [18, 4], see also [15, 35, 21] for procedures dedicated to smoothing splines. But all of these techniques are not suited to on-line computations since they involve forward and backward recursions. Hence, if is available and new data arrive at time , other operations are needed to achieve . In this paper, we will instead show that the update cost is , thus paving the way to a more pervasive on-line use of GCV. This result is obtained by deriving the novel GCV filter which consists of an extension of the classical Kalman equations. Thanks to it, one can run a bank of filters (possibly in parallel) to efficiently propagate GCV over a grid of parameter values. This makes the proposed GCV filter particularly suitable for applications where a measurement model admits a state space description with dynamics depending on few parameters, see e.g. the next section for an application in numerical differentiation. In this framework, an implementation of the GCV filter via a bank of parallel filters turns out computationally attractive.
The paper is organized as follows. In Section 2, first some additional notation is introduced. Then, the GCV filter is presented. Its asymptotic properties are then discussed in Section 3. In Section 4 we illustrate some applications, including also smoothing splines and on-line regularized linear system identification with the stable spline kernel used as stochastic model for the impulse response [28, 29]. Conclusions end the paper while the correctness of the GCV filter is shown in Appendix.
2 The GCV filter
2.1 State space model
First, we provide full details about our measurements model. We use to denote a random vector with mean and covariance matrix . Then, our state space model is defined by
[TABLE]
where the initial condition and all the nosies are mutually uncorrelated. We do not specify any particular distribution for these variables, since the GCV score does not depend on the particular noise distribution222Of course, GCV may result not effective if the noises are highly non-Gaussian. Different approaches, like particle filters, should instead be used if linear estimators perform poorly due e.g. to multimodal noise distributions.. If are Gaussian, then the Kalman filter provides the optimal state estimate in the mean-square sense. In the other cases, the Kalman filter corresponds to the best linear state estimator [1]. In addition, just to simplify notation the measurements are assumed scalar, so that represents the noise variance.
We assume that some of the parameters in (3) may be unknown, or could enter and ; however, we do not stress this possible dependence to make the formulas more readable. The matrix is assumed to be independent of . Such parameter is typically unknown, being connected to the ratio between the measurement noise variance and the variance of the driving noise. It corresponds to the regularization parameter in the smoothing-splines context described in the example below.
Example 1** (Smoothing splines [30]).**
Function estimation and numerical differentiation are often required in various applications. These include also input reconstruction in nonlinear dynamic systems as described e.g. in [30]. Assume that one is interested in determining the first derivatives of a continuous-time signal measured with non-uniform sampling periods . Modeling the signal as an -th fold integrated Wiener process one obtains the stochastic interpretation of the -th order smoothing splines [40]. In particular, one can use (3) to represent the signal dynamics as follows
[TABLE]
Such model depends on the measurement noise variance , making this application particularly suited for the GCV filter.
2.2 The GCV filter
The GCV filter equations are now reported. Below, denotes the optimal linear one-step ahead state prediction having covariance . Its dynamics are regulated by the classical Kalman filter via (5a), (5c) and the Riccati equation (5e).
GCV filter
Initialization
[TABLE]
Update
[TABLE]
It is apparent that the difference w.r.t the classical Kalman filter is the presence of the additional state of the same dimension of . Comparing (5c) and (5d), one can see that is replaced by . In addition, the dynamics of are still driven by the innovation , but the Kalman gain given by (5a) is substituted by the defined by (5b). In turn, such gain depends on which is propagated over time through a modified version of the Riccati equation given by (5f). The GCV filter is graphically depicted in Fig. 1.
3 Asymptotic behavior and the smoothing ratio
3.1 Asymptotic behavior of the GCV filter
We first consider the case where the state-space model (3) is time-invariant, i.e. the matrices , , and are constant in . The structure of the equations governing the GCV filter permits to easily understand its asymptotic behaviour. In particular, exploiting well known properties of the Kalman filter [1], the following result is obtained (see the Appendix for a proof).
Proposition 1**.**
Assume that the system (3) is time-invariant, stabilizable and detectable. Then, for any we have
[TABLE]
where and are the unique symmetric and semidefinite positive matrices solving, respectively, the algebraic Riccati equation
[TABLE]
and the Lyapunov equation
[TABLE]
where . In addition, all the roots of the matrix are inside the unit circle so that the (asymptotic) GCV filter is asymptotically stable.
Properties of the GCV filter can be also characterized in the time-varying case. In particular, following Section 2 of [2], one can first replace stabilizability and detectability with the assumptions of uniform stabilizability and detectability. Then, following the same reasonings contained in the proof of Proposition 1, Theorem 5.3 in [2] ensures the uniform exponential stability of the GCV filter.
3.2 Fast regularization parameter tuning and the smoothing ratio
Proposition 1 leads also to a new computationally appealing approach to tune the regularization parameter e.g. in smoothing splines. In particular, consider the scenario described in [21] where an unknown function has to be reconstructed by spline regression from equally spaced noisy samples. When assumptions in Proposition 1 hold true, it is possible to compute off-line the gains
[TABLE]
Then, one can exploit the asymptotic (suboptimal) GCV filter, with the guarantee that the objective values will converge to the exact GCV scores as increases. Moreover, in off-line contexts this approach appears computationally appealing even when compared to the many GCV-based spline algorithms developed in the last decades [41, 40, 18, 15, 35].
Furthermore, [21] defined the asymptotic smoothing ratio as
[TABLE]
also providing an interesting closed-form expression for the cubic splines case useful to further speed up the tuning of . For the general case, we notice that Proposition 1 gives also a numerical procedure to compute the asymptotic smoothing ratio (for different values of ). In fact, if (3) is stabilizable and detectable, combining (5g) and Proposition 1 we obtain
[TABLE]
with and defined, respectively, in (6) and (7).
4 Numerical Examples
4.1 Spline example
We consider the reconstruction of the function taken from [29] from samples collected at 400 instants randomly generated from a uniform distribution on . The measurement noise is Gaussian with standard deviation equal to 0.3. We model as the two-fold integral of white noise setting in the time-varying state space model reported in Example 1. This corresponds to reconstructing using cubic smoothing splines [40].
We use to denote the vector containing the noiseless outputs (corresponding to the second entries of ). We denote the average of by the scalar quantity . Then, the performance measure is the percentage fit
[TABLE]
where is the estimate of obtained through the Kalman smoother [1], and a column vector with all entries equal to 1. The following two different estimators are tested:
- •
GCV: this approach estimates exploiting the GCV filter. More specifically, the GCV score is propagated over a grid containing 100 values of logarithmically spaced on the interval . Then, at any the estimate is computed by a Kalman smoothing filter which exploits the that minimizes .
- •
Oracle: the same as GCV except that maximizes the fit . Note that this approach is not implementable in practice since it uses an oracle that knows the noiseless (unavailable) output .
The left panel of Fig. 2 displays the noiseless output (solid line), the measurements () and the function estimate returned by GCV (dashed line) which turns out close to . The right panel also shows that the GCV filter is able to track well and in an on-line manner the time-course of returned by Oracle.
4.2 GCV capability to compensate for model mismatch
We consider the following discrete-time model (see also [23, Section 6]):
[TABLE]
with zero-mean Gaussian noises of covariances
[TABLE]
We will use data generated by this model to test the capability of the GCV filter to compensate for mismatches between the true system and the model used to track the data by tuning in an on-line manner. As in the previous example is the vector containing the first noiseless outputs (which are the second entries of ) and the performance measure is (8). The following three different estimators are tested:
- •
GCV: this approach uses a wrong transition covariance given by
[TABLE]
and then estimates exploiting the GCV filter over a grid with 100 values logarithmically spaced on . Then, at any the estimate is computed by a Kalman smoothing filter which exploits the minimizing .
- •
Oracle: the same as GCV except that maximizes the fit in (8).
- •
Nominal: the estimate is returned by a Kalman smoothing filter defined by the nominal wrong covariance and . Thus, this approach does not try to compensate for model mismatch since it does not tune from data.
The left panel of Fig. 3 displays the noiseless output (solid line), the measurements () and the smoothed output obtained by GCV (dashed line) which appears close to truth. In the right panel, one can also see the trajectory in time of the returned by Oracle and by GCV. One can appreciate the capability of the GCV filter to compensate the modelling mismatch by tracking a regularization parameter leading to a high fit .
To further support these findings, we have also performed a Monte Carlo study of 100 runs. During each run, 200 output measurements are generated using (9) and is reconstructed by GCV, Oracle and Nominal. From the MATLAB boxplots of the 100 fits (8) reported in Fig. 4, the robustness of GCV emerges clearly. Its performance is in fact very close to that of the oracle-based procedure.
4.3 On-line regularized linear system identification
Now, we consider a linear system identification problem where the aim is to estimate an unknown impulse response from input-output measurements. Assuming a high order FIR, the model describing the outputs collected up to instant , and stacked in the (column) vector , is
[TABLE]
where denotes the -dimensional vector whose components are the impulse response coefficients, the regression matrix is defined by the input samples and is the measurement noise vector, which we assume white and Gaussian.
To solve this problem, we use the kernel-based approach originally proposed in [28, 27, 6]. The impulse response estimate is given by
[TABLE]
It makes use of the regularization matrix induced by the so called first-order spline kernel, i.e. its entry is
[TABLE]
where is an hyperparameter which regulates the rate of decay to zero of the components of . We refer the reader also to [29] for further details on advantages of (11) over classical parametric approaches.
In real applications, both and are unknown. Since we consider a situation where has to be estimated on-line, we will estimate these two hyperparameters by the GCV filter. To do that, we first notice that (11) corresponds to the maximum a posteriori (MAP) estimator of and, under the stated Gaussian assumptions, also to its minimum mean-square estimator (MMSE). The estimate of can then be computed using the Kalman filter. In fact the state space model is
[TABLE]
where the state vector is the stochastic model for (with for any ), and are the -th entries of and , respectively, and is the -th row of .
We define a grid in the plane taking the points such that is in the set , while assumes values in a logarithmically spaced grid of 20 point between and . In this way, the grid consists of 140 points. We run 140 GCV filters in parallel, each corresponding to one of the points; when a new measure (and ) is available, we update the GCV score of each pair , selecting the one giving the minimum score.
We test the obtained GCV filter for on-line regularized system identification on a set of 100 Monte Carlo runs. At any run, a random impulse response of length is generated using the same mechanism described in [24, Section 7.4]. The generated system is fed with a with noise sequence of unit variance. Note that this type of input is persistently exciting and guarantees the observability of the system (4.3) (see e.g. [2]), avoiding the covariance windup phenomenon [36]. The standard deviation of the measurement noise is that of the 200 noiseless outputs divided by 10. We assume the system is at rest (the input is equal to zero) prior to the data collection. The performance of the estimator is evaluated by means of the fit (as a function of time)
[TABLE]
where is the impulse response generated at the -th Monte Carlo run, its mean, and its estimate (the impulse response estimate is function of the time instant ).
An example of one of the Monte Carlo runs is given in Fig. 5, which shows the evolution in time of the impulse response estimate and its fit. It suffices 50 measurements to the GCV filter to achieve an appreciable fit. The overall results of the Monte Carlo experiment are summarized in Fig. 6, which depicts the average fit of the impulse responses as a function of time. It can be seen that, after a short transient phase, the fit increases monotonically and achieves a high average value.
5 Conclusions
The novel filter here presented allows to propagate efficiently the GCV score over time. Hence, unknown parameters entering a state space model can now be estimated in an on-line manner resorting to one of the most important techniques used for parameter estimation. The asymptotic properties of the GCV filter provide also a new very efficient way to estimate the regularization parameter e.g. in smoothing splines. Applications of the new filter have been illustrated using artificial data regarding a function estimation and on-line regularized linear system identification.
A Matlab implementation of the GCV filter is available at the web page http://www.dei.unipd.it/ giapi/.
6 Appendix
6.1 Derivation of the GCV filter
Without loss of generality, we set the initial system condition to zero, i.e. . We also use and to denote the column vectors containing the states, the outputs and the measurements noises up to instant , i.e.
[TABLE]
Then, it holds that
[TABLE]
where is the regression matrix built using the measurement matrices , . We also use and to denote the state and output covariance matrix, i.e.
[TABLE]
where is the identity matrix. Note that, using the above notation, the smoothed estimate of , already encountered in Section 1, is
[TABLE]
so that the degrees of freedom at instant turn out
[TABLE]
The following simple lemma is useful for our future developments.
Lemma 2**.**
One has
[TABLE]
Proof: In view of (15), we start noticing that
[TABLE]
Then, (18) is obtained from the following equalities
[TABLE]
where the last two passages exploit (20) and (17), respectively.
Eq. 19 is instead proved as follows
[TABLE]
where the second and third equality exploit (20) and (16), respectively.
The dynamics of the matrix in the GCV filter are regulated by the discrete-time algebraic Riccati equation (DARE), which can be also rewritten as
[TABLE]
It is now easy to see that the matrix entering the GCV filter is the partial derivative of w.r.t. . In fact, differentiating the DRE, and adopting the notation , one has
[TABLE]
Exploiting the definition of and rearraging the terms, the recursive formula (5f) is obtained.
Now, consider the dynamics of the predicted state
[TABLE]
We now show that is the partial derivative of w.r.t. . In fact, differentating the above equation using the correspondence , one obtains
[TABLE]
This, combined with the definition of , leads to the recursive formula (5d).
Now, exploiting well known properties of the innovations sequence , whose variances are , and recalling that , we have
[TABLE]
Then, the recursive formula (5g) for the degrees of freedom is obtained combining the above equation and (18).
Still using properties of the innovations sequence, and recalling that , one also has
[TABLE]
This equation, in combination with (19), proves the correctness of the update rule (5h) for the sum of squared residuals and completes the derivation.
6.2 Proof of Proposition 1
If the system (3) is stabilizable and detectable, then standard properties of the algebraic Riccati equation (6) ensure that is symmetric and positive semidefinite and that the Kalman filter, corresponding to (5a), (5c) and (5e), is asymptotically stable (see [1, p. 77]). Because the Kalman filter is asymptotically stable, the matrix has all the eigenvalues inside the unit circle, ensuring that (7) admits a unique positive semidefinite solution [1, p. 67]. The filter state transition matrix which regulates the dynamics of and is
[TABLE]
and so it also has all eigenvalues inside the unit circle at least for sufficiently large .
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] B. D. O. Anderson and J. B. Moore. Optimal Filtering . Prentice-Hall, Englewood Cliffs, N.J., USA, 1979.
- 2[2] B.D.O. Anderson and J.B. Moore. Detectability and stabilizability of time-varying discrete-time linear systems. SIAM Journal on Control and Optimization , 19(1):20–32, 1981.
- 3[3] C. Andrieu, A. Doucet, and R. Holenstein. Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 72(3):269–342, 2010.
- 4[4] C.F. Ansley and R. Kohn. Efficient generalized cross-validation for state space models. Biometrika , 74(1):139–148, 1987.
- 5[5] M. Bertero. Linear inverse and ill-posed problems. Advances in Electronics and Electron Physics , 75:1–120, 1989.
- 6[6] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - revisited. Automatica , 48(8):1525–1535, 2012.
- 7[7] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik , 31:377–403, 1979.
- 8[8] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics , 13:1–50, 2000.
