Functional Registration and Local Variations: Identifiability, Rank, and Tuning
Anirvan Chakraborty, Victor M. Panaretos

TL;DR
This paper establishes conditions for the identifiability of phase and amplitude variations in functional data, and introduces a parameter-free registration method based on local variation, with theoretical guarantees.
Contribution
It provides sharp nonparametric conditions for identifiability and proposes a novel local variation-based registration method without tuning parameters.
Findings
Identifiability depends on the underlying rank of the data.
The proposed method is free of tuning parameters and avoids over/under-registration.
Asymptotic theory quantifies bias under mild departures from identifiability.
Abstract
We develop theory and methodology for the problem of nonparametric registration of functional data that have been subjected to random deformation (warping) of their time scale. The separation of this phase variation ("horizontal" variation) from the amplitude variation ("vertical" variation) is crucial in order to properly conduct further analyses, which otherwise can be severely distorted. We determine precise nonparametric conditions under which the two forms of variation are identifiable. These show that the identifiability delicately depends on the underlying rank. By means of several counterexamples, we demonstrate that our conditions are sharp if one wishes a genuinely nonparametric setup; and in doing so we caution that popular remedies such as structural assumptions or roughness penalties can easily fail. We then propose a nonparametric registration method based on a "local…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Functional Registration and Local Variations: Identifiability, Rank, and Tuning
Anirvan Chakrabortylabel=e1][email protected] [
Victor M. Panaretos label=e2][email protected] [ Indian Institute of Science Education & Research (IISER) Kolkata, India
Institut de Mathématiques, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Abstract
We develop theory and methodology for the problem of nonparametric registration of functional data that have been subjected to random deformation (warping) of their time scale. The separation of this phase variation (“horizontal” variation) from the amplitude variation (“vertical” variation) is crucial in order to properly conduct further analyses, which otherwise can be severely distorted. We determine precise nonparametric conditions under which the two forms of variation are identifiable. These show that the identifiability delicately depends on the underlying rank. By means of several counterexamples, we demonstrate that our conditions are sharp if one wishes a genuinely nonparametric setup; and in doing so we caution that popular remedies such as structural assumptions or roughness penalties can easily fail. We then propose a nonparametric registration method based on a “local variation measure”, the main element in elucidating identifiability. A key advantage of the method is that it is free of any tuning or penalisation parameters regulating the amount of alignment, thus circumventing the problem of over/under-registration often encountered in practice. We provide asymptotic theory for the resulting estimators under the identifiable regime, but also under mild departures from identifiability, quantifying the resulting bias in terms of the amplitude variation’s spectral gap.
62M99,
62G08,
62G20,
Functional Data Analysis,
Phase Variation,
Synchronisation,
Warping,
keywords:
[class=MSC]
keywords:
\startlocaldefs\endlocaldefs
and
Contents
1 Background and Contributions
Background
Functional observations can fluctuate around their mean structure in broadly two ways: (a) amplitude variation, and (b) phase variation. The first type of variation is analysed using functional principal component analysis, which stratifies the variation in amplitude (or variation in the “vertical axis”) across the different eigenfunctions of the covariance operator of the underlying distribution. The second kind of variation, if present, is more subtle and can drastically distort the analysis of a functional dataset. It typically manifests itself in functional data representing physiological processes or physical motion, and consists in deformations of the time scale of the functional data (or variation in the “horizontal axis”), associating to each observation its own unobservable time scale resulting from a transformation of the original time scale by a time warp. Specifically, instead of observing curves , one actually observes warped versions , where the ’s are unobservable (random) homeomorphisms termed warp maps. In the presence of phase variation, the mean of the warped data conditional on the warping, , is a distortion of the true mean by the warp map. Failing to account for the time transformation will yield deformed mean estimates, converging to rather than . More dramatic still will be the effect on the estimation of the covariance of the latent process, inflating its essential rank, and yielding uninterpretable principal components. We refer to Section 2 in Panaretos and Zemel (2016) for a detailed discussion of these effects. Consequently, in the presence of phase variation in the data, the natural first step in the analysis should be to register the data, i.e., to simultaneously transform/synchronise the curves back to the objective time scale.
Owing to the rather complex nature of the registration problem, a variety of different assumptions on the latent process and the warp maps have been considered, and correspondingly a multitude of methods have been investigated: landmark based registration (Kneip and Gasser, 1992); template/target based registration (Ramsay and Li, 1998); registration using dynamic time warping (Wang and Gasser, 1997, 1999); registration based on local regression (Kneip et al., 2000); a “self-modelling” approach by Gervini and Gasser (2004) for warp maps expressible as linear combinations of B-splines; related registration procedures under assumptions on functional forms of the warp maps that result in a finite dimensional family of deformations (Rønn, 2001; Gervini and Gasser, 2005); a functional convex synchronization approach to registration (Liu and Müller, 2004); registration using “moments” of the data curves (James, 2007); registration based on a parsimonious representation of the registered observations by the principal components (Kneip and Ramsay, 2008); pairwise registration of the warped functional data under monotone piecewise-linear warp maps (Tang and Müller, 2008); a joint amplitude-phase analysis with this pairwise registration procedure but considering step-function (thus finite dimensional) approximations of the warp maps using finite difference of their log-derivatives (centered log-ratio transform) (Hadjipantelis et al., 2015); registration when the warp maps are generated as compositions of elementary “warplets” (Claeskens, Silverman and Slaets, 2010); and registration using a warp-invariant metric between curves when the warp functions are diffeomorphisms on an interval (Srivastava et al., 2011). The above list is not exhaustive and we refer to Marron et al. (2015) for an oveview and comparison of some of the registration procedures mentioned above. More recently, Pigoli et al. (2017) applied the pairwise registration procedure of Tang and Müller (2008) for two-dimensional curves, where the warping is in only one of the dimensions, while Lila and Aston (2017) generalized the pairwise registration method for manifold valued data.
Several of the above contributions consider the case when the warp maps are themselves random, and in such cases, a canonical set of assumptions is usually required:
- (a)
is a strictly increasing homeomorphism with probability one, and
- (b)
, where is the identity map, .
The first assumption rules out “time-reversal” or “time-jumps”, while the second disallows an overall speed-up or slow-down of time. Further to these natural assumptions, most of the above cited papers impose additional smoothness and structural assumptions on the warp maps, which require tuning parameters to be selected. However, it is unclear whether these additional assumptions are either necessary or indeed sufficient for identifiability to hold. It is an open problem to determine what assumptions must one minimally impose on the latent functional data generating process so that the registration problem be identifiable under conditions (a) and (b) on the warp maps. This is of importance to understand since, in practice, one rarely has more detailed insights regarding the underlying warping phenomenon.
Consider the model
[TABLE]
for the latent process, with a unit norm deterministic function, random scalars, and zero-mean random functions of unit variance (i.e. ). When is unrestricted, the model (1) spans any possible functional datum. The value of then regulates the balance between an (effectively) low rank model () or a higher rank model (larger ). When one has exactly one has a rank 1 model. Several well-known approaches for registration available in the literature (see, e.g., Rønn (2001), Gervini and Gasser (2004, 2005), Tang and Müller (2008), Srivastava et al. (2011)) have considered variants of model (1), with the assumption that is small relative to (for this reason, and for ease of reference, we thus henceforth refer to Model (1) as the “standard model”). In other words, it is postulated that if it were not for phase variation, important landmark features such as peaks and valleys of the latent process would not drastically change from realisation to realisation. In effect, there seems to be a certain concordance that identifiability (and hence consistency in the usual sense) rests crucially on an implicit assumption that the amplitude variation of the synchronised functions is of low rank, and phase variation is dominant over amplitude variation.
Observe that the dominating component in the warped process obtained by warping model (1) forms a sub-class of the so-called general non-linear shift models (NLSM). These models find extensive use in comparison of semi-parametric regression models (see, e.g., Härdle and Marron (1990)), and have been studied in the context of landmark and dynamic registration techniques by Kneip and Gasser (1992) and Wang and Gasser (1997, 1999). Also note that the landmark principle of registration essentially stipulates that the true curves have similar shape (thus having the same landmarks) but possibly differ in their amplitude component. Although some of the earlier papers, e.g., Ramsay and Li (1998), Kneip et al. (2000), Kneip and Ramsay (2008), Claeskens, Silverman and Slaets (2010) consider higher rank models for the latent process corresponding to nontrivial (with additional structural assumptions on warp maps), it is not known whether these procedures are truly identifiable/consistent. Indeed, Kneip and Ramsay (2008) (see p. 1160) acknowledged the fact that for such higher rank models, one can have different valid registrations based on the degree of complexity of the warp maps that one allows (cf. Counterexample 5). Further, as hinted in Tang and Müller (2008), who consider model (1), identifiable (consistent) registration appears not to be guaranteed unless one lets as . Recently, Wagner and Kneip (2019) studied a nonparametric registration procedure that registers the warped curves to a low dimensional subspace provided that the number of “feature points” are bounded almost surely.
Our Contributions
We contribute to the nonparametric synchronisation problem with theory, methodology, and asymptotics, and corroborate our findings with simulations and a data analysis:
Firstly, we provide a rigorous study of the issue of identifiability, providing conditions for the standard model 1 to be identifiable. Conversely, we show by means of several counterexamples that our conditions are relatively sharp. 2. 2.
Secondly, we develop methodology to address the problem of nonparametric and consistent recovery of the warp maps from discretely warped curves without structural assumptions on the warp maps further to (a) and (b). A salient feature of our methods is that they do not require penalisation or tuning related to the warp maps. Our methodology is adapted to cover all three standard observation settings: complete observation, discrete observation, and discrete observation with measurement error. 3. 3.
We carry out an asymptotic analysis in all three observation settings. We also investigate the setting when the model deviates from identifiability and derive results quantifying the amount of asymptotic bias incurred in terms of the spectral gap of the amplitude variation (Theorem 6). 4. 4.
We probe the finite sample performance of our methodology (Section 5), for all possible observation regimes, and compare to other popular registration techniques, including in settings departing from the identifiable regime, observing a noteworthy stability of our method to mild such departures. 5. 5.
The method is further illustrated by analysis of a functional dataset of Triboleum beetle larvae growth curves (Section 6), yielding biologically interpretable results.
The key to our results is the novel use of a criterion that measures the local amount of deformation of the time scale (Section 3), the local variation measure of , with associated cumulative distribution . Here, and throughout the rest of the paper, for any function (random or deterministic), its first order derivative will be denoted by , i.e., . Further, its second, third and fourth order derivatives will be denoted by , and , respectively. The local variation measure reflects how the total amount of variation of the curve is distributed on the real axis. The simple but consequential insight is that by a change-of-variable argument, the total variation measure remains invariant under any strictly increasing deformation of the time scale of , namely, where . However, it is the local amount of deformation that provides the information about the warping mechanism. This insight is what circumvents the need to introduce registration tuning parameters – even when the curves are observed over a discrete grid111Of course, once the warp maps are estimated, one would have to smooth the warped discrete data in order to register them, since the warped data are not observed at all points of their domain. And, if there is measurement error in the observations, then some pre-smoothing will be needed. But in either case, this smoothing will be on the data itself (either as a pre-processing or post-processing step), and no smoothing penalties or structural assumptions will be required on the registration maps themselves.. This connection also guides us in the construction of counterexamples, illustrating where caution should be taken.
2 Identifiability and Counterexamples
Recall that the standard model for the latent/synchronised process prior to warping (Equation 1) takes the general form
[TABLE]
This, depending on the constraints imposed on the random variable and the scalar , can be of arbitrarily large rank, and indeed can span any functional datum. Usually is expected to be the dominant effect relative to (i.e. ), corresponding to an effectively low rank model. We now give sufficient conditions on the standard model for that identifiability will hold in a genuine nonparametric sense. In simple terms, these require the process must be exactly of rank 1 (i.e. or .
Theorem 1** (Identifiability).**
Let be a random elements in of rank one, i.e., for deterministic functions with , and with vanishing on at most a countable set. Assume that are strictly increasing homeomorphisms in , and such that . Write . Then,
[TABLE]
The proof is given in Section A.2 of the Appendix (Supplementary Material). The assumption that does not vanish except perhaps on a countable set excludes the possibility of constant functions, in which case the problem is vacuous and identifiability trivially fails. Note that the identifiability result in Theorem 1 does not require that and be independent.
One might understandably argue that the rank 1 assumption in the previous theorem is restrictive. Perhaps surprisingly, though, the condition can be seen to be rather sharp. We construct a series of counterexamples, demonstrating how unstable identifiability is to higher ranks (even rank 2). These illustrate that the situation cannot be rectified at a genuinely nonparametric level, e.g. by assuming specific classes of models on the synchronised processes (such as splines or trigonometric functions); or by imposing qualitative non-parametric constraints, such as roughness penalties, Sobolev norm bounds or rank restrictions on the warp maps (or combinations of these).
Counterexample 1**.**
Our first counterexample shows that the same rank 2 process can arise either as warped rank 1 process, or as a syncrhonised rank 2 process. Both the process itself and the warp maps can be taken to be of rank at most 2 (notice that a rank 1 warp map would need to be the identity almost surely). Define and Take to be a standard Gaussian random variable and for . Now define a random warp map such that . Then satisfies (a) and (b). Now define , where is a Bernoulli random variable with success probability and . Let and so that where and , . Since and are , and and are bounded away from zero on , so are their inverses. Also, the inverses are as well. It is easy to check that . Further, it is easy to show that and are linearly independent. Consequently, we may define a new process , which is a rank two process. Define . Then, (in fact ) but they have been generated using two different latent processes, namely and , and warp maps, namely and , which of course do not have the same distribution.
Counterexample 2** (Rank 2 warped to rank 1 – smooth latent process and warp maps).**
We will give two constructions demonstrating that the same rank one process can arise in one of infinitely many ways: (i) as a rank one analytic process with no warping, and (ii) one of an infinite collection of rank two analytic processes subjected to warping by one of an infinite collection of non-trivial analytic warp maps satisfying (a) and (b).
(A) First take the latent model class to consist of linear combinations of trigonometric functions and polynomials. Define and for some . Let , where . Here and with satisfying . It can be checked that satisfies (a) and (b) for all . Let be a random variable independent of . Define and . It can be checked that for all . Since an are independent, it follows that . Also, since (by direct calculation), the form of given above is in fact its Karhunen-Loéve (KL) expansion, which is of rank 2, and this holds for all . The plots of sample paths of and along with the warp maps and are shown in Figure 1.
(B) For the second construction, we take the latent model class to consist of linear combinations of polynomials only. Define . Fix and any finite subset of . Also, fix reals satisfying . Consider the Legendre polynomials on . Since these satisfy for , it follows that . Define and , where , where . Here and elsewhere in the paper, will denote the usual sup-norm on . The above construction ensures that , , and satisfies (a). It is clear that satisfies (b). Let and , where is as in the first construction. Then, it can be shown that . Also, is rank 2, and the above form is in fact its KL expansion because and , which follows as earlier.
By taking to be a constant random variable, this counterexample also shows that one cannot extend the identifiable regime from to , where .
Counterexample 3** (Penalising the Warp Maps).**
We will show that even if one penalises the warp maps, e.g., by one or both of and , still one can get infinitely many possible solutions for the registration problem. Under the setup of (A) in Counterexample 2, and . For (B) in the previous counterexample, it can be shown using the orthogonality of the Legendre polynomials that and , where denotes the usual norm on . Thus, in both cases, for any , the sum of the two penalty terms can be made arbitrarily small by choosing large enough (depending on the choices of the other parameters – , , ’s and ’s).
The above facts imply that if one wants to carry out the registration using the penalization procedure , where is a class of warp maps, and takes values in an appropriate synchronized space of linear combinations of functions, then we have infinitely many registrations valid registrations as follows:
(i) under setup (A) – if we allow to include monotone homoemorphisms on whose deviation from the identity is a trigonometric function, and even if is restricted to linear combinations of linear and trigonometric functions (both and belong to this class).
(ii) under setup (B) – even if we allow and to only include polynomials.
Note that for both (i) and (ii), the “fit” term becomes zero.
Counterexample 4** (Spline models for latent process and warp map).**
Our next counterexample shows that structural restrictions on the latent synchronised process, such as spline models, will also fail if the rank is higher than 1. We will consider cubic splines but one can similarly construct more elaborate counterexamples involving higher order splines and more knots. Let be a cubic spline with a single knot at , i.e., , and define , where and are fixed. Let and with and as before, and choose . This ensures that satisfies (a) and (b). Define
[TABLE]
where and . Note that is a linear spline with knots at and . Also, and are splines (quadratic and linear, respectively) with knots at . Hence, these can be considered as elements of the cubic spline space with knots at and . So, by repeated application of Theorem 3.1 in Mørken (1991), the functions , , and are elements of the space of cubic splines with a finite set of knots (including and ). So, both and lie in . If we assume that , then it follows that is linearly independent of , and (since these three functions equal zero at ). Thus, is of rank at least two. Now, it can be checked that . Thus, two distinct processes and can be warped (by the maps and , respectively) to produce the same process.
If we choose , i.e., take to be a cubic polynomial (which also lies in trivially), then we can choose to be a spline on of degree with a fixed set of knots. So, in this case, we can have differentiable (instead of a.e. differentiable) warp maps. In this case, we choose . Then, for the same , the conclusion of the above counterexample holds.
Counterexample 5** (Latent process with geometric properties).**
Our last counterexample illustrates that even a priori knowledge of landmarks does not help rectify identifiability if the rank 1 condition is violated. Let so that the latent process has a unique maximum at . A priori knowledge of existence of a unique maximum in synchronized space can be utilized to carry out a landmark/peak alignment of the warped curves. Let us denote the vector space of functions with unique maximum at by , and the vector space of functions proportional to the bell-shaped curve by . Obviously, . Let be any warp map independent of and satisfying (a) and (b). Define a new warp map as follows: . Note that satisfies (a) and (b). Define . It can be checked that the process has a unique maximum at , where satisfies , equivalently, . However, from the construction of , it is easy to check that . So, . Defining and , it follows that although and are different processes. Further, although , it holds that provided , and has rank at least two. This counterexample (without explicit constructions of the latent processes or of the warp maps) is mentioned in Kneip and Ramsay (2008). Note that in both cases, one achieves “consistent registration” in the sense of that paper.
What we learn from these counterexamples is that identifiability crucially rests upon constructing a synchronised space of processes (contained within continuous processes on ) and a warp map space of processes (contained within strictly monotone homeomorphisms onto with identity expectation) such that:
- (I)
Warping causes the latent process to exit the synchronised space, i.e. but .
- (II)
There exists a unique process such that for some random .
Theorem 1 informs us that such a construction is possible by taking to essentially be rank 1 non-constant processes, and otherwise not restricting except for a assumption. The counterexamples demonstrate that allowing higher ranks can have severe effect on identifiability, even if is modeled more concretely, or indeed if is restricted to be smoother. In light of this, we will introduce the terminology of “identifiable regime” to mean the pair implied by the context of Theorem 1:
Definition 1** (Identifiable Regime).**
We define the identifiable regime to involve latent synchronised processes , warp maps , and warped processes , where:
- (I1)
The synchronised process space is , for a real-valued random variable of finite variance and is a deterministic function of unit -norm, whose derivative vanishes at most on a countable subset of .
- (I2)
The warp map space is and is a strictly increasing homeomorphism.
Remark 1** (Unidentifiable vs Identifiable Regimes).**
Deviations from the identifiable regime will be generally termed as a “potentially unidentifiable regime”. We say “potentially” because the conditions (I1) and (I2), though sharp in a fully nonparametric setting, are not necessary. One can presumably produce identifiable models escaping the framework in Definition 1 by introducing explicit parametric assumptions (and/or perhaps considering weaker forms of identifiability). For example one could modify Definition 1 into a rank-two synchronised model under the parametric assumption that . This model is an arguably trivial extension of the rank one model with an inclusion of an additive random vertical shift, which is obviously invariant to phase variation. If and , this model can be shown to be identifiable in the sense of Theorem 1. We refer to the Supplementary Material for a formal statement (Theorem 7) and further details. Note that any method that can estimate the warp maps in the rank 1 model, can also accommodate the presence of a vertical shift as above (see Remark 4), which is why we call the inclusion of a vertical shift a trivial modification.
With identifiability clarified, we now turn to nonparametric methods of estimation. Our goal will be to construct methods that perform provably well in the identifiable regime, remain stable under small departures (e.g. effectively rank 1 rather than precisely rank 1 models), and do not rely on tuning (which adds a layer of arbitrariness and in any case was seen to be unavailing). For these, we will require the notion of local variation measure, introduced in the next section.
In closing this section we remark that Wagner and Kneip (2019) also independently arrive at the conclusion that, under the standard assumptions on the warp maps, identifiability cannot be guaranteed beyond rank one models (though our results are antecedent, see Chakraborty and Panaretos (2017)).
3 Tuning-Free Methodology
Recall that the total variation of a continuous function measures the total distance sweeped by the ordinate of its graph, as the abscissa moves from [math] to . By distorting functions “in the -domain” through an increasing homeomorphism, phase variation will not affect the total amount of variation accrued over the interval . However, it will redistribute this total variation over the subintervals of . This redistribution can be measured by focussing on local variation:
Definition 2** (Local Variation Distribution).**
Given any real function , we define
[TABLE]
where is a partition of and is the collection of all finite partitions of . Noting that is the total variation of , define the local variation distribution as F_{h}(t)={J_{h}(t)}\big{/}{J_{h}(1)}.
Remark 2**.**
Recall that when , it holds that . The general definition comes handy under discrete observation, this one under continuous observation.
We now show that, in the identifiable regime, warping affects the local variation of the underlying process in a rather predictable manner – one that can be used to motivate estimators. We will write and for simplicity.
Lemma 1** (Local Variations and Warp Maps).**
When fall under the Identifiable Regime (1), and are strictly monotone almost surely, , and
Remark 3**.**
In the language of transportation of measure, Lemma 1 says that the warp map pushes forward the original local variation distribution to the warped local variation distribution, in fact optimally so in terms of quadratic transportation cost; and that the synchronised local variation measure is the Fréchet mean of the (random) warped local variation measure in the 2-Wasserstein distance.
Now suppose we have an i.i.d. sample of randomly warped functional data that we wish to register, i.e. we wish to construct nonparametric estimators of the and the on the basis of . If we expect the data to (at least approximately) conform to the identifiable regime (1), we can rely on Lemma (1) as inspiration for tuning-free methodology. We would like to emphasize that this methodology will be applicable whatever the “true model”, of course, but the point is for it to be accurate under the identifiable regime, and stable when mildly departing from identifiability. We construct such methodology under all three different observation regimes on : complete observations (Section 3.1), discrete noiseless observations (Section 3.2), and discrete observations with measurement error (Section 3.3). We then study the performance under identifiability/unidentifiability theoretically in Section 4 and numerically in Section 5, where we indeed observe a certain stability to mild departures from identifiability.
3.1 Fully Observed Functions
Assuming the functions are fully observed, we may proceed as follows:
- Step 1:
Set
[TABLE]
noting that the are immediately available by complete observation of the .
Note that under the identifiable regime (1), estimates .
- Step 2:
Estimate the warp map by , and the registration map by .
- Step 3:
Register the observed warped functional data, by means of .
If we suspect to be in the identifiable regime (1), we may also want to estimate the pairs . In this case, the obvious additional steps will be:
- Step 4:
Compute the empirical covariance operator, say, of the registered data and estimate by the leading eigenfunction of (as a convention, assume that this estimator is aligned with the true , i.e., ).
- Step 5:
Estimate by .
In Step 4 above as well as in all other instances in the paper, denotes the usual inner product on .
3.2 Discretely Observed Functions
In the discretely observed setting, the ’s are not fully observed. Instead, we observe point evaluations
[TABLE]
Here, is a grid over , assumed asymptotically homogeneous in that as . The latent discrete process is denoted by .
Our strategy will be to mimic Steps 1–5 from the fully observed setup. Since the ’s are no longer fully observed, though, in order to have versions of the and , we will draw inspiration from the general definition of the local variation distribution (Equation 2 in Definition 2). First, define
[TABLE]
for and each , where is the set of all ’s satisfying . Note that because we only observe each curve over the grid , we have replaced the supremum over all grids in Equation 2 of Definition 2 by just this one (the finest grid we get to observe). Clearly, has jump discontinuities at the grid points ’s, is càdlàg, and satisfies for all and for all .
For the (discretely) observable warped process, we define
[TABLE]
The ’s also have jump discontinuities at the grid points, and are càdlàg.
Under the identifiable regime, in particular, we would have for all , where
[TABLE]
Its jumps are at most of size . Moreover, in the identifiable regime,
[TABLE]
where for each and are unobserved random variables. The maximum jump size of is .
With the general definitions of and in place, we can now adapt Steps 1–5 to the discrete case. In what follows, the generalized inverse of a function is denoted by , i.e., . The first two steps will remain invariant, except for the fact that they will now employ the discrete local variation measures. This means that we will not require any tuning parameters or smoothness assumptions to estimate the warp and registration maps. The registration itself (the last three steps) will require some smoothing, of course, if it is to make sense:
- Step :
Set and .
Note that under the identifiable regime (1), mimics .
- Step :
Predict the random warp map by and the registration map by .
- Step :
Since the ’s are observed discretely, we do not have information about their values between grid points. Thus, we first smooth each of the using the Nadaraya-Watson kernel regression estimator for an appropriately chosen kernel and bandwidth , denoting resulting smoothed functions by ,
[TABLE]
Define
[TABLE]
to be the registered functional observations and write for their mean.
As in the fully observed situation, if we suspect to be in the identifiable regime (1), we estimate the pairs as follows:
- Step :
Compute the empirical covariance operator of the registered curves , and use its leading eigenfunction as the estimator of (again, assume the convention that the sign is correctly identified, i.e., ).
- Step :
Finally, estimate by for each .
We should point out here that our method is also straightforwardly applicable in the situation where the grid over which the ’s are observed, say, , differs with . The reason for this compatibility is the fact that our approach considers only one curve at a time. We formulate it in the notationally simpler case of a common grid, in order to alleviate the notation in the statement of our asymptotic results in Section 4. Additional remarks on the practical implementation are included in the Supplementary material (Section A.1).
3.3 Discrete Observation With Measurement Error
It can often happen that the discretely observed functional data be additionally contaminated by measurement error. In this case, one has to suitably adapt the registration procedure. In the presence of measurement error, we observe , where was defined in Section 3.2, and with the being a collection of i.i.d. error variables with zero mean and variance , independent of the processes and warp maps.
We will modify the registration procedure as follows. First, construct a non-parametric function estimator of , which is the derivative of the warped process , using the observation for each , and call this estimator . Define analogues of the ’s as
[TABLE]
Note that unlike the discrete observation case described in the previous section, we now have fully functional versions of for each , which allows us to mimic the algorithm in the fully observed scenario in Section 3.1.
- Step :
Set .
Under the identifiable regime (1), in particular, we have estimates .
- Step :
Predict the warp map by , and the registration map by .
- Step :
Construct non-parametric function estimators of the ’s using the ’s, and call them ’s. Define to be the registered functional observations.
If we suspect to be in the identifiable regime (1), we estimate the pairs as follows:
- Step :
Write for the mean of the registered observations and let denote their empirical covariance operator. Take its leading eigenfunction, denoted by , as the estimator of (assuming the same sign convention as earlier).
- Step :
Finally, estimate by for each .
There are two smoothing steps involved in the above algorithm. Given the large literature on non-parametric smoothing techniques, one can choose any smoother. However, the asymptotic results will depend on the efficiency of the chosen smoothing techniques. From now on in this paper, we will use a local quadratic regression approach with kernel and bandwidth for finding . We will then use a local linear estimator with kernel and bandwidth for estimating . These choices are motivated by the advantages of local polynomial estimators in dealing with boundary effects (see, e.g., Fan and Gijbels (1996) and Wand and Jones (1995) for further details on various smoothing techniques). More details on the choices of smoothing parameters are given in Remark 4 after Theorem 5.
Remark 4** (Presence of an additional vertical shift, as in Remark 1).**
In Remark 1, we discussed a trivial rank two identifiable model given by , which is simply the inclusion of a random vertical shift in the rank 1 model. The methods presented still apply in this case. For simplicity, consider fully observed warped data , , from the model with vertical shifts. Since the registration procedure based on local variation measure depends on the process only through its derivative, the estimators of obtained from will be exactly the same as those obtained from considered earlier. The vertical shifts can then be easily estimated by . Indeed, will converge to zero almost surely as (as a corollary to Theorem 2 in Section 4).
4 Asymptotic Theory
We next study the asymptotic properties of the estimators obtained above. We develop separate results for each of the three observation regimes considered (full observation, discrete observation, discrete observation with measurement errors). In what follows, the space is equipped with the norm , where is the usual sup-norm. The 2-Wasserstein distance between distributions and will be denoted by d_{W}(G_{1},G_{2})=\sqrt{\int_{0}^{1}\big{(}G_{1}^{-}(u)-G^{-}_{2}(u)\big{)}^{2}du}. All the proofs are developed in Section A.2 of the Appendix (Supplementary Material).
4.1 Identifiable Regime
We first focus on the identifiable regime as given in Definition 1. Our first two results concern the fully observed case, as described in Section 3.1. Write for the mean function and for the covariance operator, where for any triple . Let denote the trace norm for operators on . The covariance kernel of is denoted by and the empirical covariance kernel of the ’s is denoted by .
Theorem 2** (Strong Consistency – Fully Observed Case).**
Further to the assumptions in Definition 1, assume also that is Hölder continuous with exponent . Then, the estimators in Section 3.2 satisfy the following asymptotic results, where convergence is always with probability one:
- (a)
* as .*
- (b)
* and as for each .*
- (c)
* as for each .*
- (d)
* as for each , where is the local variation measure associated with .*
- (e)
* as , where .*
- (f)
* and as . Moreover, and as for each .*
Furthermore, if we additionally assume that and almost surely for a deterministic constant (call this “Condition 1”), then the following stronger results hold with probability one, in lieu of (b), (c), and (e):
- (b’)
* and as for each .*
- (c’)
* as for each .*
- (e’)
* as , where .*
Remark 5**.**
A few remarks concerning the last theorem are as follows:
Independence
The strong consistency results in Theorem 2 do not require that and are independent.
Uniformity
It is observed from the proof of the uniform convergence of in part (b) of the above theorem that
[TABLE]
almost surely. Under Condition 1, the same conclusion is true now with the finer norm . The convergence in part (d) also holds uniformly for all .
Fisher Consistency
It can be directly verified that so that , where . Also, , , and for each . Further, , where . Thus, , and . Since all of the above estimators are measurable functions of the sample averages of the ’s, the ’s and the ’s, it follows that all of the above estimators are Fisher consistent for their population counterpart.
A Concrete Example
The condition almost surely for a deterministic constant can be relaxed to almost surely for i.i.d. positive random variables provided we assume that . An example of random warp functions that satisfy can be found Section 8 of Panaretos and Zemel (2016). Define and for , define for some . If is an integer-valued, symmetric random variable, then . For a fixed , let be i.i.d. integer-valued, symmetric random variables, and be i.i.d. random variables independent of the ’s. Define . Then, is a strictly increasing homeomorphism on , surely, . Further, it can be easily shown that . Thus, the condition holds if we choose .
Further to strong consistency, we also derive weak convergence of the estimators:
Theorem 3** (Weak Convergence – Fully Observed Case).**
Further to assumptions in Definition 1, assume also that is Hölder continuous with exponent , that and are independent for each , and that . Then, the estimators in Section 3.1 satisfy the following asymptotic results,
- (a)
* converges weakly as .*
- (b)
* and converge weakly in the topology as for each .*
- (c)
* converges weakly in the topology as for each .*
- (d)
* converges weakly as for each .*
- (e)
* converges weakly to a zero mean Gaussian distribution in the topology as .*
- (f)
* converges weakly in the topology of Hilbert-Schmidt operators, and converges weakly in the topology as . In both cases, the limits are zero mean Gaussian distributions. Moreover, converges weakly to a zero mean Gaussian distribution in the topology, and converges weakly as for each .*
Since is a stronger topology than for any finite , it follows that the weak convergence results in the above theorem which hold in the topology also hold in the topology by virtue of the continuous mapping theorem.
We shall now study some the asymptotic properties of the estimators in the discrete observation setup (without measurement error).
Theorem 4** (Limit Theory – Discretely Observed Case Without Measurement Error).**
Further to the conditions of Theorem 3, assume that , for some , and that almost surely for a deterministic constant . Define . Assume that and are independent for each (only for the weak convergence statements). The kernel is assumed to be supported on . If and satisfies as , then the estimators introduced in Section 3.2 satisfy
- (a)
* as almost surely, and as .*
- (b)
* and as almost surely. Further, and converge weakly in the topology as for each .*
- (c)
* as almost surely, and converges weakly in the topology as for each .*
- (d)
* as almost surely, and as for each .*
- (e)
* as almost surely, and converges weakly in the topology as .*
- (f)
* as almost surely, and converges weakly in the topology of Hilbert-Schmidt operators. Further, as , and converges weakly in the topology as . Moreover, as almost surely, and converges weakly in the topology. Also, as almost surely, and converges weakly as for each . *
In all the weak convergence results stated above, the limits are identical to the corresponding limits obtained in the fully observed scenario in Theorem 3.
Remark 6**.**
Independence
As in the fully observed setting in Theorem 2, the strong consistency results in the discrete, noiseless observation setting in Theorem 4 do not require and to be independent.
Variable observation grids
The asymptotic results remain valid in the case where the grid over which the ’s are observed, say, , differs with . The proof, however, will be notationally quite cumbersome. In this case, the requirement on the grid will be as follows: as for each , and satisfies as .
Smoothing parameter choice
The choice of in Theorem 4 is an under-smoothing choice. It is made on account of the absence of measurement errors in the observations, which enables us to under-smooth the data without damaging -consistency. This is unlike what happens in classical non-parametric regression due to the presence of errors in that scenario. Also, the boundary points inflate the bias of the Nadaraya-Watson estimator to an order of (the same order as that obtained in Theorem 4 for all points). However, these issues are of no consequence in this scenario. It is also natural to under-smooth in this situation since appropriate under-smoothing retains the features of the curves better and allows estimation at a parametric rate even under non-parametric smoothing. If instead of the Nadaraya-Watson estimator, one uses a local linear estimator with bandwidth , then the bias is of order (even at the boundaries). In this case, has to be to achieve parametric rates of convergence, which is again an under-smoothing choice. Thus, the choice of smoothing method does not play a crucial role in this setup.
Topology of convergence
Unlike Theorem 3, the weak convergence results are all in the topology. This is because unlike the fully observed case, the estimators involved are not continuous functions in . We could not consider the weaker topology since not all estimators will be càdlàg functions. However, we still retain the strong consistency results in parts (b), (c) and (e) in the sup norm similar to Theorem 2. This is due to the fact that those estimators are uniformly bounded almost surely, and thus have finite sup-norm. Further, in all cases, there is no issue with the measurability of the supremum.
Assumption on eigenfunction
The condition can be relaxed to requiring that is Lipschitz continuous. Moreover, the requirement for some is not restrictive. Of course, it holds if is bounded away from zero on , in which case one can choose . Consider the case when and let be such that . If , then we can choose an interval such that . Then, a first order Taylor expansion yields for any . Here, we have used the fact that for any iff . Thus, if none of the zeros of and coincide, then the condition holds for any . In general, if for some , and be the least integer between and such that none of the zeros of and coincide, then holds for any .
We finally study the asymptotic properties of the estimators in the modified registration procedure employed when one has contamination by measurement error (described in Section 3.3).
Theorem 5** (Limit Theory – Measurement Error Case).**
In addition to the assumptions of Theorem 3, assume that , for some . Define . Assume that and are independent for each . Suppose that a.s. and almost surely for a deterministic constant . The kernels and are assumed to be supported on , symmetric and continuously differentiable. The errors are assumed to be a.s. bounded. Also assume that as well as each of , and are finite. The bandwidths satisfy , . Then, the estimators in Section 3.3 satisfy the following properties.
- (a)
* as .*
- (b)
Both and are as .
- (c)
* as .*
- (d)
* as .*
- (e)
* as . Consequently, and have the same rates of convergence for each fixed .*
Remark 7**.**
Smoothing techniques
Analogous rates of convergence can also be obtained if one uses different non-parametric smoothing techniques than the ones in the theorem. One may, e.g., use a Nadaraya-Watson estimator in Step 3** with boundary kernels to alleviate the boundary bias problem that is well-known for this estimator (see, e.g., Wand and Jones (1995)). Also, to estimate , one may use higher order local polynomials with even orders. However, these will be computationally more intensive as well as need additional smoothness assumptions on the latent process and the warp maps.
Rates of convergence
It is observed in the above theorem that the rates of convergence are slower than the parametric rates achieved in the earlier settings due to the non-parametric smoothing steps involved – especially the estimation of derivatives, which is known to have quite slow rates of convergence. Further, the contributions of the two smoothing steps in the convergence rates are clear. It is well known in local linear regression that the optimal rate for is and that for is . With these rates, we have , and the remaining quantities are . Thus, parametric rates of convergence is achieved if .
Assumption on principal component
Let and observe that since . The condition in Theorem 5 is obviously satisfied if is bounded away from zero. Suppose that has a continuous density , say, either on or on in which case it is assumed to be symmetric about zero. If for some , then it is easy to show that if , which is quite general in view of point (4) in Remark 6. If , then this expectation is finite if .
4.2 Potentially Unidentifiable Regime
As emphasized before (Section 3.1), our procedure can be used whether or not the latent process falls in the identifiable regime of Definition 1. In this section, we carry out a theoretical analysis of the stability of our registration procedure when the distribution of the latent process deviates from the identifiable regime. Since identifiability is lost, it is clear that consistency is no longer achievable. However, we can quantify how much the estimators deviate from their population counterparts, at least asymptotically. Since the model may fail to be identifiable, strictly speaking there is no unique setting corresponding to the law of the data. For this reason, as a convention, we will assume that a “true” underlying distribution is known and fixed. For simplicity of exposition, we focus on the rank two case. This will be seen to carry the essence of the underlying effects, as we discuss in the third point of Remark 8. To obtain more transparent results, we focus on the case where the underlying functions are completely observable as continuous objects.
Let for , where and are uncorrelated, and and are two orthonormal elements of . Let . Denote and for . Then,
[TABLE]
gives the Karhunen-Loève expansion of . The (random) local variation distribution induced by is for . Note that contrary to the rank one case, where did not play a role in (due to cancellation of the term from the numerator and the denominator), here it cannot be neglected. We will later see that it will play a role in the performance of the estimators. Defining , which is the square root of the inverse of the condition number, it follows that
[TABLE]
The local variation distribution induced by the observed warped data is given by
[TABLE]
The idea is that if under suitable conditions the ’s manifest small variability, then the registration procedure will work quite well. We will illustrate two different situations where this is the case. The estimators of the population parameters will be the same as those considered earlier. The next theorem gives bounds on the estimation errors.
Theorem 6**.**
In the setting of the model provided in (4), define
[TABLE]
for . If , assume that for some , and if , assume that for some . Set . Suppose that assumption (I2) from Definition (1) holds and that for each , lie in with the derivative being -Hölder continuous for some . Assume that and are independent for each . Also assume that . Then:
- (a)
, and almost surely, where the constant term is uniform in .
- (b)
* almost surely.*
Remark 8**.**
Model misspecification
Theorem 6 reveals that if the are small, the effect of misspecification is also small. Here are two such cases:
(a) When , . So, in this case, if has a large enough contribution compared to for all , then the ’s are small.
(b) On the other hand, if , then if is small, i.e., the condition number of the process is large (which essentially implies that the process is “close” to a rank one process provided ), then the ’s are small. This can be compared to the minimum eigenvalue registration principle of Ramsay and Silverman (2005), where one tries to find the warp function that minimises the second eigenvalue of the cross-product matrix between the target function and the registered function. Assume that and without loss of generality that . If in reality the true unobserved curves are rank one, i.e., the component, and we observe warped versions of the rank two curves ’s, then (in the population case) correct registration is achieved by if the minimum eigenvalue, namely , of the expected cross-product matrix equals zero. Thus, in the empirical case, if is close to zero, we may expect to be close to and consequently expect the registration procedure to have good performance.
Convergence of other model parameters
Bounds similar to those in (a) and (b) of Theorem 6 can also be obtained for the mean, the covariance, the ’s and the ’s as well as the principal components ’s. We do not include them in the statement of the theorem because they need more complicated conditions involving the parameters.
General (possibly infinite) rank situation
Let for some , where the are uncorrelated with zero mean and unit variance. Without loss of generality, we assume that . The errors in estimation when remain the same as in Theorem 6. When , then we define for , where for . In this case, under the conditions of Theorem 6, the bounds as in that theorem still hold true. Note that for all . So, in the general case, the performance of the registration procedure studied in the paper will only depend on how small is and does not in general depend on the values of the ’s (or the ’s for ). In other words, only the behaviour of the second frequency component relative to the first one matters (which elucidates the role of in the standard model, i.e. Equation 1, whose role is precisely to tune this behaviour). Of course, the magnitude of the error in estimation for the same value of will now differ from the rank case because of the presence of the additional terms. We have investigated these issues in a simulation study in Section 5.3 (see, in particular, Figure 6).
Comparison with pairwise warping
In the setup of the infinite rank latent model considered in item (3) above, we now compare the bounds obtained in Theorem 6 to those obtained by Tang and Müller (2008). Denoting , it follows that the latent model is exactly the same as considered in that paper (see p. 877 with there replaced by ). So, if , it follows that , which is similar to the bound obtained in Tang and Müller (2008). Our analysis nevertheless refines the results of Tang and Müller (2008) in the sense that it reveals the impact of on the asymptotic bias – larger magnitudes of yield smaller asymptotic bias. Further refinements can be offered by differentiating between the cases and . Specifically, when , it can be shown that . Thus, in this case, the error bounds on the warp maps in Theorem 6 do not depend on . This is to be expected for the following reason. Note that means that the latent process in this case is for a constant , and hence, the warped process is . Thus, the warped version of the process differs from the warped version of the process only by a constant shift and a scale factor. Ideally, any proper registration procedure should be invariant with respect these transformations since they do not affect the time scale. This is clearly true for our procedure. We should thus get the same estimates of the warp maps if we work with the warped process (which does not involve ) instead of .
5 Numerical Experiments
We now carry out simulation experiments to probe the finite-sample performance of our registration procedure. First we treat the case of a well-specified identifiable regime without error, and then separately the case when there are measurement errors in the observations. Finally, we consider the setup when the rank of the latent process is more than one (departure from identifiability). In all cases, we have compared the performance of the proposed registration method to the continuous monotone registration (CMR) method by Ramsay and Li (1998), the pairwise registration (PW) technique of Tang and Müller (2008) and registration using the Fisher-Rao metric (FMR) studied in Srivastava et al. (2011). The CMR procedure is implemented using the “register.fd” function in the R package fda. The PW procedure is implemented using the Matlab codes in the PACE package. The FMR method is implemented using the “time_warping” function in the R package fdasrvf. The tuning parameters in the PW method are always chosen to be the default ones since the other choices were found to be computationally extremely intensive. For the CMR procedure, we compared its performance by using different numbers of B-spline basis functions in the structure of the warp maps (see Ramsay and Li (1998)). This varies their complexity. However, we found that the best performance was obtained when the warp maps are simple. As will be seen in the simulations, the registration procedures involving structural assumptions on warp maps and consequently more tuning parameters (CMR and PW) encounter difficulties in several of the models considered, which is probably due to the mis-specification of the true warping mechanism.
5.1 Identifiable Regime Without Measurement Error
Let , and consider two models:
- Model 1:
, ;
- Model 2:
, .
In either case, the sample size is and the curves are observed at equally spaced points in . The warp maps are chosen according to the last point of Remark 5 with the parameters , , where , with independent of , and .
The kernel for the Nadaraya-Watson estimator as well as the one used to smooth the ’s is the Epanechnikov kernel on . For both the models, the bandwidths used in the registration procedure were chosen to under-smooth the data so that the features (maxima, minima, etc.) are not smeared out. In order to provide smooth registered curves, we have smoothed the ’s using cubic splines with equi-spaced knots on , prior to synchronising the data.
Figure 2 shows the plots of the true, warped and registered data curves; the true, warped and registered means; and the true, warped and registered leading eigenfunctions under Model 1 and Model 2. Figure 2 suggests that the procedure studied in this paper has been able to adequately register the discretely observed and warped sample curves. Moreover, it is clear that the cross-sectional mean and the leading eigenfunction of the warped curves differ from the true mean and leading eigenfunction in either amplitude or phase (under either model), while the registration procedure corrects the problem, and the resulting estimates (whether smoothed or raw) are very close to the true functions.
Under both the models, it is seen that the estimates of the mean and the leading eigenfunction obtained using the proposed registration procedure is closest to the true functions compared to all the other methods considered. This is more prominent under Model 2 (see the bottom two rows in Fig. 2), where the estimates of the leading eigenfunction obtained by all of other competing procedures considered are far from the true eigenfunction. Also, the registered functions obtained using the CMR and the PW methods do not resemble the true functions (see Figures 8 and 9 in Appendix A (Supplementary Material)). The above facts show that for small sample sizes, even under no measurement error, some of the well-known registration procedures may yield unsatisfactory results, while the proposed procedure works well in these cases.
5.2 Identifiable Regime With Measurement Error
We now consider the situation when the warped observations under an identifable rank one model have been observed with measurement errors. As observed in our theoretical study in Section 4.1, the rate of convergence will be much slower than the case when there is no measurement error. For our simulations, we thus keep the same two models as in Section 5.1 but increase the sample size to . The measurement errors under Model are i.i.d. while those under Model are i.i.d. . The bandwidths for the smoothing steps involved in the registration procedure are chosen using built-in cross-validation bandwidth choice function “regCVBwSelC” in the locpol package in the R software. Figures 3 and 4 show the plots of the unobserved true rank one curves, the warped curves that are observed with error and the registered curves. They also contain the plots of the mean function and the leading eigenfunction of the true, warped and registered data under the two models.
It is observed that even subject to measurement error contamination, the proposed registration procedure is able to adequately register the curves. In particular, under Model 2, the means as well as the leading eigenfunction of the true and the registered curves are quite close. We also performed the registration procedure with a Nadaraya-Watson estimator (without boundary kernels) for obtaining an estimate of the ’s (see Step 3**). The performance was not that different from the one using a local linear estimator.
Only the FRM procedure fares similarly as the proposed one when estimating the leading eigenfunction under both models. However, the PW method yields quite similar estimates of the mean as the proposed method and the FRM method under each of the two models. Both the CMR and the PW methods fail to produce adequately registered curves as is seen from Figures 10 and 11 in Appendix A (Supplementary Material). The improvement in the performance of the FRM technique under Model 2 with error compared to the case without error considered in the previous subsection (see the plots in the last row in Fig. 2) is perhaps due to the increased sample size, which compensates for the measurement error.
5.3 Potentially Unidentifiable Regime
We next carry out experiments to probe the performance of the registration procedure in a rank and a rank setting – these correspond to a potentially unidentifiable regime. The model considered in the rank case are with , , and . In the rank case, we consider with the same choices of and as above for along with and . The warp maps are the same as those considered in the simulation study in Section 5. The plots of the true curves, the warped curves and the registered curves are provided in Figure 5 for the rank and the rank models. The potentially unidentifiable setting has to be interpreted as follows: in light of Theorem 1 and the ensuing counter-examples, there may be other models that could have generated the (statistically) same data. Consequently, strictly speaking, we cannot really talk about good or bad performance, as we there may be several equally valid “ground truths” to compare to. But the way we have constructed the potentially unidentifiable simulation setting is by means of a mild departure from an identifiable model. Therefore, we can arbitrarily consider that the latter identifiable model is the truth and investigate whether the registration procedure is stable to the said mild departure. A more detailed investigation of stability is pursued later in this subsection.
It is observed that the registration procedure performs quite well and aligns the peak (present in the true curves) adequately under both models (see Figure 5). Further, the two smaller troughs near the end-points present in the rank model are also reasonably aligned (see the plots in the third row in Figure 5). However, except the FRM procedure, the other two competing methods completely fail in registering the data curves (see Figures 12 and 13 in Appendix A (Supplementary Material)). Also, unlike our procedure, the registered curves using the FRM procedure seems to lack the two troughs present in the original curves near the boundary points for the rank model. For each of the two models, the mean seems to be estimated very well based on the registered curves using our procedure. The other procedures follow suit. A similar statement is also true for the first eigenfunction under these two models. However, there is more bias in the estimate of the second eigenfunction under the rank model for all of the registration procedures. Under the rank model, the CMR and the PW methods are not fully able to capture the shape of the second eigenfunction, while our procedure and the FRM method does. The third eigenfunction under this model is somewhat reasonably estimated only by our procedure.
In order to probe the breakdown point of the proposed registration procedure in the rank setting, we also considered classes of rank and rank models, recorded the relative -error in estimation of the data curves, i.e, the median of , and consider a threshold of error as a criterion for good performance. The models are generated similar to the earlier simulation. For the rank case, let , where , , where and . The choices of and ensure that we include both approximately rank models ( and close to zero) as well as proper rank models (large values of ). Similarly, for the rank case, let , where . Figure 6 shows a plot of the relative -errors under these classes of models, for various combinations of the parameters and . It is seen that when is large, the performance of the registration procedure is good, which conforms with our theoretical arguments in Theorem 6. In fact, for this class of rank models, the maximum error does not exceed . On the other hand, when is small, the allowable range of values for good performance is much greater in the rank setup compared to the rank setup (cf. (c) in Remark 8). In fact, in the rank setup, the error is more than for all in the range considered when . Further, the maximum error is now .
6 Data Analysis
In this section, we illustrate the performance of our registration procedure on a data set of growth curves of Tribolium beetle larvae, collected and analysed by Irwin and Carter (2013). Each curve represents the mass measurement (in milligrams) as a function of the age of the larvae since hatching (in days). Their analysis of Tribolium growth suggests that these beetles’ growth patterns differ from those of other animals with determinate growth (that is, growth that is contained in certain life stages). Usually, the longer the growth period, the larger the maximal mass attained (see Irwin and Carter (2014), and references therein). In Tribolium, however, it seems that beetles that tend to grow faster, and thus have a shorter growth period, also tend to attain larger size (e.g. Figure 7, top left). See Irwin and Carter (2013) for more details and background. This observation suggests that the Tribolium data could be well-suited for a phase-amplitude analysis under a latent rank 1 model that has been warped: one expects that correcting for different “growth clocks” (phase variation) should yield curves that are roughly of unimodal amplitude variation, due to final mass. Conversely, it suggests a potential latent model that produces rank 1 vertical variation related only to final mass, and horizontal variation due to growth timing (i.e. how this total final mass is accumulated in time).
For our analysis, we have only considered the part of the dataset where there were at least discrete measurements per individual curve, which results in a sample size of . Also, not all larvae were recorded on the same day so that the number of observations differed across individuals. Since there are relatively few measurements (maximum ) per individual larvae, we smoothed each observation vector as a pre-processing step. This was done using the built-in function splinefun in the R software with the method monoH.FC that uses monotone Hermite spline interpolation proposed by Fritsch and Carlson (1980) (since the curves are expected to be approximately increasing).
As is typically the case with growth curves, one expects that, if unaccounted for, the lurking phase variation would give the impression of several modes of amplitude variation. The aim our analysis is thus to register the curves, estimate the warp maps, estimate the mean of the registered curves, and carry out an eigenanalysis of the registered data.
It is indeed observed that prior to any registration, the data present at least two substantial modes of amplitude variation, with the first three principal components explaining , and of the total variation, respectively. However, after registration using our method, the empirical covariance operator is almost precisely of rank 1, with the leading principal component explaining of the total variation. Interestingly, the mean of the registered data has the same shape as the leading eigenfunction and is in fact roughly equal to times the leading eigenfunction. This can be seen as a model diagnostic, corroborating the model: if the rank 1 model were correct, then after registration one would expect to have a single mode of amplitude variation and a mean in the span of the corresponding eigenfunction (see the discussion after Counterexample 1).
Figure 7 show the plots of the actual data, the monotone spline smoothed data and the registered data, as well as the plot of the estimated warp maps and the average warp map, which is very close to the identity. It also shows the plots of the mean and the leading eigenfunction of the warped and the registered data. Although the means of the warped and the registered data are very close, there are substantial qualitative differences between the corresponding eigenfunctions. The eigenfunction of the registered data shows that the variation in growth pattern essentially starts at about the days after hatching. Between ages days post hatching, there is a notable increase in the growth variation, and it somewhat recedes after that age. These periods are in fact compatible with biologically interpretable phases of growth: the larvae enter an “instar” (a distinct growth period between exoskeleton moults) characterised by exponential growth at around day 7-8; then, around day 17, they enter the “wandering phase” and begin losing weight in preparation for pupation.
The performance of the FRM technique is very similar to the proposed procedure and results in an almost rank one registration. However, the CMR and the PW procedures do not yield a rank one registration although the estimated means are very similar to that obtained by our procedure, which is observed by comparing Figure 7 with Figure 14 in Appendix A (Supplementary Material). However, the difference lies in the registered curves and the estimate of the leading eigenfunction. The latter shows some artifacts which do not conform to the biological explanation provided earlier, e.g., the presence of flat regions in the estimated eigenfunction during the “instar” phase of exponential growth as well as the growth spurt towards the end where the larvae would actually enter the “wandering phase”.
Acknowledgements
We are grateful to Dr. Kristen Irwin (EPFL) for kindly sharing and discussing her Triboleum data set. We also thank the reviewers for their constructive feedback.
Appendix A Supplementary Material
A.1 Some Practical Issues
As mentioned earlier, is a step function with jump discontinuities at the grid points. In particular, for and for . Thus, and , which is less than if , i.e., the grid does not include the right end-point. In this case, and thus is properly defined only for . Also, and equality holds iff . Thus,
[TABLE]
[TABLE]
Then, . One can then extend to the whole of by, e.g., linearly interpolating between and . This practical modification, in case , enjoys the same asymptotic properties as the originally defined estimator (Section 4), since the effect of the modification is asymptotically negligible due to the homogeneity assumptions on the grid.
Similarly, iff . So, in case , we have . This is not a problem since this estimator is not used in the registration procedure and the problem disappears asymptotically anyway, just as described above.
We conclude this section by noting that, since the estimates of the warp maps do not involve any smoothing and are obtained from compositions of step functions, the resulting registered curves will not be very smooth. This will be particularly noticeable if the number of grid points is small. Note that even in that case, the estimated mean function will be smoother if the sample size is moderately large. If one is interested in obtaining a smooth registration of the sample curves, the following procedure may be adopted. First, we produce smooth versions of the by some non-parametric smoothing procedure, e.g., polynomial splines of a fixed degree , and call these new estimates as , say. Then, we plug-in these smoothed estimates of the warp functions and define the new registered observations as . It is well-known that a spline smoothed estimate of a smooth function converges to that function in the sense provided the oscillations of the function go to zero as the number of knots grows to infinity (see Theorem 6.27 in Schumaker (2007)). The latter holds for the ’s since they lie in (see equation (2.121) in Theorem 2.59 in Schumaker (2007)). Thus, this modified estimator will also provide consistent registration.
A.2 Proofs of Formal Statements
Proof of Lemma 1.
Since , we have
[TABLE]
by Definition 2. Next, so that . Thus, using the strict monotonicity of , we have
[TABLE]
A standard change-of-variable argument and the fact that is a bijection with and now yields . So, , equivalently, . Using the assumption that , we now have . ∎
Proof of Theorem 1.
Note that is a Lipschitz map. Thus, implies that . Consider the random probability measure given by
[TABLE]
for in the Borel -field of . Similarly, . We equip the space of diffuse probability measures on with the -Wasserstein metric (see, e.g., Villani (2003)) given by , where and are the distribution functions associated with the probability measures and . Now for any satisfying for , consider the measure with density for . The condition is equivalent to . Since and are supported on the bounded set , it follows from Proposition 7.10 in Villani (2003) that for a constant , where is the total variation distance. It now follows that
[TABLE]
Thus, the embedding is continuous when the domain, say, is restricted to the set of all non-constant functions on . But the set is a one dimensional linear subspace spanned by the constant function , and this implies that is a Borel measurable subset of . So, is a Borel measurable subset of . Equip with the Borel -field induced from . Since , we have that is a valid random probability measure on . Note that for any Borel subset of , we have . Thus, for any Borel subset of , we have
[TABLE]
The first equality follows from the continuity of on and the fact that discussed above. The second equality follows from the fact that and have the same distributions by assumption. So, as random probability measures.
Next, note that the random measures , , have strictly increasing cdfs almost surely. Proposition in Panaretos and Zemel (2016) states that for each , the map admits a unique minimizer given by , where is the random distribution function of the random measure . Since with being a strictly increasing homeomorphism on , it follows from the change-of-variable formula that . Thus, , equivalently, , where is the cdf associated with the (deterministic) probability measure .
Note that has a continuous and strictly increasing cdf since is zero only on a countable set for . Since , it follows that the minimizer for . But since , it now follows that . Also, , equivalently, . Using the above facts and the result obtained in the previous paragraph, it now follows that .
We next claim that the joint distributions of , are the same. To this end, consider the map defined from to with the latter being equipped with the induced product topology and the induced product -field. It follows from the same arguments used to prove the continuity of that is continuous. Thus, for Borel subsets and of , we have
[TABLE]
Next, note that is the true unobserved process. It is easy to show that the map from into is continuous. Thus, using the observation in the previous paragraph, we have as random elements in . It follows from the equality of distributions that their covariance operators are equal, and thus the corresponding eigenfunctions are equal. Now, the covariance operator of is given by . Since is a rank one process, the equality of the covariance operators implies that (since ). This equality along with the fact that implies that . ∎
We will now state and prove the statement regarding identifiability of the rank two model obtained as the rank 1 model plus a random vertical shift, as discussed in Remark 1.
A general rank two model would be given by the Karhunen-Loève-like expansion , where and are uncorrelated random variables, and and are orthonormal eigenfunctions of the covariance operator of (we assume here that the mean function lies in the range of the covariance operator of ). The rank two model discussed in Remark 1 satisfies . Thus, the following conditions are satisfied for the model :
- (a)
= 0, and
- (b)
and are uncorrelated.
We call these two conditions Condition (R).
Theorem 7**.**
Consider two random elements in defined as for real-valued random variables , deterministic functions with and vanishing on at most a countable set. Assume that both and satisfy Condition (R) above. Further, assume that are strictly increasing homeomorphisms in , and such that . Write for . Then,
[TABLE]
Proof.
Note that is a Lipschitz map. Thus, implies that . Observe that is the same as the derivative of the process , where is a rank one process defined by , and . So, the proof of Theorem 1 shows that . It also follows from that proof the joint distributions of and are the same. Hence, , i.e., , where the constant function one is denoted by .
Observe that by Condition (R), for . So, , and hence there exists a non-zero such that and . Thus,
[TABLE]
which implies that .
Next observe that
[TABLE]
for each . Using Condition (R) along with the fact that (shown previously), it now follows that
[TABLE]
Using the fact that , the above equality implies that
[TABLE]
Since , the above equality implies that . Finally, there exists a non-zero such that and . So,
[TABLE]
which implies that . ∎
Proof of Theorem 2.
First observe that the ’s are also i.i.d. random elements in . Moreover, since is strictly increasing and positive, we have . Thus, by the strong law for Banach space valued random elements (see, e.g, Theorem 2.4 in Bosq (2000)), it follows that as almost surely. In addition, if implying that , then the almost sure convergence holds in .
(a) Since , using Theorem 2.18 in Villani (2003), we get that
[TABLE]
(b) Since each is a strictly increasing bijection on , we have
[TABLE]
Since both and are strictly increasing homeomorphisms, the uniform convergence of to follows as a consequence of the above uniform convergence.
Suppose now that Condition 1 holds. We have discussed towards the beginning of the proof that in this case as almost surely. In view of the first half of part (b) of the theorem along with the definition of the norm, it is enough to show the uniform convergence of the derivatives. Since each is a strictly increasing bijection on , so is for every . First note that
[TABLE]
where is the constant function taking value . It thus follows from an earlier bound that
[TABLE]
Next note that so that . Now,
[TABLE]
Since is continuous on , it is uniformly continuous. This and the fact that as almost surely implies that as almost surely. Combining this fact with the uniform convergence of to , we get that as almost surely.
(c) Note that
[TABLE]
since as almost surely, and is continuous on and hence uniformly continuous.
Suppose now that Condition 1 holds. Then, as before,
[TABLE]
Using similar arguments as earlier, we conclude that and hence as almost surely.
(d) Observe that since , it follows from the change-of-variable formula that . Thus,
[TABLE]
(e) Observe that
[TABLE]
Since the ’s are i.i.d. random elements in with , we conclude from the strong law for Banach space valued random elements that as almost surely. Also, from the proof of part (c), we have that
[TABLE]
as almost surely. Thus, using similar arguments as in part (c) of the theorem, we obtain as almost surely. Combining the above facts, we conclude as almost surely.
Note that since , it follows that as almost surely. Now, suppose that Condition 1 holds. A similar decomposition as above yields
[TABLE]
The proof of part (c) implies that
[TABLE]
The right-hand term above converges to zero as almost surely. The result is now established upon combining the above facts.
(f) Straightforward algebraic manipulations yield
[TABLE]
Denote . Then,
[TABLE]
Using the Cauchy-Schwarz inequality, we have , and as almost surely. It follows from the arguments in the proof of part (c) of the theorem that
[TABLE]
and the right hand side is as almost surely since . Further, as almost surely. Thus, as almost surely.
The proof of the uniform convergence of to is obtained by use of a decomposition of similar to the one used above, noting that converges uniformly to (by the strong law of large numbers in ), and the fact that all the other bounds hold in the supremum norm.
Next, note that and for all , where as almost surely. Also, as almost surely. So,
[TABLE]
as almost surely. Thus, as almost surely.
Finally, as almost surely. ∎
Proof of Theorem 3.
We have and by assumption . So, by the CLT for i.i.d. valued random elements (see, e.g, Theorem 2.4 Bosq (2000)), we have for a zero mean Gaussian random element in .
(a) From the proof of part (a) of Theorem 2, one has that . Now, it is easy to check that the map is continuous. The result follows from the continuous mapping theorem.
(b) Note that for each fixed , we have , where and . We will first derive the weak limit conditional on . From the previous paragraph, it follows that conditional on , , and , being a constant sequence, converges conditionally in probability to as . So, by Theorem 4.4 in Billingsley (1968), conditional on , we have in the topology. Using the fact that the map is continuous in (see, e.g., p. 155 in Billingsley (1968)), it follows from the continuous mapping theorem that conditional on , as for each fixed . Thus, by the Dominated Convergence Theorem, the unconditional distribution of converges weakly as for each fixed .
To prove the weak convergence of , we will as earlier first derive its weak limit conditional on . Now, using the fact that almost surely, we have
[TABLE]
for some (possibly depending on and ). Thus,
[TABLE]
where the term is uniform in since as almost surely. Using similar arguments as in the above proof and noting that as almost surely, we deduce that as . Thus, by the Dominated Convergence Theorem, the unconditional distribution of converges weakly as for each fixed .
(c) Note that for each fixed ,
[TABLE]
where , and the term is uniform in as earlier. Similar arguments as in part (b) above yield as for each fixed .
(d) The proof is similar to that of part (a) and is omitted.
(e) Note that
[TABLE]
which follows from similar arguments as in part (c) and the independence of the ’s and the ’s.
(f) For the first part, note that
[TABLE]
Now, some straightforward manipulations yield
[TABLE]
So,
[TABLE]
The first term on the right hand side of the above equality converges in distribution to since as almost surely. For the latter reason, the third and the fourth terms converge to zero in probability as . For the second term, note that
[TABLE]
Thus, by similar arguments as in part (c) earlier, and the continuity of the mapping from to the space of Hilbert Schmidt operators, we have that the second term converges in distribution to . Combining the above observations and the fact that in probability (follows from part (e)), we deduce that
[TABLE]
as .
In order to prove the weak convergence of the empirical process in , we follow the same decomposition as in the proof of the weak convergence of the operators in the Hilbert Schmidt topology. Now, note that the proof of part (c) of the theorem implies that the empirical process in converges in distribution to the process in . This fact and the same arguments as in part (f) yield
[TABLE]
as , where does not depend on .
For the weak convergence of , first note that . Thus, . Now,
[TABLE]
Using the weak convergence of to in the topology, we have that
[TABLE]
as in the topology.
Finally, for the weak convergence of the ’s, observe that
[TABLE]
Using the independence of and the ’s, and using the asymptotic distributions obtained above and in part (c), it follows that
[TABLE]
as . ∎
In order to prove Theorem 4, we will first prove a few crucial results.
Proposition 1**.**
Assume that and almost surely for a deterministic constant . Then, for each , we have almost surely, where almost surely with the term being uniform in . Further, for all almost surely, where almost surely with the term being uniform in . Consequently, we have and for all almost surely, where and almost surely.
Proof of Proposition 1.
First, let us define and in case and . Then, is a partition of . Consider the sum and note that by a Taylor expansion, , where . The right hand side is a Riemann sum approximation of with as the partition of , since is a strictly increasing bijection. Thus, writing , we have
[TABLE]
Now for any , we have
[TABLE]
for some . Using the assumption in the theorem and that on the grid, it now follows that uniformly on . Thus, . To complete the first part of the proof, note that differs from by at most two terms, and both of these terms are uniformly over by the same arguments as those for .
For the second part, fix any . Defining , there is nothing to prove when . For , define . If is the largest for which , define if . Note that depends on . Then, is a partition of , and hence is a partition of . Define . Then, by similar arguments as earlier, we have
[TABLE]
Thus, uniformly over . The proof is completed upon noting that differs from by at most two terms, and both of them are uniformly over by the same argument as before.
The last statement of the proposition is an immediate corollary for the case almost surely. ∎
Note that the ’s are not continuous functions, but we can still define their norms as all of them are uniformly bounded functions on . The following corollary is a consequence of Proposition 1 and the fact that .
Corollary 1**.**
Under the assumptions of Proposition 1, we have for all almost surely for each , where almost surely uniformly over . Further, for all , where .
Lemma 2**.**
Assume that for some . Then, , where . In other words, is -Hölder continuous for .
Proof of Lemma 2.
Note that the assumption in the statement of the lemma implies that almost everywhere with respect to the Lebesgue measure on . This fact along with Zarecki’s theorem on the inverse of an absolutely continuous function (see, e.g., p. 271 in Natanson (1955)) applied to the function yields that is absolutely continuous on . Thus, . Now, using Hölder’s inequality and some algebraic manipulations, we obtain
[TABLE]
To complete the proof, choose , which implies that . ∎
Proposition 2**.**
*Assume that the conditions of Proposition 1 and Lemma 2 hold. Let as in Lemma 2. Then, for each ,
(a) is -Hölder continuous almost surely.
(b) for all almost surely, where almost surely uniformly over .*
Proof of Proposition 2.
(a) Using the definition of , it follows that
[TABLE]
where the last inequality follows from Lemma 2. This completes the proof of part (a).
(b) As mentioned earlier, is a càdlàg step function with maximum jump discontinuities given by . Thus, if for any , it follows that , where . So, , where is the maximum step size of defined earlier. Now, from arguments similar to those used in Proposition 1, it follows that uniformly in . Thus, for all almost surely, where almost surely uniformly over .
From Proposition 1, we know that for all almost surely, where almost surely uniformly over . Letting , we now have for all almost surely. Re-arranging terms, we obtain for all almost surely, where . Thus, almost surely uniformly over . Now, using part (a), we can conclude that for all almost surely, where satisfies almost surely uniformly over . ∎
Proof of Theorem 4.
(a) Note that
[TABLE]
for all almost surely, where almost surely since almost surely and almost surely. Thus, it follows from Theorem 2.18 in Villani (2003) that
[TABLE]
almost surely. Combining the above statement with part (a) of Theorem 2 and 3 completes the proof of part (a) of Theorem 4.
(b) Next, note that
[TABLE]
for all almost surely. By part (a) of Proposition 2, we have for all almost surely, where almost surely uniformly over . Thus, almost surely. Similar arguments yield
[TABLE]
almost surely. Thus,
[TABLE]
for all almost surely, where almost surely uniformly over . Consequently,
[TABLE]
almost surely, where the term is uniform over . This along with part (b) of Theorem 2 shows that as almost surely for all . Equation (5) implies that in . This in conjunction with part (b) of Theorem 3 proves that has the same asymptotic distribution as in the topology.
Next we consider for all almost surely (from part (b) of Proposition 2). Note that , where and . Thus, . Also note that is a strictly increasing homeomorphism on . Define so that is an increasing function (not necessarily strictly increasing) from onto . In fact, since each is left continuous and has right limits (being the generalized inverse of the càdlàg function ), is also left continuous and has right limits.
If for some with , then . Now, uniformly in almost surely, where the penultimate equality follows from the continuity of . So, in these cases, uniformly in almost surely, i.e., uniformly in almost surely.
Next, suppose that for some , we have , for and for . If , then if is a continuity point of . If not, then this is already taken care of in the previous paragraph. In the former case, we have uniformly over almost surely.
Finally, if is a point of both continuity and strict increment of , then as well, which implies that uniformly over almost surely. Thus, all possibilities are exhausted. Let us denote the term by .
Now note that . Thus, it follows from our work above that . Recall that and that for all almost surely as obtained earlier. Since , it follows from the decomposition of that for all almost surely. Since , it follows that . So, by Taylor expansion, we have for all almost surely, where almost surely, where the term is uniform over .
Combining the above findings, we arrive at
[TABLE]
where the last equality follows from the discussion in the previous paragraph. Since almost surely uniformly over , we obtain
[TABLE]
for all almost surely, where almost surely uniformly over . Consequently,
[TABLE]
almost surely. Combined with part (b) of Theorem 2, this shows that as almost surely for all . Equation (A.2) implies that in . This in conjunction with part (b) of Theorem 3 proves that has the same asymptotic distribution as in the topology. This completes the proof of part (b) of Theorem 4.
(c) Next we register the warped functional observations. As mentioned earlier, since the warped observations are only recorded over a discrete grid, the registration algorithm in the fully observed case will not work. So, as a pre-processing step, we need to first smooth the warped discrete observations. We do this by using the Nadaraya-Watson kernel regression estimator as follows. Let be any kernel supported on and choose a bandwidth parameter . Then, the smooth version of is given by
[TABLE]
Now, note that
[TABLE]
for all almost surely, where is a constant not depending on and . The first inequality above follows from arguments similar to those used in the proof of Theorem 1. The second inequality follows form the fact that is supported on so that only those ’s in the numerator for which will contribute to the sum. Thus, almost surely.
We register the warped discrete observation by defining for each . Observe that
[TABLE]
for all almost surely, where the term is uniform in and . The last two inequalities above follow from a first order Taylor expansion and the fact that almost surely uniformly over . Hence,
[TABLE]
almost surely. In conjunction with part (c) of Theorem 2, this shows that as almost surely for all . Equation (6) implies that in . Invoking part (c) of Theorem 3 thus establishes that has the same asymptotic distribution as in the topology. This completes the proof of part (c) of Theorem 4.
(d) Next, define the random measure induced by as
[TABLE]
for all almost surely, where the term is uniform in and , and the last equality follows from the fact that almost surely. Also note that by definition of , the term cancels from the numerator and the denominator.
Using the fact that with almost surely, and arguments similar to those used in the proof of Proposition 1, one obtains
[TABLE]
for all almost surely, where the term is uniform in and almost surely. Now, using Lemma 2 and arguments similar to those used in the proof of part (b) of Proposition 2, we have
[TABLE]
for all almost surely, where the term is uniform in and almost surely. Thus,
[TABLE]
almost surely. Combining the above statement with part (d) of Theorems 2 and 3 completes the proof of part (d) of Theorem 4.
(e) Next, define . Since almost surely, it follows that
[TABLE]
almost surely since . Along with part (e) of Theorem 2, this shows that as almost surely. Equation (7) implies that in . So by part (e) of Theorem 3 we see that has the same asymptotic distribution as in the topology, and the proof of part (e) of Theorem 4 is complete.
(f) Next, we consider the empirical covariance operator of the ’s which we will denote by . Recall from the proof of part (f) of Theorem 3. Now, some straightforward manipulations yield
[TABLE]
Note that almost surely. Next, from the previous paragraph, it follows that . Moreover, almost surely. Observe that
[TABLE]
Since almost surely, it follows that the first term above is almost surely, and the second term is almost surely. Similar arguments show that almost surely. Thus, almost surely. Also, in the proof of part (f) of Theorem 3 satsifies . Combining the above facts and using the decomposition of in the proof of part (f) of Theorem 3, it follows that
[TABLE]
almost surely. This along with part (f) of Theorem 2 shows that as almost surely. By part (e) of Theorem 3, it follows that as . So, equation (8) implies that in . This in conjunction with part (f) of Theorem 3 proves that has the same asymptotic distribution as in the Hilbert-Schmidt topology.
For the convergence of the empirical covariance kernel , we follow the same decomposition as above for the case of the operator. Noting the all the bounds used for that proof remain valid in the sup-norm and using the same arguments, we arrive that
[TABLE]
for all almost surely, where the term is uniform in almost surely. This along with part (f) of Theorem 2 shows that as almost surely. Equation (9) implies that in with the term being uniform in . This in conjunction with part (f) of Theorem 3 proves that has the same asymptotic distribution as in the topology.
To prove the strong consistency and the weak convergence of the estimated eigenfunction, we will use perturbation bounds for compact operators (see, e.g., Ch. 5 of Hsing and Eubank (2015)). The leading eigenfunction of satisfies the inequality as almost surely. Further, Theorem 5.1.8 of Hsing and Eubank (2015), specifically equation (5.27), implies that has the same asymptotic distribution (in ) as that of , where, in our setup, with being the leading eigenvalue of , and being the identity operator. Thus, from the results already establishes, it follows that the asymptotic distribution of is that of . Using the expression of the asymptotic distribution of obtained in part (f) of Theorem 3 and some simple calculations, it follows that the asymptotic distribution of is that of , which is the same as in Theorem 3.
The proof of the strong consistency and the weak convergence of follows in direct analogy to that of upon using part (c) and the above facts. The proof of part (f) of Theorem 4 is now complete. ∎
Proof of Theorem 5.
First observe that
[TABLE]
Since the term will be key for our proof, we will first bound . To achieve this, we will first provide bounds on using standard tools from non-parametric regression. So, we will have to estimate the MSE for the regression problem and integrate this MSE over , when and are fixed. The expression for the MSE in the deterministic design case is the same as the conditional MSE (given design points) in the random design case with the design distribution being uniform on . Next, observe that does not depend on and and is thus uniform over (since the ’s are i.i.d.). For , the expression of this variance is given in p. 137 in Wand and Jones (1995) and equals , where the term depends on , is bounded and is uniform over . Next, we have to take into account the boundary points. Let for some . It follows from a similar analysis that even in this case, , where the term is integrable over (see, e.g. pp. 244-247 in Schimek (2000)). Similar estimates also hold for , say . Hence, we get that for all with the term being integrable over .
Next we consider the bias. In our case the degree of the fitted polynomial is one more than the degree of derivative estimated. Thus, applying Taylor’s formula and using the expressions in Thm. 9.1 and pp. 244-247 in Schimek (2000), we have for all . Here, the and terms are non-random and are integrable in . So, using the moment assumptions on the sup-norm of the derivatives of , the independence of the ’s and the ’s along with the assumption that , it follows that
[TABLE]
where the terms are bounded and do not depend on (the ’s are i.i.d). This also implies (using Markov’s inequality) that
[TABLE]
We will now proceed with the rest of the proof. First, let . From (10), it follows that , where . Thus, using part (a) of Proposition 2, it follows that for a constant . So, . Thus, , where . Define . By Hölder’s inequality, the law of large numbers, independence of ’s and ’s, and (12), we get that
[TABLE]
(a) Since , the proof follows using part (a) of Theorem 3 and (13).
(b) Note that using statements proved earlier. Now, arguments in the proof of part (b) of Theorem 3 along with (10) yield , where . Thus, , where . The proof of the first statement in part (b) of this theorem now follows using part (b) of Theorem 3 and (13).
Next consider , where from statements proved earlier. Note that if then , where . So, . Noting that , we get that , where . This follows from arguments similar to those used earlier using the smoothness of and the assumption that . Thus, we finally have
[TABLE]
The proof of the second statement of part (b) of this theorem is now completed via part (b) of Theorem 3, (11) and (13).
For proving part (c) of the theorem we will first have to control for each . Recall that
[TABLE]
where and for . Call the denominator , which is deterministic. We will first analyse the term which is defined like but with in place of . Define .
Using Taylor’s formula, we get that , where lies between and . Plugging-in this expansion in the definition of , we have
[TABLE]
for all . Note that the term involving vanishes, which plays a crucial role in putting the local linear estimator at an advantage over other standard non-parametric regression estimators near the boundary of the data set. Thus, .
By approximations of Riemann sums, we have uniformly for . Also, for , say, with , we have uniformly for . The same estimate also holds for , say, . Define for . These estimates imply that for , we have . Further, for boundary points, we have for , where . In both case, the terms are non-random (hence does not depend on ) and uniform over choices of . Note that the leading term in the squared bias term obtainable from the previous bias expression is an upper bound for the coefficient of the squared bias term in the general result obtained in Thm. 3.3 in Fan and Gijbels (1996). It can be shown using similar arguments that , where the term is non-random and uniform over . Note that for , which correspond to , we have by the symmetry of the kernel. Further, it can be shown that the denominator (which is positive by the Cauchy-Schwarz inequality) in the definition of is a strictly increasing function of and hence its infimum is achieved at , where it takes the value (again by the Cauchy-Schwarz inequality) for any non-degenerate . Thus as the numerator is uniformly bounded in . Hence, , where the and the terms are non-random (and hence do not depend on ).
We next control . Observe that this does not depend on and hence does not depend on (the errors are i.i.d.). Now,
[TABLE]
The second term on the right hand side of the first inequality vanishes due to the uncorrelatedness of the errors and the fact that the ’s are non-random. The bound for the first term follows from the a.s. boundedness of the errors, say with bound . Here, , which is a definition similar to but with a new “kernel” . As earlier, by Riemann sum approximations, we have for with the term being uniform on . Define . Then,
[TABLE]
for all , where the term is uniform over . Note that the expression of is the same as the coefficient of the variance term in the general result obtained in Thm. 3.3 in Fan and Gijbels (1996) (with necessary adaptations). Using (15), it now follows that . Hence, using the assumptions in the theorem and the bounds on obtained earlier as well as the previous bound, it follows that
[TABLE]
where the terms are bounded and do not depend in . Thus, using Markov’s inequality, we have
[TABLE]
(c) Recall that . Thus, using (14) we have
[TABLE]
The proof of part (c) of this theorem now follows from (11), (13), (16) and part (c) of Theorem 3.
(d) Observe that by (18), we have
[TABLE]
The third term on the right hand side can be bounded using Hölder’s inequality and (12) as earlier. The bounds on the first two terms are given by (17) and (13), respectively. The proof of this part of the theorem is now completed upon using these bounds along with part (e) of Theorem 3.
(e) For the proof of this part of theorem, we will use a decomposition of similar to that of in the proof of part (f) of Theorem 3. In the same notation, we obtain the following bounds on and . First, note that . Applying Hölder’s inequality and using (12), (13) and (16), we get that . Next, using part (d) of this theorem and part (e) of Theorem 3, it follows that . In a similar manner, by the Cauchy-Schwarz inequality and the bounds obtained earlier. So, using part (f) of Theorem 3, we have . The bounds for the leading eigenvalue and eigenfunction follow directly by standard bounds in the theory of perturbation of operators. ∎
Proof of Theorem 6.
First assume that . Then, define and for and . Some algebraic manipulations yield
[TABLE]
Thus, almost surely for each . So , where the last equality holds because is a bijection on .
Next, let and . So, . Also, so that . The conditions of the theorem and arguments as in Lemma 2 earlier show that is -Hölder continuous for . Thus, for a finite, positive constant , we have
[TABLE]
Thus, almost surely. Consequently, almost surely. Further,
[TABLE]
as almost surely. Here, the last inequality follows from the moment assumptions in the theorem, the Cauchy-Schwarz inequality, the strong law of large numbers and the fact that the ’s (and hence the ’s) are independent of the ’s. Thus,
[TABLE]
as almost surely, where the constant term is uniform in .
Next, let . Then, . Let so that . Note that . Thus, using the assumptions in the theorem and arguments similar to those used in the proof of part (b) of Theorem 2, we have
[TABLE]
as almost surely. Therefore,
[TABLE]
as almost surely, where the constant term is uniform in .
Next, note that . So,
[TABLE]
as almost surely, where the term is independent on .
Next, consider the case when . Then, define . Some algebraic manipulations yield
[TABLE]
Similar arguments as in the case of now yield the error bounds on the estimators ∎
Plots of the registered curves for the simulated and real data sets in the main paper using some other procedures
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Billingsley (1968) {bbook} [author] \bauthor \bsnm Billingsley, \bfnm Patrick \binits P. ( \byear 1968). \btitle Convergence of probability measures. \bpublisher John Wiley & Sons, Inc., New York-London-Sydney. \bmrnumber 0233396 \endbibitem
- 2Bosq (2000) {bbook} [author] \bauthor \bsnm Bosq, \bfnm D. \binits D. ( \byear 2000). \btitle Linear processes in function spaces. \bseries Lecture Notes in Statistics \bvolume 149. \bpublisher Springer-Verlag, New York \bnote Theory and applications. \bdoi 10.1007/978-1-4612-1154-9 \bmrnumber 1783138 \endbibitem
- 3Chakraborty and Panaretos (2017) {bunpublished} [author] \bauthor \bsnm Chakraborty, \bfnm Anirvan \binits A. and \bauthor \bsnm Panaretos, \bfnm Victor M. \binits V. M. ( \byear 2017). \btitle Functional Registration and Local Variations: Identifiability, Rank, and Tuning. \bnote Tech. Report ar Xiv:1702.03556. \endbibitem
- 4Claeskens, Silverman and Slaets (2010) {barticle} [author] \bauthor \bsnm Claeskens, \bfnm Gerda \binits G., \bauthor \bsnm Silverman, \bfnm Bernard W. \binits B. W. and \bauthor \bsnm Slaets, \bfnm Leen \binits L. ( \byear 2010). \btitle A multiresolution approach to time warping achieved by a Bayesian prior-posterior transfer fitting strategy. \bjournal J. R. Stat. Soc. Ser. B Stat. Methodol. \bvolume 72 \bpages 673–694. \bdoi 10.1111/j.1467-9868.2010.00752.x \bmrnumber 275824
- 5Fan and Gijbels (1996) {bbook} [author] \bauthor \bsnm Fan, \bfnm J. \binits J. and \bauthor \bsnm Gijbels, \bfnm I. \binits I. ( \byear 1996). \btitle Local polynomial modelling and its applications. \bseries Monographs on Statistics and Applied Probability \bvolume 66. \bpublisher Chapman & Hall, London. \bmrnumber 1383587 \endbibitem
- 6Fritsch and Carlson (1980) {barticle} [author] \bauthor \bsnm Fritsch, \bfnm F. N. \binits F. N. and \bauthor \bsnm Carlson, \bfnm R. E. \binits R. E. ( \byear 1980). \btitle Monotone piecewise cubic interpolation. \bjournal SIAM J. Numer. Anal. \bvolume 17 \bpages 238–246. \bdoi 10.1137/0717021 \bmrnumber 567271 \endbibitem
- 7Gervini and Gasser (2004) {barticle} [author] \bauthor \bsnm Gervini, \bfnm Daniel \binits D. and \bauthor \bsnm Gasser, \bfnm Theo \binits T. ( \byear 2004). \btitle Self-modelling warping functions. \bjournal J. R. Stat. Soc. Ser. B Stat. Methodol. \bvolume 66 \bpages 959–971. \bdoi 10.1111/j.1467-9868.2004.B 5582.x \bmrnumber 2102475 \endbibitem
- 8Gervini and Gasser (2005) {barticle} [author] \bauthor \bsnm Gervini, \bfnm Daniel \binits D. and \bauthor \bsnm Gasser, \bfnm Theo \binits T. ( \byear 2005). \btitle Nonparametric maximum likelihood estimation of the structural mean of a sample of curves. \bjournal Biometrika \bvolume 92 \bpages 801–820. \bdoi 10.1093/biomet/92.4.801 \bmrnumber 2234187 \endbibitem
