Nonparametric estimation of jump rates for a specific class of Piecewise Deterministic Markov Processes
Nathalie Krell (IRMAR), Emeline Schmisser (LPP)

N. Krell: Université de Rennes 1, Institut de Recherche mathématique de Rennes, CNRS-UMR 6625, Campus de Beaulieu, Bâtiment 22, 35042 Rennes Cedex, France. Email: [email protected]
E. Schmisser: Laboratoire Paul Painlevé, Université des Sciences et Technologies de Lille, Bureau 314, Bâtiment M3, Cité Scientifique, 59 655 Villeneuve d’Ascq Cedex. Email: [email protected]
Abstract
In this paper, we consider a unidimensional piecewise deterministic Markov process (PDMP) with homogeneous jump rate $\lambda$. This process is observed continuously, so the flow is known. To estimate the jump rate nonparametrically, we first construct an adaptive estimator of the stationary density, then we derive a quotient estimator of $\lambda$. Under some ergodicity conditions, we bound the risk of these estimators (and give a uniform bound on a small class of functions), and prove that the estimator of the jump rate is nearly minimax (up to a logarithmic factor). The simulations illustrate our theoretical results.
Keywords: Piecewise deterministic Markov processes, model selection, nonparametric estimation
Mathematical Subject Classification: 62G05, 62G07, 62M05, 60J25
1 Introduction
Piecewise deterministic Markov processes are a large class of continuous-time stochastic models first introduced by Davis [13]. They are used to model deterministic phenomena in which randomness appears as point events. They are not diffusions, which adds complexity to their study. This family of stochastic processes is well suited to modelling various problems in biology (see for instance Cloez et al. [10], Rudnicki and Tyran-Kamińska [29]), neuroscience (Höpfner et al. [22], Renault et al. [28]), physics (Blanchard and Jadczyk [9]), reliability (De Saporta et al. [14]), optimal consumption and exploration (Farid and David [18]), risk insurance, seismology, etc. See also the references in the survey by Azaïs et al. [4].
In this article, we consider a filtered piecewise deterministic Markov process (PDMP) taking values in , with flow , transition measure and homogeneous jump rate . Starting from initial value , the process follows the flow until the first jump time which occurs spontaneously in a Poisson-like fashion with rate . The post-jump location of the process at time is governed by the transition distribution and the motion restarts from this new point as before.
To fix the ideas, let us consider two major examples of unidimensional PDMP.
TCP (Transmission Control Protocol; see Dumas et al. [17], Guillemin et al. [20] for instance) is one of the main data transmission protocols of the Internet. The maximum number of packets that can be sent at time in a round is a random variable . If the transmission is successful, then the maximum number of packets is increased by one: . If the transmission fails, then we set with . A correct scaling of this process leads to a piecewise deterministic Markov process with flow and deterministic transition measure . This process grows linearly (by construction) and the constant can be configured in the server implementation (so it is also known), but the moment when the transmission fails is of course unknown. In the literature, it is usually assumed that the jump rate satisfies , but this work makes it possible to check whether this assumption is realistic.
Another example of PDMP is the size of a marked bacterium (see Doumic et al. [16], Robert et al. [26], Laurençot and Perthame [25]). We randomly choose a bacterium and follow its growth until it divides in two. Then we randomly choose one of its daughters, and so on. Between the jumps, the bacterium grows exponentially: . The size of the bacterium after the division is random, as the bacterium does not divide into two equal parts.
The process is observed continuously without errors (so the flow is known); it is assumed to be ergodic, with fast convergence toward the stationary measure, and exponentially $\beta$-mixing. We denote by the jump times and consider the Markov chain . Our aim is to construct a nonparametric adaptive estimator of the jump rate on a compact interval.
There exist few results concerning PDMP’s estimation. Azaïs et al. [5] and Azaïs and Muller-Gueudin [3] consider a more general model, for a multidimensional PDMP. They construct a quotient of kernel estimators, which estimate the compound function . Their estimator is consistent ([5]), asymptotically normal, and its pointwise rate of convergence depends on the bandwidth of the kernel (see [3]). They explain how to construct an adaptive estimator, but do not bound its risk.
Doumic et al. [16] and Hodara et al. [21] also consider multi-dimensional PDMPs but for very specific biological models.
Fujii [19] and Krell [24] both consider unidimensional PDMPs, and provide estimators of . [19] constructs an estimator of using a Rice formula, by estimating local times. He proves the consistency of his estimators. [24] considers a deterministic transition measure (so is a function of ). Her estimator of is a quotient of a kernel estimator of the stationary density of and an empirical estimator of another function with the parametric rate of convergence . This nonparametric estimator is asymptotically normal, and bounds for the pointwise risk are provided. In a very recent article, Azaïs and Genadot [2] construct a nonparametric estimator of for a multidimensional PDMP and prove its consistency.
This article is an extension of the work of [24]. We consider a wider class of models (in particular, the transition measure does not need to be deterministic any more). We bound the risk of the adaptive estimator, whereas [24] only considers the pointwise risk of the nonparametric estimator with fixed bandwidth . We also prove that our estimator is minimax (up to a factor).
For this purpose, in analogy with [24], we use the equality
[TABLE]
where is the stationary density of pre-jump locations (see Assumption A2 for the existence of this stationary density) and a function defined in equation (5). We get an estimator , which converges with rate . To estimate the density function , we use a projection method. We obtain a sequence of estimators of . Then we choose the "best" estimator by a penalization method, in the same way as Barron et al. [6], and give an oracle inequality for the adaptive estimator . The constant in the penalty term is intractable, but can be estimated using a slope heuristic. Finally, we construct a quotient estimator of , , and bound its $L^2$-risk. In Section 2, we specify the model and its assumptions. The main results are stated in Section 3. Proofs are gathered in Section 4 and in Appendix A for the technical results. In Appendix B, some simulations for the TCP protocol and the bacterial growth are provided, with various functions . The outcomes are consistent with the theoretical results.
2 PDMP
A piecewise deterministic Markov process (PDMP) is defined by its local characteristics, namely, the jump rate , the flow and the transition measure according to which the location of the process is chosen after the jump. In this article, we consider a unidimensional PDMP . More precisely,
Assumption A1.
a. The flow is a one-parameter group of homeomorphisms: is , for each , is a homeomorphism satisfying the semigroup property: and for each , is an increasing -diffeomorphism. This implies that .
b. The jump rate is a measurable function satisfying
[TABLE]
that is, the jump rate does not explode.
c. , .
For instance, we can take (linear flow) or (exponential flow). The transition measure may be continuous with respect to the Lebesgue measure or deterministic ().
Given these three characteristics, it can be shown (Davis [13, p62-66]), that there exists a filtered probability space such that the motion of the process starting from a point may be constructed as follows. Consider a random variable with survival function
[TABLE]
If is equal to infinity, then the process follows the flow, i.e. for , . Otherwise let the pre-jump location and the post-jump location. is defined through the transition kernel : . The trajectory of starting at , for , is given by
[TABLE]
Inductively starting from , we now select the next inter-jump time and post-jump location in a similar way. This construction properly defines a strong Markov process with jump times (where ). A very natural Markov chain is linked to , namely the jump chain (or, equivalently, ).
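The construction above can be sketched in a few lines for a TCP-like process. This is only an illustration under assumptions that are not taken from the text: a unit-speed linear flow (x, t) -> x + t, a jump rate lam(x) = x, and a deterministic transition x -> kappa * x with kappa = 0.5. With these choices the integrated rate along the flow from y to z is (z^2 - y^2) / 2, so the next pre-jump location can be drawn by inversion as z = sqrt(y^2 + 2E), with E a standard exponential variable. The function name `simulate_tcp_chain` is hypothetical.

```python
import math
import random


def simulate_tcp_chain(n_jumps, x0=1.0, kappa=0.5, seed=0):
    """Simulate the jump chain of a TCP-like PDMP (illustrative assumptions).

    Assumed model (not taken from the paper):
      flow        phi(x, t) = x + t
      jump rate   lam(x) = x
      transition  post-jump location = kappa * pre-jump location
    Returns (post, pre, waits): post[i] is the location right after jump i
    (post[0] = x0), pre[i] the location right before jump i + 1, and
    waits[i] the inter-jump time between them.
    """
    rng = random.Random(seed)
    post, pre, waits = [x0], [], []
    for _ in range(n_jumps):
        y = post[-1]
        # Survival function along the flow: exp(-((y + t)^2 - y^2) / 2).
        # Inversion: solve ((y + t)^2 - y^2) / 2 = E with E ~ Exp(1).
        e = rng.expovariate(1.0)
        z = math.sqrt(y * y + 2.0 * e)
        pre.append(z)
        waits.append(z - y)
        post.append(kappa * z)
    return post, pre, waits


post, pre, waits = simulate_tcp_chain(5)
for y, z, t in zip(post, pre, waits):
    print(f"start {y:.3f} -> flows for {t:.3f} -> jumps from {z:.3f}")
```

Because the flow is known, storing the post-jump locations and the inter-jump times is enough to recover the whole trajectory between jumps.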
To simplify the notations, let us set and . By (1),
[TABLE]
and by the change of variable (we recall that for any , is a monotonic function), we get
[TABLE]
If the function is finite, we obtain the conditional density:
[TABLE]
By analogy, we set .
Our aim is to estimate the jump rate on the compact interval .
The ergodicity is often a keystone in statistical inference for Markov processes. We also assume fast convergence toward the stationary density.
Assumption A2.
a. The jump rate does not explode before : for all , and .
b. The process is positive recurrent and strongly ergodic. We denote by the stationary measure of , by that of , by the stationary measure of the couple and by that of . We have that:
[TABLE]
c. There exist a function greater than 1 and two constants , such that, for any function , , for any integer :
[TABLE]
The inequality means that, for any , . This inequality is true in particular for any function bounded by 1 and for .
Under Assumption A2a, the conditional measure is continuous with respect to the Lebesgue measure on and . So is : . Moreover,
[TABLE]
We can also remark that, for any , .
Let us set . Under Assumption A2, the empirical mean is close to its expectation under the stationary density, as shown by the following lemma (proved in the Appendix).
Lemma 1.
Under Assumptions A1-A2, for any bounded function :
[TABLE]
and
[TABLE]
where and depends explicitly on . We can remark that .
In the bound of the variance, the first term is the same as for i.i.d. variables. The second term is due to covariance terms (a similar term appears for stationary $\beta$-mixing processes), and the third comes from the non-stationarity of the random vectors .
To study an adaptive estimator of , we need to prove that the Markov chain is weakly dependent. This is the case if the process is $\beta$-mixing.
Definition 2.
Let be a Markov process. Let us define the $\sigma$-algebra
[TABLE]
The $\beta$-mixing coefficient of the Markov chain is
[TABLE]
where is the joint law of an event on . The $\beta$-mixing coefficient characterizes the dependence between what happens before and what happens after . The process is $\beta$-mixing if . It is exponentially (or geometrically) $\beta$-mixing if there exist two positive constants , such that .
The following lemma is a consequence of Assumption A2. It is proved in the Appendix.
Lemma 3.
Under Assumptions A1-A2, the Markov chain is geometrically $\beta$-mixing. Moreover, its $\beta$-mixing coefficient satisfies: :
[TABLE]
Estimating directly is difficult, but we can construct a quotient estimator. By (2) and (3), we get that, for any ,
[TABLE]
and we integrate with respect to the stationary distribution of
[TABLE]
recalling that is the stationary measure of the couple . Let us set
[TABLE]
Then, if , we get:
[TABLE]
It remains to ensure that on .
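To see where the quotient comes from, it may help to write the computation in the simplest case. The derivation below is only a sketch under assumptions that are not taken from the text: a unit-speed linear flow $\phi(y,t) = y + t$, and the notation $\lambda$ for the jump rate, $\mu$ for the stationary law of the post-jump location $Y_0$, and $\nu$ for the stationary density of the next pre-jump location $Z_1$. For a general flow, the change of variable introduces an extra factor involving the derivative of the flow, which is not reproduced here.

```latex
% Special case: flow \phi(y,t) = y + t (assumption made for illustration).
\begin{align*}
\mathbb{P}(Z_1 > z \mid Y_0 = y)
  &= \exp\Big(-\int_y^z \lambda(u)\,du\Big), \qquad z \ge y, \\
f(y,z)
  &= \lambda(z)\exp\Big(-\int_y^z \lambda(u)\,du\Big)\mathbf{1}_{\{z \ge y\}}
   = \lambda(z)\,\mathbb{P}(Z_1 \ge z \mid Y_0 = y)\,\mathbf{1}_{\{y \le z\}}, \\
\nu(z)
  &= \int f(y,z)\,\mu(dy)
   = \lambda(z)\,\mathbb{P}_{\mu}(Y_0 \le z \le Z_1)
   =: \lambda(z)\,D(z), \\
\lambda(z) &= \frac{\nu(z)}{D(z)} \qquad \text{whenever } D(z) > 0 .
\end{align*}
```

In this special case, $D(z)$ is the stationary probability that an inter-jump segment passes through $z$, which is why a positivity condition on the denominator is needed before dividing.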
Assumption A3.
There exists such that
[TABLE]
Remark.
Assumption A3 is very natural; indeed, let us set . As is invertible, and is continuous, . Then
[TABLE]
If the probability is null, then under the stationary distribution, the probability that passes through is null and the jump rate at that point cannot be measured.
We can remark that if for some point , then so is and its estimator
[TABLE]
Then if we take an interval such that for some , and some observation , is positive on this interval, then Assumption A3 is satisfied on . However, the true value of is unknown in that case. It should be noted that the interval should not be changed for each simulation; otherwise the convergence of the estimator on the whole interval cannot be guaranteed: the interval of estimation would become larger and larger and, as is smaller near the edges of the new interval, the convergence of the estimator would be slower.
Assumptions A2 and A3 are not explicit in , so it is not easy to check that a particular model satisfies those assumptions. We give some explicit sufficient conditions on the coefficients . For the next assumption, we use the Hölder spaces , as defined in Appendix A.4.
Assumption (S).
a. The transition kernel is a contraction mapping: there exists , such that .
b. The flow is bounded: there exist two functions and such that, :
[TABLE]
c. The jump rate is positive on and there exists , such that
[TABLE]
Then , and .
d. The jump rate does not explode too soon: there exist two positive constants , such that and where
[TABLE]
These conditions ensure that Assumptions A2 and A3 are satisfied. The following two assumptions allow us to control the regularity of (the rate of convergence of the estimator depends on the regularity of , not on the regularity of ).
e. For any , . This ensures that and are continuous with respect to the Lebesgue measure on .
f. There exists such that:
- compact, , the function belongs to .
- compact, .
- The transition measure can be written
[TABLE]
with, for any compact , and in , and invertible functions such that .
If Assumption (S) is satisfied, for fixed flow and transition measure , we can introduce the class of functions
[TABLE]
with and the convex set
[TABLE]
is defined by the recurrence:
[TABLE]
The following lemmas are proved in the Appendix.
Lemma 4.
Under Assumptions A1 and (S):
a. Assumption A2 is satisfied for : there exists , , for any function ,
[TABLE]
recalling that the inequality means that, for any , .
b. Assumption A3 is satisfied. Moreover, there exists , such that
[TABLE]
Lemma 5.
If Assumptions A1 and (S) are satisfied, we can control the regularity of :
[TABLE]
Remark.
In [24], the author introduces the set of functions with very similar conditions. As she considers a deterministic transition measure, the sets and may not be equal. In particular, if , then there exists such that . On the contrary, if belongs to and the deterministic transition is , then for large enough, there exists such that . This is no longer the case if, for instance, . As the transition measure is unknown, it is not possible to exploit its characteristics.
Another difference between the two sets is that is estimated on the fixed interval and the assumptions depend on , whereas in [24], the interval of estimation depends on the set .
3 Estimation of the jump rate
3.1 The observation scheme
As in [3] and [24], the statistical inference is based on the observation scheme and asymptotics are considered when the number of jumps of the process, , goes to infinity. Actually the simpler observation scheme: is sufficient, as is known and one can remark that for all , .
3.2 Methodology
[24] and [3] construct a pointwise kernel estimator of before deriving an estimator of . Indeed, densities are often approximated by kernel methods (see Tsybakov [30] for instance). If the kernel is positive, the estimator is also a density. However, we want to control the $L^2$ risk of our estimator (not the pointwise risk), and also to construct an adaptive estimator. Projection estimators are well suited to $L^2$ estimation: although they take longer to compute at a single point than pointwise estimators, knowing the estimated coefficients is enough to construct the whole function. Furthermore, to find an adaptive estimator, we minimize a function of the $L^2$ norm of our estimator (that is, the sum of the squares of the coefficients) and of the dimension. This is why we choose estimation by projection.
We first aim at estimating on the compact set . We construct a sequence of estimators by projection on an orthonormal basis. As usual in nonparametric estimation, their risks can be decomposed into a variance term and a bias term which depends on the regularity of the density function . We choose to use the Besov spaces (see Section A.4) to characterize the regularity, which are well adapted to estimation (particularly for the wavelet decomposition). The "best" estimator is then selected by penalization. To construct the sequence of estimators, we introduce a sequence of vectorial subspaces . We construct an estimator of on each subspace and then select the best estimator .
Assumption A4.
a. The subspaces are increasing and have finite dimension .
b. The $L^2$-norm and the $L^\infty$-norm are connected:
[TABLE]
This implies that, for any orthonormal basis of ,
[TABLE]
c. There exists a constant such that, for any , there exists an orthonormal basis such that:
[TABLE]
d. There exists , called the regularity of the decomposition, such that:
[TABLE]
where is the orthogonal projection of on and is a Besov space (see Appendix A.4).
Conditions a, b and d are usual (see Comte et al. [12, section 2.3] for instance). They are satisfied for subspaces generated by wavelets, piecewise polynomials or trigonometric polynomials (see DeVore and Lorentz [15] for trigonometric polynomials and piecewise polynomials and Meyer [27] for wavelets). Condition c is necessary because we are not in the stationary case: it helps us to control some covariance terms. It is obviously satisfied for bounded bases (trigonometric polynomials), and localized bases (piecewise polynomials). Let us prove it for a wavelet basis. Let be a father wavelet function, then and . We get that . As is at least 0-regular, for , there exists a constant such that . Then and condition c is satisfied.
3.3 Estimation of the stationary density
Let us now construct an estimator of on the vectorial subspace . We consider an orthonormal basis of satisfying Assumption A4. Let us set
[TABLE]
The function is the orthogonal projection of on . We consider the estimator
[TABLE]
Proposition 6.
If , under Assumptions A1-A2 and A4,
[TABLE]
where and depends explicitly on , , .
When increases, the bias term decreases whereas the variance term increases. It is important to find a good bias-variance compromise. If belongs to the Besov space , then (see Assumption A4d). If , the risk is then minimum for and we have, for some continuous function :
[TABLE]
This is the usual nonparametric convergence rate (see Tsybakov [30]). If , then the risk is minimum for and the bias term is greater than the variance term. We can remark that a piecewise continuous function belongs to .
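The projection estimator described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the basis is assumed to be the orthonormal histogram basis (piecewise polynomials of degree 0) on an interval [a, b], and all function names are hypothetical. Each coefficient is the empirical mean of a basis function over the observed pre-jump locations, and the estimator is the corresponding linear combination.

```python
import math


def histogram_basis(a, b, dim):
    """Orthonormal piecewise-constant basis of L2([a, b]) with `dim` bins.

    Returns a function mapping x to its bin index (or None outside [a, b])
    and the common height of the basis functions.
    """
    width = (b - a) / dim
    height = 1.0 / math.sqrt(width)  # gives each basis function unit L2 norm

    def bin_index(x):
        if x < a or x > b:
            return None
        return min(int((x - a) / width), dim - 1)

    return bin_index, height


def projection_coefficients(sample, a, b, dim):
    """Empirical coefficients a_hat[j] = (1/n) * sum_i phi_j(sample[i])."""
    bin_index, height = histogram_basis(a, b, dim)
    coeffs = [0.0] * dim
    for z in sample:
        j = bin_index(z)
        if j is not None:
            coeffs[j] += height
    n = len(sample)
    return [c / n for c in coeffs]


def projection_estimate(x, coeffs, a, b):
    """Evaluate the estimator sum_j a_hat[j] * phi_j(x) at the point x."""
    bin_index, height = histogram_basis(a, b, len(coeffs))
    j = bin_index(x)
    return 0.0 if j is None else coeffs[j] * height


sample = [0.1, 0.2, 0.6, 1.5]  # toy data; only three points lie in [0, 1]
coeffs = projection_coefficients(sample, 0.0, 1.0, dim=2)
print(projection_estimate(0.25, coeffs, 0.0, 1.0))
print(projection_estimate(0.75, coeffs, 0.0, 1.0))
```

In the toy example, two of the four points fall in [0, 0.5), so the estimate on that bin is (2/4)/0.5 = 1, up to floating-point rounding. Increasing `dim` reduces the bias and increases the variance, which is the compromise discussed above.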
Let us now construct the adaptive estimator. We compute for . Our aim is to select automatically , without knowing the regularity of the stationary density . Let us introduce the contrast function . If , then we can write and
[TABLE]
The minimum is obtained for . Therefore
[TABLE]
As the subspaces are increasing, the function decreases when increases. To find an adaptive estimator, we need to add a penalty term . Let us set (or more generally , with , ) and choose
[TABLE]
We obtain an adaptive estimator .
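The selection step can be sketched as follows. Everything specific here is an assumption made for illustration: the models are dyadic histograms on [a, b], the contrast of each projection estimator is taken to be minus the sum of its squared coefficients, the penalty is kappa * D_m / n with an arbitrary kappa = 2, and the data are an i.i.d. Beta(2, 5) sample rather than a Markov chain of pre-jump locations. In practice the penalty constant would have to be calibrated, for instance by a slope heuristic.

```python
import math
import random


def hist_coefficients(sample, a, b, dim):
    """Empirical coefficients on the orthonormal histogram basis of [a, b]."""
    width = (b - a) / dim
    height = 1.0 / math.sqrt(width)
    coeffs = [0.0] * dim
    for z in sample:
        if a <= z <= b:
            coeffs[min(int((z - a) / width), dim - 1)] += height
    return [c / len(sample) for c in coeffs]


def select_dimension(sample, a, b, max_level, kappa=2.0):
    """Return (best_dim, criteria) over dyadic histogram models.

    criteria[D] = -sum_j a_hat[j]**2 + kappa * D / n, with D = 2**level.
    `kappa` is a tuning constant chosen arbitrarily for this sketch.
    """
    n = len(sample)
    criteria = {}
    for level in range(max_level + 1):
        dim = 2 ** level
        coeffs = hist_coefficients(sample, a, b, dim)
        contrast = -sum(c * c for c in coeffs)
        criteria[dim] = contrast + kappa * dim / n
    best_dim = min(criteria, key=criteria.get)
    return best_dim, criteria


rng = random.Random(0)
sample = [rng.betavariate(2, 5) for _ in range(2000)]
best_dim, criteria = select_dimension(sample, 0.0, 1.0, max_level=8)
print("selected dimension:", best_dim)
```

The contrast alone always favours the largest model, since the subspaces are nested; the penalty is what stops the selected dimension from growing with the noise.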
Theorem 7 (Risk of the adaptive estimator).
Under Assumptions A1-A2 and A4, , , ,
[TABLE]
where is a function of . We recall that .
The estimator is adaptive: it realizes the best bias-variance compromise, up to a multiplicative constant. We have an explicit rate of convergence if belongs to some (unknown) Besov space : in that case,
[TABLE]
and if ,
[TABLE]
for some continuous function .
3.4 Estimation of the jump rate
By (6), we have
[TABLE]
where is the stationary measure of .
Remark.
We notice that this formula is different from the one used in [24]
[TABLE]
where
[TABLE]
In [24], the author works under the assumption that , which makes the study easier; here we need to consider the Markov chain .
To estimate the jump rate, we construct a quotient estimator. Let us consider the estimator
[TABLE]
where
[TABLE]
Remark.
As the process is observed continuously without errors, (and therefore ) is known on so is computable.
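A quotient estimator of this kind can be sketched on simulated data. The sketch rests on assumptions that are not taken from the text: a unit-speed linear flow x + t, a jump rate lam(x) = x, a deterministic transition x -> kappa * x, a histogram projection estimate of the pre-jump density, and a denominator equal to the fraction of inter-jump segments that pass through x (the form the denominator takes for this particular flow). The truncation level `floor` and all function names are hypothetical and do not reproduce the exact definition used in the paper.

```python
import math
import random


def simulate_chain(n, kappa=0.5, x0=1.0, seed=0):
    """Return n pairs (start, end) of inter-jump segments.

    Assumed model (illustration only): flow x + t, jump rate lam(x) = x,
    deterministic transition x -> kappa * x.
    """
    rng = random.Random(seed)
    pairs, y = [], x0
    for _ in range(n):
        # Inversion of the survival function exp(-(z^2 - y^2) / 2).
        z = math.sqrt(y * y + 2.0 * rng.expovariate(1.0))
        pairs.append((y, z))
        y = kappa * z
    return pairs


def bin_of(x, a, b, dim):
    """Index of the histogram bin of [a, b] containing x, or None."""
    if not a <= x <= b:
        return None
    return min(int((x - a) * dim / (b - a)), dim - 1)


def density_estimate(x, pairs, a, b, dim):
    """Histogram projection estimate of the pre-jump density at x."""
    j = bin_of(x, a, b, dim)
    if j is None:
        return 0.0
    count = sum(1 for _, z in pairs if bin_of(z, a, b, dim) == j)
    return count * dim / (len(pairs) * (b - a))


def crossing_frequency(x, pairs):
    """Fraction of inter-jump segments [start, end] that pass through x."""
    return sum(1 for y, z in pairs if y <= x <= z) / len(pairs)


def jump_rate_estimate(x, pairs, a, b, dim, floor=0.05):
    """Quotient estimate; returns 0 when the denominator is below `floor`."""
    d_hat = crossing_frequency(x, pairs)
    if d_hat < floor:
        return 0.0
    return density_estimate(x, pairs, a, b, dim) / d_hat


pairs = simulate_chain(20000)
for x in (0.75, 1.05, 1.55):
    est = jump_rate_estimate(x, pairs, a=0.5, b=2.0, dim=15)
    print(f"x = {x:.2f}  estimate = {est:.3f}  assumed true rate = {x:.2f}")
```

Truncating when the denominator is small avoids dividing by a quantity that is close to zero, at the cost of returning 0 where the process is rarely observed.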
The estimator converges with nearly the same rate of convergence as :
Theorem 8.
[TABLE]
where
[TABLE]
The bias term depends on the regularity of the stationary density , not on the regularity of . If we consider and as functions of a Besov space, their regularities are not related: the Besov spaces are not stable under product (as they are subspaces of ). We would like to link the rate of convergence of to the regularity of rather than , at least when . In that case, belongs to some Hölder space, which is stable under product, composition and integration. See Appendix A.4 for the definition and properties of Besov and Hölder spaces. We obtain the following corollary:
Corollary 9.
Under A1, (S) and A4, as soon as , for any ,
[TABLE]
Remark.
[24] obtains the same rate of convergence for a kernel estimator (with the regularity of known).
3.5 Minimax bound for the estimator of the jump rate
We have proved that, under assumptions A1, (S) and A4,
[TABLE]
We would like to verify that our estimator converges with the minimax rate of convergence, i.e:
[TABLE]
The factor comes from the quotient estimator; we cannot expect it to remain in the minimax bound. Indeed, it is clear that one could replace in (11) by any function greater than . The best estimator would of course be obtained by taking , and the risk of this estimator (unreachable, as is unknown) would be proportional to .
Theorem 10 (Minimax bound).
If A1, (S) and A4 are satisfied, then
[TABLE]
where the infimum is taken among all estimators.
4 Proofs
Lemmas 1, 3, 4 and 5 are proved in the Appendix.
4.1 Proof of Proposition 6
We have the following bias-variance decomposition:
[TABLE]
The estimator (and therefore its expectation ) belongs to the subspace . Then, by orthogonality
[TABLE]
The first two terms are bias terms and the third is a variance term. Let us first bound the second bias term. As the functions form an orthonormal basis of , we have
[TABLE]
By Lemma 1,
[TABLE]
As the $L^2$- and $L^\infty$-norms are connected (see Assumption A4b), and, since , we get:
[TABLE]
Let us now consider the variance term. As the functions form an orthonormal basis of , the integrated variance of is the sum of the variances of the coefficients :
[TABLE]
By Lemma 1, as , we get:
[TABLE]
By Assumptions A4b and c, , , and . Therefore:
[TABLE]
where and depends only on , and .
4.2 Proof of Theorem 7
The number of coefficients in the adaptive estimator is random. Although we can still easily control the bias term, we cannot simply control the variance of our estimator by adding the variances of its coefficients. For any , by definition of (see (8) and (9)), we have the following inequality:
[TABLE]
with . Then
[TABLE]
We have that, for any function , . We apply this equality to and . Equation (13) becomes:
[TABLE]
The function belongs to the vectorial subspace . Therefore:
[TABLE]
where . As the sequence is increasing, is simply the largest of the two subspaces. By the inequality of arithmetic and geometric means,
[TABLE]
By the triangle inequality, , and:
[TABLE]
We can decompose the last term in a bias term and a variance term. Let us set:
[TABLE]
and . Then:
[TABLE]
By Assumption A4b, implies that (we recall that and are smaller than ). Then by Lemma 1,
[TABLE]
It remains to bound . The unit ball is random, so we cannot bound on it directly; we have to control the risk on the fixed balls . We can write:
[TABLE]
The Markov chain is exponentially $\beta$-mixing with $\beta$-mixing coefficient . The following lemma is deduced from Berbee’s coupling lemma and a Talagrand inequality. It is proved in the appendix.
Lemma 11 (Talagrand’s inequality for $\beta$-mixing variables).
Let be an exponentially $\beta$-mixing Markov chain, with $\beta$-mixing coefficient . We choose with , . We have that . Let us consider
[TABLE]
If we can find a triplet (, and ) such that:
[TABLE]
then we have:
[TABLE]
where , , and are universal constants.
For the sake of simplicity, let us set and . By Assumption A4b,
[TABLE]
By Lemma 1,
[TABLE]
By Cauchy-Schwarz,
[TABLE]
and
[TABLE]
Then
[TABLE]
By Assumption A4b, , moreover, and then
[TABLE]
It remains to find such that . Let us introduce an orthonormal basis of satisfying Assumption A4. Then we can write As the function is linear:
[TABLE]
We can remark that (see equation (14)) and consequently, . By (12):
[TABLE]
We can now apply Lemma 11 with
[TABLE]
For , we get
[TABLE]
As ,
[TABLE]
and therefore
[TABLE]
where and depend on . The second term can be made smaller than for large enough. The third is also smaller than thanks to the exponential term. Then
[TABLE]
All the dimensions are different, so . Moreover, as , . Then by (17),
[TABLE]
Collecting (15), (16) and (18), for any :
[TABLE]
All the constants involved in the bound of and (, , ) depend on , , and . Then there exists a continuous function such that is increasing and
[TABLE]
4.3 Proof of Theorem 8
Let us first control . As is a diffeomorphism, the function is bounded on . The function is bounded by a constant on :
[TABLE]
We have that
[TABLE]
with the stationary density of introduced in Assumption A2. By Lemma 1, we have
[TABLE]
and therefore
[TABLE]
For large enough, is smaller than ( is defined in Assumption A3) and then by Markov's inequality,
[TABLE]
As is a positive function, and therefore, according to the definition of the estimator (see (11)),
[TABLE]
We can write:
[TABLE]
As by Assumption A3:
[TABLE]
[TABLE]
with .
4.4 Proof of Theorem 10
We use the reduction scheme described in Tsybakov [30, chapter 2]. By Markov's inequality,
[TABLE]
Our aim is to show that
[TABLE]
Instead of searching for an infimum over the whole class , we can limit ourselves to the finite set , such that
[TABLE]
Then
[TABLE]
We note the predictor
[TABLE]
By the triangle inequality, .
Consequently, as ,
[TABLE]
By (22), . Then setting ,
and therefore:
[TABLE]
We denote by the law of under . The following lemma is exactly Theorem 2.5 of Tsybakov [30].
Lemma 12.
Let us consider a family of functions such that:
a. The functions are sufficiently apart:
[TABLE]
b. For all , the function belongs to the subspace .
c. Absolute continuity: , .
d. The distance between the probability measures is not too large:
[TABLE]
with , and the $\chi$-square divergence.
Then
[TABLE]
Step 1: Construction of .
Let us set
[TABLE]
with defined in (7). As is constant on , this function belongs to the Hölder space and (see Appendix A.4 for the definition of the Hölder space). It remains to ensure that it belongs to . If , then . If , then any function satisfies: . Let us assume that : in that case, there exists such that .
We consider a non-negative function , bounded, with support in and such that . We set , and, for , . We consider the functions with . The functions have support in . Moreover, by a change of variable , and . We consider the set of functions
[TABLE]
The cardinality of is . For two vectors with values in , the distance between two functions and is:
[TABLE]
As the series and have values in , the quantity
[TABLE]
is the Hamming distance between and . To apply Lemma 12, we need that, ,
[TABLE]
This is not the case if we take the whole (the minimal Hamming distance between two vectors and is 1). We need to extract a sub-family of functions. According to Tsybakov [30, Lemma 2.7] (Varshamov-Gilbert bound), it is possible to extract a family of the set such that and
[TABLE]
As ,
[TABLE]
We define
[TABLE]
Then, for any , if , as , by (23),
[TABLE]
This is exactly the expected lower bound if we take .
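The role of the Hamming distance in Step 1 can be illustrated numerically. The sketch below is a brute-force illustration and not the argument used in the proof: the Varshamov-Gilbert bound is an existence result, stating that for m >= 8 there is a family of binary vectors of length m, containing the zero vector, with pairwise Hamming distance at least m/8 and more than 2^(m/8) elements. Here such a family is built greedily for a small m; the function names are hypothetical.

```python
def hamming(u, v):
    """Hamming distance between two bit vectors encoded as integers."""
    return bin(u ^ v).count("1")


def greedy_separated_family(m):
    """Greedily pick vectors of {0,1}^m with pairwise distance >= m / 8.

    Vectors are scanned in increasing integer order, so the all-zero
    vector is always picked first.
    """
    family = []
    for candidate in range(2 ** m):
        if all(8 * hamming(candidate, w) >= m for w in family):
            family.append(candidate)
    return family


m = 10
family = greedy_separated_family(m)
print(len(family), "vectors kept out of", 2 ** m)
print("size guaranteed by the bound: more than", 2 ** (m / 8))
```

The greedy family is typically much larger than the guaranteed size; only the guaranteed size is needed for the lower bound.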
Step 2: Functions belong to .
We already know that belongs to . Let us first compute the norm of on . Let us set . We have that . We compute the modulus of smoothness:
[TABLE]
and
[TABLE]
by the change of variable . The functions have disjoint supports. For any , there exists such that and . Then
[TABLE]
Therefore
[TABLE]
and . Moreover,
[TABLE]
and consequently . Then for sufficiently small. It remains to check that . For any :
a. As is non-negative, , .
b. for small enough.
c. .
Therefore for small enough.
Step 3: Absolute continuity.
We denote by the transition densities . As is a Markov process,
[TABLE]
By (3), we can rewrite: where
[TABLE]
and where , and
[TABLE]
The probability density is null if one of the is null, if one of the indicator functions , or if one is smaller than ; hence is absolutely continuous with respect to .
Step 4: The $\chi^2$ divergence.
As are equivalent measures, we have:
[TABLE]
Let us set . We can write:
[TABLE]
As is the transition density, for any , . Moreover, as ,
[TABLE]
This expression of the divergence enables us to approximate it more closely. Let us set
[TABLE]
As the support of is included in , we can remark that is null on and
[TABLE]
We bound the $\chi^2$-divergence differently on and : where
[TABLE]
We have that and therefore, as ,
[TABLE]
By (26), we obtain, as the functions are supported in :
[TABLE]
and, as ,
[TABLE]
Then by (30) and as
[TABLE]
As on , we get by (25) that
[TABLE]
Moreover, on , . Then by (29) and (30), we get that
[TABLE]
Therefore and, by (27) and (28), we get by induction
[TABLE]
As ,
[TABLE]
By (24), and therefore,
[TABLE]
for small enough, which concludes the proof.
Acknowledgements
N. Krell was partly supported by the Agence Nationale de la Recherche PIECE 12-JS01-0006-01. The research of E. Schmisser was supported in part by the Labex CEMPI (ANR-11-LABX-0007-01).
Appendix A Technical proofs and results
A.1 Proof of Lemma 1
We consider a function such that ; we obtain the expected result by dividing by its -norm. According to Assumption A2,
[TABLE]
which proves the first inequality. Let us set . We have:
[TABLE]
We notice that:
[TABLE]
by Assumption A2. Therefore
[TABLE]
Let us bound the last term of (31). We can remark that is an inhomogeneous Markov chain. Therefore, for any , and by Assumption A2,
[TABLE]
Then
[TABLE]
As ,
[TABLE]
By Assumption A2, for any function ,
[TABLE]
Then
[TABLE]
[TABLE]
where depends only on , and and we recall that .
A.2 Proof of Lemma 3
Let be an event of . Then is a disjoint union of events where
[TABLE]
with and subsets of and . Then
[TABLE]
As is a Markov chain,
[TABLE]
To simplify the notations, let us set
[TABLE]
and . Then
[TABLE]
We regroup the :
[TABLE]
where . We can remark that and by the law of total probability, . We can apply Assumption A2 to the function :
[TABLE]
Then
[TABLE]
By Lemma 1,
[TABLE]
Therefore
[TABLE]
As ,
[TABLE]
with , .
A.3 Proof of Lemma 4
A.3.1 Assumption A2 is satisfied
Assumption (S)d implies Assumption A2a. To prove Assumption A2b and c, in analogy with [24], we apply the following result, which is Theorem 1.1 of Baxendale [7] written for a Markov chain on and a finite measure instead of a probability.
Result (Sufficient conditions for ergodicity).
Let us consider a homogeneous Markov chain on with transition probability . Under the following three conditions,
Minorization condition
There exist a set and a finite measure such that ,
[TABLE]
Strong aperiodicity condition
.
Drift condition
There exists a function and two constants , such that
[TABLE]
Then the process is positive recurrent and strongly ergodic, and has a unique stationary probability measure .
Moreover, there exists and depending only on , and such that, for any function ,
[TABLE]
Then Assumptions A2b and c are satisfied.
Let us check that its three conditions (minorization, strong aperiodicity and drift) are satisfied. We need to control the transition density. As is an (inhomogeneous) Markov chain, let us note
[TABLE]
Let us set
[TABLE]
Minorization condition
Let us set . For any , any , by Assumption (S)b, we have that
[TABLE]
By Assumption (S)d and c, for any ,
[TABLE]
and is a finite measure.
Strong aperiodicity condition
[TABLE]
For any , by Assumption (S) a, . Therefore
[TABLE]
Then .
Drift condition
For any , as is an (inhomogeneous) Markov chain,
[TABLE]
where . By Assumption (S)a, as is an increasing function, . Then by (2) and Assumption (S)b,
[TABLE]
Let us make the change of variable , then and
[TABLE]
Let us first bound this quantity for . By Assumption (S)c, for any ,
[TABLE]
As , and, for any ,
[TABLE]
We have that
[TABLE]
Therefore, for any , as is an increasing function,
[TABLE]
Then
[TABLE]
Moreover, by (34),
[TABLE]
and by (35),
[TABLE]
Therefore the three conditions (minorization, strong aperiodicity and drift) are satisfied, which gives Assumption A2.
A.3.2 Assumption A3 is satisfied
It remains to prove that Assumption A3 is satisfied. We recall that
[TABLE]
By equation (5), for any ,
[TABLE]
By equation (2), Assumption (S)b and d, for any ,
[TABLE]
Then
[TABLE]
It remains to bound away from 0.
As is the stationary density of , . Therefore, by Markov's inequality, as is an increasing function,
[TABLE]
[TABLE]
As , and is an increasing function, there exists , and consequently, . Let us consider the sequence
[TABLE]
where . We can remark that
[TABLE]
As is the stationary density, for any ,
[TABLE]
As , and by (2),
[TABLE]
as . By Assumption (S)b and c, and are bounded from below and there exists a constant such that
[TABLE]
Therefore, as ,
[TABLE]
Let us set . We can note that
[TABLE]
and in particular, By induction, we obtain:
[TABLE]
Then by (37)
[TABLE]
which concludes the proof.
A.4 Besov and Hölder spaces
Definition 13 (Modulus of continuity).
The modulus of continuity is defined by
[TABLE]
If is Lipschitz, the modulus of continuity is proportional to . If , then is constant: the modulus of continuity cannot measure higher smoothness.
Definition 14 (Modulus of smoothness).
If is a function on , we define its modulus of smoothness by
[TABLE]
We can remark that if is , then
[TABLE]
In particular, if with compact and if is Lipschitz, then . If is -Hölder-continuous, that is if , , then
[TABLE]
If is piecewise-continuous and -Hölder on the points of continuity, then
[TABLE]
The modulus of continuity and the modulus of smoothness are sublinear:
[TABLE]
Definition 15 (Besov space).
The Besov space is the set of functions:
[TABLE]
where . The norm is defined by: . We denote .
See DeVore and Lorentz [15] and Meyer [27] for more details. We use the Besov space to control the risk of the estimator of the stationary density .
Definition 16 (Hölder space).
The Hölder space is the set of functions:
[TABLE]
where . We note and define the norm of the Hölder space and .
As noted before, : the Hölder space is included in which itself is included in . We can remark that if a function is and piecewise , it belongs to but only to .
A.5 Proof of Lemma 5
As is the stationary distribution of , by (4) and (3), we have, for all :
[TABLE]
with
[TABLE]
As the Hölder spaces are stable under multiplication, composition and integration, has the same regularity as and . We have that
[TABLE]
Let us set If is differentiable, we get:
[TABLE]
and if belongs to , there exist such that :
[TABLE]
It remains to study the regularity of the function .
We consider some particular transition measures in order to understand how the regularity of (and ) depends on the form and the regularity of .
Continuous transition measure
There exists a function such that , and we can write
[TABLE]
Moreover, as if , with , we get
[TABLE]
Furthermore, by definition of the Hölder semi-norm, for
[TABLE]
Then .
Deterministic transition measure
Let us assume that can be written with a bijection. As , . Then we have that
[TABLE]
If is differentiable:
[TABLE]
So we get:
[TABLE]
The regularity of on depends on the regularity of on and of and on . By induction, there exists a function such that
[TABLE]
where
[TABLE]
If is not a bijection (and ), then can be less regular than . Let us consider . Then
[TABLE]
Then is a piecewise constant function and is not differentiable. We can remark that
[TABLE]
is not differentiable.
If (which implies that the vectors are independent), then has the same regularity as . We can remark that is .
General case
Under Assumption (S),
[TABLE]
with invertible, therefore
[TABLE]
and
[TABLE]
Therefore, there exists a function such that
[TABLE]
As and , then , and there exists a continuous function such that
[TABLE]
which ends the proof.
A.6 Proof of Talagrand’s inequality for beta-mixing variables
The following lemma makes it possible to replace weakly dependent variables by variables that are independent by blocks. It is proved by Viennet [31, proof of Proposition 5.1].
Lemma 17 (Berbee’s coupling lemma).
The random variables are exponentially -mixing. Let us set where characterizes the -mixing coefficient (see Definition 2). We have that . We set . There exist random vectors such that:
- and have the same law.
- The random vectors are independent, as are the random vectors .
- For any integer , , .
Let us set . Then
[TABLE]
The following inequality comes from Talagrand’s inequalities (see Birgé and Massart [8, Corollary 2 p354]).
Lemma 18 (Talagrand’s inequality).
Let be independent random variables and a linear subspace of finite dimension satisfying Assumption 4. We denote by a countable family of . Let us set
[TABLE]
with . If
[TABLE]
then
[TABLE]
where is a universal constant and .
Proof of Lemma 18.
We apply Theorem 1.1 of Klein and Rio [23] to the functions (notation used in Theorem 1.1 of Klein and Rio [23]). We obtain that
[TABLE]
We modify this inequality following Corollary 2 of Birgé and Massart [8]. It gives:
[TABLE]
The end of the proof follows Comte and Merlevède [11, p222-223]. ∎
Proof of Lemma 11.
To deduce Lemma 11, we simply apply Berbee’s coupling lemma to exponentially -mixing variables, and then Talagrand’s inequality. Indeed, by Berbee’s coupling lemma, as and have the same law:
[TABLE]
We first bound the second part of the sum . We have:
[TABLE]
By the Cauchy-Schwarz inequality, and by Berbee’s coupling lemma,
Let us now bound the first term . We have
[TABLE]
where and . The random variables are independent, and so are the . Moreover, and . Let us set
[TABLE]
We have: . Then,
[TABLE]
As the dimension of is finite, we can find a countable family dense in , and we can then apply Talagrand’s inequality to and , which concludes the proof.
∎
Appendix B: Simulations
For the simulations, two classical PDMPs are considered: the TCP window size and the size of a marked bacterium.
TCP protocol.
The transmission control protocol (TCP) is one of the main data transmission protocols on the Internet. The maximum number of packets that can be sent at time in a round is a random variable . If the transmission is successful, then the maximum number of packets is increased by one: . If the transmission fails, then with . A correct scaling of this process leads to a piecewise deterministic Markov process with the characteristics:
[TABLE]
Then the function is constant: . Let us denote by a primitive of . By (2), we have:
[TABLE]
As is positive, its primitive is invertible and by a change of variable:
[TABLE]
Then follows an exponential law translated by and of parameter . Therefore, if we can find the inverse of the function , we can construct the sequence recursively:
[TABLE]
where are i.i.d. of law .
If with , then and we obtain
[TABLE]
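The recursion above can be sketched in a few lines. This is only an illustration under stated assumptions, since the exact characteristics are not reproduced here: the flow is taken to be linear, `phi(x, t) = x + t`, a jump maps `x` to `kappa * x`, and the jump rate is the hypothetical power law `lambda(x) = x**a`, whose primitive `x**(a+1) / (a+1)` is easy to invert. The function name and its parameters are illustrative, not the authors' code.

```python
import numpy as np


def simulate_tcp(n_jumps, a=1.0, kappa=0.5, z0=1.0, seed=0):
    """Simulate a TCP-type PDMP at its jump times.

    Assumptions (illustrative, not taken from the paper's formulas):
    flow phi(x, t) = x + t, jump x -> kappa * x, rate lambda(x) = x**a.
    """
    rng = np.random.default_rng(seed)
    exp_draws = rng.exponential(1.0, size=n_jumps)  # i.i.d. Exp(1) variables
    post = np.empty(n_jumps + 1)  # locations just after each jump
    pre = np.empty(n_jumps)       # locations just before each jump
    post[0] = z0
    for k in range(n_jumps):
        # Primitive of the rate: Lambda(x) = x**(a+1) / (a+1).
        # The pre-jump location y solves Lambda(y) = Lambda(post[k]) + E_k.
        pre[k] = (post[k] ** (a + 1) + (a + 1) * exp_draws[k]) ** (1.0 / (a + 1))
        post[k + 1] = kappa * pre[k]
    return post, pre
```

Under the assumed linear flow, the inter-jump times are simply `pre - post[:-1]`.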
This model satisfies Assumption (S). In order to have a model with a non-increasing function , we also consider the function with , . In that case, by (38),
[TABLE]
and, by Cardano’s formula, this equation has a unique real solution, which is
[TABLE]
where . This model also satisfies Assumption (S).
Bacterial growth.
We choose a bacterium at random and follow its growth until it divides into two roughly equal parts. We then choose one of its daughters at random, and so on. Between jumps, the bacterium grows exponentially. At a jump, its size is roughly divided by two. We model this by setting , where is a random variable independent of , in , and centered at . The Beta distribution satisfies these conditions. For , it is the uniform distribution, and when increases, the distribution is more concentrated around . We choose . Then
[TABLE]
Then and by (2),
[TABLE]
We need to find a primitive of . If , , then:
[TABLE]
Therefore
[TABLE]
and the law of the random variable is an exponential translated by and of parameter . Then
[TABLE]
with i.i.d. and i.i.d. All the conditions of Assumption (S) are satisfied, except point a. Indeed, , but there does not exist any such that . However, in the simulations, it seems that the process is ergodic and that A3 is satisfied.
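A minimal sketch of this growth-fragmentation recursion follows, under assumptions that are ours rather than the paper's: the flow is `phi(x, t) = x * exp(growth * t)`, the jump rate is the hypothetical power law `lambda(x) = x**a`, and the size is multiplied at each jump by a symmetric Beta variable with shape `beta_shape`. With these choices the change of variable `u = x * exp(growth * s)` gives the integrated rate `(y**a - x**a) / (a * growth)` between `x` and the pre-jump size `y`, which is set equal to an Exp(1) draw.

```python
import numpy as np


def simulate_growth(n_jumps, a=1.0, growth=1.0, beta_shape=2.0, z0=1.0, seed=0):
    """Simulate a bacterial-growth PDMP at its jump times.

    Assumptions (illustrative): flow x * exp(growth * t), rate lambda(x) = x**a,
    multiplicative jump by U ~ Beta(beta_shape, beta_shape), centered at 1/2.
    """
    rng = np.random.default_rng(seed)
    exp_draws = rng.exponential(1.0, size=n_jumps)           # i.i.d. Exp(1)
    fractions = rng.beta(beta_shape, beta_shape, size=n_jumps)  # i.i.d. Beta
    post = np.empty(n_jumps + 1)  # sizes just after each division
    pre = np.empty(n_jumps)       # sizes just before each division
    post[0] = z0
    for k in range(n_jumps):
        # Integrated rate from post[k] to y is (y**a - post[k]**a) / (a * growth);
        # setting it equal to E_k and solving for y gives the pre-jump size.
        pre[k] = (post[k] ** a + a * growth * exp_draws[k]) ** (1.0 / a)
        post[k + 1] = fractions[k] * pre[k]
    return post, pre
```

With `beta_shape=1.0` the division fraction is uniform on (0, 1); larger values concentrate it around 1/2, as described above.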
Computations
For the two models, has a density with respect to the Lebesgue measure on , so it can be estimated on any compact interval , here to avoid edge effects. The estimator is computed by projection on a trigonometric basis. The constant involved in , cpen, should be greater than , with . The problem is that , a correlation term, is not easily tractable. We set cpen= for all models. This choice seems to be confirmed by the simulation results: the oracle remains close to 1.
The constant cpen could be determined via the slope heuristic. Indeed, if the constant in the penalty is too small, the algorithm selects the maximal dimension. If the penalty is large enough, it selects models of reasonable size. We then let the constant in the penalty vary and note the dimension selected. For smaller than a value , the largest models are selected, and for greater than , smaller models are chosen. The "best" constant is . See Arlot and Massart [1] for instance.
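As a rough illustration of a penalized projection estimator on a trigonometric basis, the sketch below selects the dimension by minimizing the empirical contrast plus a penalty. It assumes a penalty of the form `cpen * dimension / n` and nested models of odd dimension; the function names, the default arguments and the penalty shape are assumptions made for this example, not the authors' implementation, and `cpen` is left as a free parameter.

```python
import numpy as np


def trig_basis(x, lo, hi, n_freq):
    """Orthonormal trigonometric basis on [lo, hi], evaluated at the points x."""
    length = hi - lo
    t = 2.0 * np.pi * (np.atleast_1d(np.asarray(x, dtype=float)) - lo) / length
    cols = [np.full_like(t, 1.0 / np.sqrt(length))]
    for j in range(1, n_freq + 1):
        cols.append(np.sqrt(2.0 / length) * np.cos(j * t))
        cols.append(np.sqrt(2.0 / length) * np.sin(j * t))
    return np.column_stack(cols)


def fit_density(sample, lo, hi, max_freq, cpen):
    """Projection density estimator with penalized dimension selection."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    inside = sample[(sample >= lo) & (sample <= hi)]
    # Empirical coefficients: (1/n) * sum_k phi_j(Z_k) over points in [lo, hi].
    coef = trig_basis(inside, lo, hi, max_freq).sum(axis=0) / n
    dims = 2 * np.arange(max_freq + 1) + 1  # nested models: 1, 3, 5, ...
    # Contrast of the projection on each model: minus the sum of squared coefficients.
    contrast = np.array([-np.sum(coef[:d] ** 2) for d in dims])
    best_dim = int(dims[np.argmin(contrast + cpen * dims / n)])

    def estimate(x):
        return trig_basis(x, lo, hi, max_freq)[:, :best_dim] @ coef[:best_dim]

    return estimate, best_dim
```

For example, `fit_density(z, 0.5, 3.0, max_freq=10, cpen=2.0)` returns a callable estimate and the selected dimension for a sample `z`; the interval and the value of `cpen` here are placeholders.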
Figure 2 shows the selected dimension with respect to cpen, the constant in the penalty. When the constant in the penalty increases, the chosen dimension first decreases very rapidly, until cpen=, then it decreases very slowly towards 1. Then . Our chosen penalty constant, , is a little greater than , and selects the same dimension (here 17).
However, the slope heuristic involves quite a lot of computations, so it cannot be used for every simulation, only to check that the penalty constant is consistent.
In Figures 3 and 4, for each graph, five simulations of the PDMP with are run. For each simulation, the estimator , the density and are drawn.
In the tables, 200 simulations for each 4-tuple are computed. The estimation interval is such that is greater than the threshold on for for all our models. For each set of parameters, the mean of the selected dimension , the mean and the standard deviation of the error on , denoted by "risk" and "sd", are calculated. We also want to check that our estimator is truly adaptive. As is unknown, we cannot check that is the best choice for estimating . Instead, let us consider the estimator
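The dimension-jump version of the slope heuristic can be sketched as follows. The code sweeps a grid of penalty constants, records the selected dimension for each, locates the largest drop, and returns twice the constant at which it occurs, following the usual rule of Arlot and Massart [1] that the optimal penalty is about twice the minimal one. The penalty shape `c * dimension / n` and all names are assumptions made for this illustration.

```python
import numpy as np


def dimension_jump(contrast, dims, n, c_grid):
    """Locate the minimal penalty constant by the largest dimension drop.

    contrast[i] is the empirical contrast of the model of dimension dims[i];
    the penalty is assumed to be c * dims / n (illustrative shape).
    """
    contrast = np.asarray(contrast, dtype=float)
    dims = np.asarray(dims)
    c_grid = np.asarray(c_grid, dtype=float)
    # Dimension selected for each candidate constant.
    selected = np.array([dims[np.argmin(contrast + c * dims / n)] for c in c_grid])
    # Largest drop between consecutive constants marks the minimal constant.
    drops = selected[:-1] - selected[1:]
    c_min = c_grid[int(np.argmax(drops)) + 1]
    return c_min, 2.0 * c_min, selected
```

Each call requires one model selection per grid point, which is why such a sweep is costly to repeat across many simulated samples.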
[TABLE]
Then The optimal dimension is
[TABLE]
and the minimal risk . In the tables, we give the empirical means of , , the empirical mean and standard deviation of the risk and the empirical mean of the oracle
[TABLE]
In Figure 5, four simulations are run, each for a different value of (, , and ) in order to show the convergence of our estimator.
Results
In Figures 3-4, the estimator is very close to , at least when is neither too small nor too large, that is, when there are enough values to compute the estimator. The estimators and are quite smooth, whereas tends to oscillate. This is due to the division of two estimators. In Tables 3-4, the risk decreases when increases and seems to tend toward 0. The oracle remains close to 1, which indicates that our estimator is adaptive. When the number of observations is small, the risk may seem quite large (for instance, for Figure 4 when ). This is simply because is smaller than the threshold (), and the estimator is set to 0 on some part of , or even on the whole interval. The estimation near 0 can be good for some models, for instance when and , because the random variables take smaller values (at a jump, we divide the process by 5 instead of by 2). The function then takes higher values near 0, and the estimator is positive even for small values of . This problem is illustrated in Figure 5: when increases, the estimator is better both because the support interval of increases and because on the support interval, the estimator is closer to the true function.
References
- [1] S. Arlot and P. Massart. Data-driven calibration of penalties for least-squares regression. Journal of Machine Learning Research, 10:245–279, 2009.
- [2] R. Azaïs and A. Genadot. A new characterization of the jump rate for piecewise-deterministic Markov processes with discrete transitions. Comm. Statist. Theory Methods, 47(8):1812–1829, 2018.
- [3] R. Azaïs and A. Muller-Gueudin. Optimal choice among a class of nonparametric estimators of the jump rate for piecewise-deterministic Markov processes. Electron. J. Stat., 10(2):3648–3692, 2016. ISSN 1935-7524.
- [4] R. Azaïs, J.-B. Bardet, A. Génadot, N. Krell, and P.-A. Zitt. Piecewise deterministic Markov process—recent results. In Journées MAS 2012, volume 44 of ESAIM Proc., pages 276–290. EDP Sci., Les Ulis, 2014.
- [5] R. Azaïs, F. Dufour, and A. Gégout-Petit. Non-parametric estimation of the conditional distribution of the interjumping times for piecewise-deterministic Markov processes. Scandinavian Journal of Statistics. Theory and Applications, 41(4):950–969, 2014.
- [6] A. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301–413, 1999. ISSN 0178-8051.
- [7] P. Baxendale. Renewal theory and computable convergence rates for geometrically ergodic Markov chains. The Annals of Applied Probability, 15(1B):700–738, 2005.
- [8] L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998. ISSN 1350-7265.
