Projection Theorems and Estimating Equations for Power-Law Models
Atin Gayen, M. Ashok Kumar

| Degrees of freedom of the model | MLE | MLE (without outliers) | Jones et al. estimator |
|---|---|---|---|
| () | 2.1733 | 0.1721 | 0.3674 |
| () | 2.2730 | 0.2770 | 0.3186 |

| Parameters () | Estimators using uniform kernel | | Estimators using Epanechnikov kernel | |
|---|---|---|---|---|
| () | 0.7689 | 21.1981 | 0.7689 | 21.3493 |
| () | 0.1535 | 1.6649 | 0.1535 | 1.6844 |
| () | 0.0480 | 1.4575 | 0.0480 | 1.4567 |
| () | 0.0585 | 1.3061 | 0.0585 | 1.3049 |
| () | 0.0119 | 1.2707 | 0.0119 | 1.2736 |
| () | -0.0383 | 1.1307 | -0.0383 | 1.1312 |
| () | 0.0087 | 1.0961 | 0.0087 | 1.0919 |
| () | 0.0045 | 1.0760 | 0.0045 | 1.0737 |
Projection Theorems and Estimating Equations for Power-Law Models
Atin Gayen and M. Ashok Kumar
Discipline of Mathematics
Indian Institute of Technology Palakkad
Kerala 678557, India
Email: [email protected]; [email protected]
Abstract
We extend projection theorems concerning Hellinger and Jones et al. divergences to the continuous case. These projection theorems reduce certain estimation problems on generalized exponential models to linear problems. We introduce the notion of regularity for generalized exponential models and show that the projection theorems in this case are similar to the ones in discrete and canonical case. We also apply these ideas to solve certain estimation problems concerning Student and Cauchy distributions.
Index Terms:
Cauchy distribution, Divergence, Estimating equation, Power-law family, Projection theorem, Student distribution.
I Introduction
A divergence is a non-negative extended real-valued function $D$ defined for any pair of probability distributions $(p,q)$ satisfying $D(p,q)=0$ if and only if $p=q$. The minimum divergence (or distance) method is popular in statistical inference because of its many desirable properties, including robustness and efficiency [6, 53]. Minimization of the information divergence ($I$-divergence), or relative entropy, is closely related to maximum likelihood estimation (MLE) [27, Lem. 3.1]. MLE is not a preferred method when the data are contaminated by outliers. However, the $I$-divergence can be extended by replacing the logarithmic function with a power function to produce divergences that are robust to outliers [5, 38, 18]. In this paper, we consider three such families of divergences that are well known in the context of robust statistics. They are defined as follows.
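Let $p$ and $q$ be probability distributions having a common support $\mathbb{S}\subseteq\mathbb{R}^d$. Let $\alpha>0$, $\alpha\neq 1$.
- (a) The Hellinger divergence (also known as Cressie-Read power divergence [19] or power divergence [54] and, up to a monotone function, the same as the Rényi divergence [56]):
$$D_\alpha(p,q):=\frac{1}{\alpha-1}\Big[\int p^{\alpha}q^{1-\alpha}\,d{\bf x}-1\Big]. \tag{1}$$
- (b) The Basu et al. divergence (also known as power pseudo-distance [12, 13], density power divergence [5, 51, 39], $\beta$-divergence [49]):
$$B_\alpha(p,q):=\frac{\alpha}{1-\alpha}\int p\,q^{\alpha-1}\,d{\bf x}-\frac{1}{1-\alpha}\int p^{\alpha}\,d{\bf x}+\int q^{\alpha}\,d{\bf x}. \tag{2}$$
- (c) The Jones et al. divergence [59, 46, 60, 32] (also known as relative $\alpha$-entropy [42, 43], Rényi pseudo-distance [13, 12], logarithmic density power divergence [47], projective power divergence [30], $\gamma$-divergence [32, 18]):
$$\mathscr{I}_\alpha(p,q):=\frac{\alpha}{1-\alpha}\log\int p\,q^{\alpha-1}\,d{\bf x}-\frac{1}{1-\alpha}\log\int p^{\alpha}\,d{\bf x}+\log\int q^{\alpha}\,d{\bf x}. \tag{3}$$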
Throughout the paper we assume that all the integrals are well defined over $\mathbb{S}$. The integrals are with respect to the Lebesgue measure on $\mathbb{R}^d$ in the continuous case and with respect to the counting measure in the discrete case. Many well-known divergences fall in the above classes. For example, the Chi-square divergence, Bhattacharyya distance [9] and Hellinger distance [7] fall in the Hellinger ($D_\alpha$) class; the Cauchy-Schwarz divergence [16, Eq. (2.90)] falls in the Jones et al. ($\mathscr{I}_\alpha$) class; the squared Euclidean distance falls in the Basu et al. ($B_\alpha$) class [5]. All three classes of divergences coincide with the $I$-divergence as $\alpha\to 1$ [18], where
$$I(p,q):=\int p\log\frac{p}{q}\,d{\bf x}. \tag{4}$$
In this sense, each of these three classes of divergences can be regarded as a generalization of the $I$-divergence.
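The limiting behaviour described above can be checked numerically on a small finite alphabet. The following is a minimal sketch, assuming the usual forms of the three divergences (the Hellinger form $\frac{1}{\alpha-1}[\sum p^\alpha q^{1-\alpha}-1]$, the Basu et al. form $\frac{\alpha}{1-\alpha}\sum pq^{\alpha-1}-\frac{1}{1-\alpha}\sum p^\alpha+\sum q^\alpha$, and the logarithmic Jones et al. analogue); the function names and the two test distributions are illustrative choices, not taken from the paper.

```python
import math

def hellinger_div(p, q, alpha):
    """D_alpha(p, q) = (sum p^alpha q^(1-alpha) - 1) / (alpha - 1)."""
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (s - 1.0) / (alpha - 1.0)

def basu_div(p, q, alpha):
    """B_alpha(p, q), the density power divergence in the alpha - 1 parametrization."""
    cross = sum(pi * qi ** (alpha - 1) for pi, qi in zip(p, q))
    return (alpha / (1 - alpha)) * cross \
        - sum(pi ** alpha for pi in p) / (1 - alpha) \
        + sum(qi ** alpha for qi in q)

def jones_div(p, q, alpha):
    """I_alpha(p, q), the logarithmic counterpart of basu_div."""
    cross = sum(pi * qi ** (alpha - 1) for pi, qi in zip(p, q))
    return (alpha / (1 - alpha)) * math.log(cross) \
        - math.log(sum(pi ** alpha for pi in p)) / (1 - alpha) \
        + math.log(sum(qi ** alpha for qi in q))

def kl_div(p, q):
    """I(p, q) = sum p log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Illustrative distributions on a three-point alphabet (assumed for the demo).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

for alpha in (0.9999, 1.0001):
    print(alpha, hellinger_div(p, q, alpha), basu_div(p, q, alpha),
          jones_div(p, q, alpha), kl_div(p, q))
```

With $\alpha=2$ the Basu et al. form reduces to $\sum (p-q)^2$, which matches the remark that the squared Euclidean distance belongs to that class.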
The Hellinger divergences $D_\alpha$ also arise as generalized cut-off rates in information theory [23]. The Basu et al. divergences $B_\alpha$ belong to the Bregman class, which is characterized by transitive projection rules [22, Eq. (3.2), Theorem 3], [39, Example 3]. The Jones et al. divergences $\mathscr{I}_\alpha$ (for $\alpha<1$) arise in information theory as a redundancy measure in the mismatched cases of guessing [59], source coding [43] and encoding of tasks [15]. The three classes of divergences are closely related to robust estimation, for $\alpha>1$ in the case of $B_\alpha$ and $\mathscr{I}_\alpha$, and for $\alpha<1$ in the case of $D_\alpha$, as we shall see now.
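Let $X_1,\dots,X_n$ be an independent and identically distributed (i.i.d.) sample drawn from an unknown distribution $p$. Let us suppose that $p$ is a member of a parametric family of probability distributions $\Pi=\{p_\theta:\theta\in\Theta\}$, where $\Theta$ is an open subset of $\mathbb{R}^k$ and all $p_\theta$ have a common support $\mathbb{S}\subseteq\mathbb{R}^d$. MLE picks the distribution $p_{\theta^*}\in\Pi$ that would have most likely caused the sample. MLE solves the so-called score equation or estimating equation for $\theta$, given by
$$\frac{1}{n}\sum_{i=1}^{n}s(X_i;\theta)=0, \tag{5}$$
where $s({\bf x};\theta):=\nabla\log p_\theta({\bf x})$ is called the score function and $\nabla$ stands for the gradient with respect to $\theta$. In the discrete case, the above equation can be re-written as
$$\sum_{{\bf x}\in\mathbb{S}}p_n({\bf x})\,s({\bf x};\theta)=0, \tag{6}$$
where $p_n$ is the empirical measure of the sample $X_1,\dots,X_n$.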
Let us now suppose that the sample $X_1,\dots,X_n$ is from a mixture distribution of the form $p_\epsilon=(1-\epsilon)p+\epsilon\delta$, $\epsilon\in[0,1)$, where $p$ is supposed to be a member of the parametric family $\Pi=\{p_\theta:\theta\in\Theta\}$; $p$ is regarded as the distribution of “true” samples and $\delta$ as that of outliers. Assume that the support of $\delta$ is a subset of $\mathbb{S}$. While the usual MLE tries to fit a distribution for $p_\epsilon$, robust estimation tries to fit one for $p$. Throughout the paper, the above will be the setup in all the estimation problems, unless otherwise stated. Thus, for robust estimation, one needs to modify the estimating equation so that the effect of outliers is down-weighted. The following modified estimating equation, referred to as the generalized Hellinger estimating equation, was proposed in [4], where the score function is weighted by $p_n^{\alpha}p_\theta^{1-\alpha}$ instead of $p_n$ as in (6):
$$\sum_{{\bf x}\in\mathbb{S}}p_n({\bf x})^{\alpha}\,p_\theta({\bf x})^{1-\alpha}\,s({\bf x};\theta)=0, \tag{7}$$
where $\alpha\in(0,1)$. This was proposed based on the following intuition. If ${\bf x}$ is an outlier, then $p_n({\bf x})^{\alpha}p_\theta({\bf x})^{1-\alpha}$ will be smaller than $p_n({\bf x})$ for sufficiently small values of $\alpha$. Hence the terms corresponding to outliers in (7) are down-weighted (cf. [6, Section 4.3] and the references therein).
Notice that (7) does not extend to the continuous case due to the appearance of $p_n^{\alpha}$. In the literature, this technical difficulty is avoided by smoothing techniques such as kernel density estimation [7, Section 3], [6, Section 3.1, 3.2.1], the Basu-Lindsay approach [6, Section 3.5], the modified approach of Cao et al. [17], and so on, which provide a continuous estimate of $p_n$. The resulting estimating equation is of the form
$$\int \widetilde{p}_n({\bf x})^{\alpha}\,p_\theta({\bf x})^{1-\alpha}\,s({\bf x};\theta)\,d{\bf x}=0, \tag{8}$$
where $\widetilde{p}_n$ is some continuous estimate of $p_n$. To avoid this smoothing, Broniatowski et al. derived a duality technique where one first finds a dual representation for the Hellinger distance and then minimizes the empirical estimate of this dual representation to find the estimator. The empirical estimate of this dual representation does not require any smoothing. See [11, 62, 13, 12, 10, 50] for details.
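The smoothing step mentioned above can be illustrated with a plain kernel density estimate. The sketch below is only an illustration under stated assumptions: it uses the textbook uniform and Epanechnikov kernels, a hand-picked bandwidth `h`, and a made-up one-dimensional sample; none of these choices is taken from the paper.

```python
def uniform_kernel(u):
    """Uniform (box) kernel on [-1, 1]."""
    return 0.5 if abs(u) <= 1.0 else 0.0

def epanechnikov_kernel(u):
    """Epanechnikov kernel 0.75 * (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(x, sample, h, kernel):
    """Kernel density estimate at x: (1 / (n h)) * sum K((x - X_i) / h)."""
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

def integrate(f, a, b, n=20000):
    """Composite trapezoidal rule, used here only as a sanity check."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step

# Hypothetical sample and bandwidth, chosen only for the demonstration.
sample = [-1.0, -0.2, 0.1, 0.9, 1.5]
h = 0.5

mass_uniform = integrate(lambda x: kde(x, sample, h, uniform_kernel), -5.0, 5.0)
mass_epan = integrate(lambda x: kde(x, sample, h, epanechnikov_kernel), -5.0, 5.0)
print(mass_uniform, mass_epan)
```

Both estimates integrate to approximately one, so either can stand in for the continuous estimate of the empirical measure in an equation of the form (8).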
The following estimating equation, where the score function is weighted by a power of the model density and equated to its hypothetical counterpart, was proposed by Basu et al. [5]:
$$\frac{1}{n}\sum_{i=1}^{n}p_\theta(X_i)^{\alpha-1}s(X_i;\theta)=\int p_\theta({\bf x})^{\alpha}s({\bf x};\theta)\,d{\bf x}, \tag{9}$$
where $\alpha>1$. Motivated by the works of Field and Smith [31] and Windham [68], an alternate estimating equation, where the weights are further normalized, was proposed by Jones et al. [38]:
$$\frac{\frac{1}{n}\sum_{i=1}^{n}p_\theta(X_i)^{\alpha-1}s(X_i;\theta)}{\frac{1}{n}\sum_{i=1}^{n}p_\theta(X_i)^{\alpha-1}}=\frac{\int p_\theta({\bf x})^{\alpha}s({\bf x};\theta)\,d{\bf x}}{\int p_\theta({\bf x})^{\alpha}\,d{\bf x}}, \tag{10}$$
where $\alpha>1$. Notice that (9) and (10) do not require the use of the empirical distribution. Hence no smoothing is required in these cases. The estimators of (8), (9) and (10) are consistent and asymptotically normal [5, Theorem 2], [38, Section 3], [7, Theorem 3]. They also satisfy two invariance properties: one when the underlying model is re-parameterized by a one-to-one function of the parameter [5, Section 3.4], and the other when the samples are replaced by a linear transformation of them [61, Theorem 3.1], [5, Section 3.4]. They coincide with the ML-estimating equation (5) when $\alpha=1$ under the condition that $\int p_\theta({\bf x})s({\bf x};\theta)\,d{\bf x}=0$. The estimating equations (5), (8), (9) and (10) are, respectively, associated with the divergences in (4), (1), (2), and (3) in a sense that will be made clear in the following.
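To see the down-weighting at work, consider a case that is easy to solve by hand. The sketch below assumes a one-dimensional Gaussian location model with known scale (an illustrative model, not one treated in the text). For this model the score is proportional to $x-\theta$ and the model-side integral of the weighted score vanishes by symmetry, so an estimating equation that weights the score by $p_\theta(X_i)^{\alpha-1}$ reduces to a weighted-mean fixed point. The sample, the starting point and the iteration count are all hypothetical.

```python
import math

def robust_location(sample, alpha, sigma=1.0, iters=200):
    """Fixed-point iteration theta <- sum(w_i x_i) / sum(w_i),
    with weights w_i = p_theta(x_i)^(alpha - 1) for a Gaussian N(theta, sigma^2).
    The Gaussian normalizing constant cancels in the ratio."""
    theta = sorted(sample)[len(sample) // 2]  # start from a median-like point
    for _ in range(iters):
        weights = [math.exp(-(alpha - 1.0) * (x - theta) ** 2 / (2.0 * sigma ** 2))
                   for x in sample]
        theta = sum(w * x for w, x in zip(weights, sample)) / sum(weights)
    return theta

# Hypothetical data: nine inliers symmetric about 0 and two gross outliers.
inliers = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.1, 0.5, 0.8, 1.2]
outliers = [10.0, 11.0]
data = inliers + outliers

mle = sum(data) / len(data)               # alpha = 1 gives the sample mean
robust = robust_location(data, alpha=2.0)  # alpha > 1 down-weights outliers
print(mle, robust)
```

The sample mean is pulled towards the outliers, while the weighted fixed point stays close to the centre of the inliers because the outliers receive weights that are numerically negligible.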
Observe that the estimating equations (5), (8), (9), and (10) are implications of the first order optimality condition of maximizing, respectively, the usual log-likelihood function
[TABLE]
and the following generalized likelihood functions
[TABLE]
The above likelihood functions (12), (13) and (14) are not defined for $\alpha=1$. However, it can be shown that they all coincide with the log-likelihood function (11) as $\alpha\to 1$.
It is easy to see that the probability distribution that maximizes (12), (11), (13) or (14) is the same as, respectively, the one that minimizes $D_\alpha(\widetilde{p}_n,p_\theta)$ or the empirical estimates of $I(p,p_\theta)$, $B_\alpha(p,p_\theta)$ or $\mathscr{I}_\alpha(p,p_\theta)$. Thus for MLE or “robustified MLE,” one needs to solve
[TABLE]
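where $D$ is either $I$, $D_\alpha$, $B_\alpha$ or $\mathscr{I}_\alpha$; $\alpha>1$ when $D$ is $B_\alpha$ or $\mathscr{I}_\alpha$, and $\alpha<1$ when $D$ is $D_\alpha$. Notice that (8) for $\alpha>1$, and (9) and (10) for $\alpha<1$, do not make sense in terms of robustness. However, they still serve as first-order optimality conditions for the divergence minimization problem (15). A probability distribution that attains the infimum is known as a reverse $D$-projection of $p$ on the parametric family $\Pi$.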
A “dual” minimization problem is the so-called forward projection problem, where the minimization is over the first argument of the divergence function. Given a set $\mathbb{C}$ of probability distributions with support $\mathbb{S}$ and a probability distribution $q$ with the same support, any $p^*\in\mathbb{C}$ that attains
$$\inf_{p\in\mathbb{C}}D(p,q) \tag{16}$$
is called a forward $D$-projection of $q$ on $\mathbb{C}$. Forward projection is usually on a convex set or on an $\alpha$-convex set of probability distributions. Forward projection on a convex set is motivated by the well-known maximum entropy principle of statistical physics [36]. Motivation for forward projection on an $\alpha$-convex set comes from the so-called non-extensive statistical physics [63, 65, 64, 42]. Forward $I$-projection on a convex set was extensively studied by Csiszár [20, 21, 24], Csiszár and Matúš [26, 25], Csiszár and Shields [27], and Csiszár and Tusnády [28].
The forward projections of any of the divergences in (1)-(4) on convex (or $\alpha$-convex) sets of probability distributions yield a parametric family of probability distributions. A reverse projection on this parametric family turns into a forward projection on the convex (or $\alpha$-convex) set, which further reduces to solving a system of linear equations. We call such a result a projection theorem of the divergence. These projection theorems are mainly due to an “orthogonal” relationship between the convex (or the $\alpha$-convex) family and the associated parametric family. The Pythagorean theorem of the associated divergence plays a key role in this context.
The projection theorem of the $I$-divergence is due to Csiszár and Shields [27, p. 24], where the convex family is a linear family and the associated parametric family is an exponential family. The projection theorem for the Jones et al. divergence $\mathscr{I}_\alpha$ was established by Kumar and Sundaresan [43, Theorem 18 and Theorem 21], where the so-called $\alpha$-power-law family ($\mathbb{M}^{(\alpha)}$-family) plays the role of the exponential family. The projection theorem for the Hellinger divergence $D_\alpha$ was established by Kumar and Sason [41, Theorem 6], where a variant of the $\alpha$-power-law family, called the $\alpha$-exponential family ($\mathbb{E}^{(\alpha)}$-family), plays the role of the exponential family and the so-called $\alpha$-linear family plays the role of the linear family. The projection theorem for the more general class of Bregman divergences, of which the Basu et al. divergence $B_\alpha$ is a subclass, was established by Csiszár and Matúš [26] using techniques from convex analysis. (See also [52].) We observe that the parametric family associated with the projection theorem of the $B_\alpha$-divergence is closely related to the $\alpha$-power-law family; we call it a $\mathbb{B}^{(\alpha)}$-family.
Thus projection theorems enable us to find the estimator (MLE or any of the generalized estimators) as a forward projection if the estimation is done under a specific parametric family. While for MLE the required family is exponential, for the generalized estimations, it is one of the power-law families.
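The classical instance of this statement is the exponential family, where the MLE is characterized by matching the model expectation of the statistic to its sample average, a constraint that is linear in the unknown distribution. The following sketch illustrates this on a finite alphabet; the base measure `q`, the statistic `f`, the sample, and the bisection bounds are all assumptions made for the demonstration.

```python
import math

def tilted(q, f, theta):
    """Member of the exponential family p_theta(x) proportional to q(x) exp(theta f(x))."""
    weights = [qi * math.exp(theta * fi) for qi, fi in zip(q, f)]
    z = sum(weights)
    return [w / z for w in weights]

def mean_f(p, f):
    """Expectation of the statistic f under p (linear in p)."""
    return sum(pi * fi for pi, fi in zip(p, f))

def mle_by_moment_matching(q, f, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Solve E_theta[f] = target by bisection; E_theta[f] is increasing in theta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_f(tilted(q, f, mid), f) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def log_likelihood(q, f, theta, sample):
    p = tilted(q, f, theta)
    return sum(math.log(p[x]) for x in sample)

# Hypothetical setup: alphabet {0, 1, 2, 3}, uniform base measure, f(x) = x.
q = [0.25, 0.25, 0.25, 0.25]
f = [0.0, 1.0, 2.0, 3.0]
sample = [0, 1, 1, 2, 3, 3, 3, 2]
target = sum(f[x] for x in sample) / len(sample)

theta_hat = mle_by_moment_matching(q, f, target)
print(theta_hat, mean_f(tilted(q, f, theta_hat), f), target)
```

The moment-matched parameter also maximizes the log-likelihood, which is the sense in which the reverse projection on the family can be found through a linear constraint on the distribution.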
Our main contributions in this paper are the following.
1. The projection theorem for the Jones et al. divergence $\mathscr{I}_\alpha$ is known in the literature only for the discrete, canonical case. We first define the associated power-law family in a more general setup and establish the projection theorem for $\mathscr{I}_\alpha$ on this family.
2. We derive the projection theorem for the Hellinger divergence $D_\alpha$ on the $\alpha$-exponential family in more generality by establishing a one-to-one correspondence between this problem and the projection problem concerning the $\mathscr{I}_\alpha$-divergence on the associated $\alpha$-power-law family.
3. We introduce the concept of regularity (full-rank family) for the power-law families $\mathbb{B}^{(\alpha)}$, $\mathbb{M}^{(\alpha)}$ and $\mathbb{E}^{(\alpha)}$. We also establish a close relationship among them.
4. We show that the Cauchy distributions (also known as $q$-Gaussian distributions [55, 66, 52, 34, 48]) are the escort distributions of the Student distributions [37], [30]. Also, Cauchy and Student distributions, respectively, form regular $\mathbb{E}^{(\alpha)}$ and regular $\mathbb{M}^{(\alpha)}$ (and $\mathbb{B}^{(\alpha)}$) families.
5. We find some generalized estimators for the location and scale parameters of the Student and Cauchy distributions using the projection theorems of the Jones et al. and Hellinger divergences. We also observe that these projection theorems cannot be applied when the distributions are compactly supported. In this case the estimators should be found on a case-by-case basis. We find estimators in one such case and compare them with the MLE.
The rest of the paper is organized as follows. In Section II, we first generalize the power-law families to the continuous case and show that the Student and Cauchy distributions belong to this class. We also introduce the notion of regularity for these power-law families and establish the relationship among them in this section. In Section III, we establish projection theorems for the general power-law families. In Section IV, we apply the projection theorems to Student and Cauchy distributions to find generalized estimators for their parameters. We also perform some simulations to analyze the efficacy of such estimators. We end the paper with a summary and concluding remarks in Section V. In the Appendix, we establish the projection theorem of the Basu et al. divergence $B_\alpha$ in the discrete case using elementary tools and identify the parametric family associated with this divergence.
II The power-law families: definition and examples
In this section, we define the power-law families associated with the projection theorems of the divergences $B_\alpha$, $\mathscr{I}_\alpha$ and $D_\alpha$ in a more general setup than the one studied in the literature. We also introduce the concept of regularity for these families. In the literature such a notion has been studied for the exponential family, where it is sometimes referred to as a full-rank family (see [44, 35]). We then make a comparison among these families. We also show that the well-known Student and Cauchy distributions can be expressed as regular power-law families.
II-A The $\mathbb{B}^{(\alpha)}$-family
Motivation for the $\mathbb{B}^{(\alpha)}$-family comes from the forward projection of the $B_\alpha$-divergence on a linear family (see (93)). Csiszár and Matúš [26] studied a more general form of this family in connection with the projection problems of Bregman divergences.
Definition 1
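Consider a family of probability distributions $\{p_\theta:\theta\in\Theta\}$ on $\mathbb{R}^d$, where $\Theta$ is an open subset of $\mathbb{R}^k$. Let $\mathbb{S}_\theta$ be the support of $p_\theta$ (which may depend on $\theta$). Let $w=[w_1,\dots,w_s]^\top$ and $f=[f_1,\dots,f_s]^\top$, where $w_i:\Theta\to\mathbb{R}$ is differentiable for $i=1,\dots,s$, $f_i:\mathbb{R}^d\to\mathbb{R}$ for $i=1,\dots,s$ and $h:\mathbb{R}^d\to\mathbb{R}$. The family is said to form a $k$-parameter $\mathbb{B}^{(\alpha)}$-family characterized by $h$, $w$ and $f$ if
$$p_\theta({\bf x})=\big[h({\bf x})+F(\theta)+w(\theta)^\top f({\bf x})\big]^{1/(\alpha-1)}\quad\text{for }{\bf x}\in\mathbb{S}_\theta \tag{19}$$
for some differentiable function $F:\Theta\to\mathbb{R}$. Here $F(\theta)$ is the normalizing factor that can be determined from $\int p_\theta({\bf x})\,d{\bf x}=1$.
The family is said to be regular if, in addition, the following conditions are satisfied.
- (i) the support $\mathbb{S}_\theta$ does not depend on the parameter $\theta$,
- (ii) the number of $w_i$’s equals the number of $\theta_i$’s, that is, $s=k$,
- (iii) the functions $1,w_1,\dots,w_s$ are linearly independent on $\Theta$,
- (iv) the functions are linearly independent on $\mathbb{S}$.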
Further, it is said to be in canonical form if for . The natural parameter space in this case is given by the set of all such that on and .
Observe that -family is a special case of the family in [26, Eq. (28)] with and . Bashkirov [3, Eq. (15)] derived maximum Rényi entropy distribution subject to linear constraints on underlying probability distribution, as in (19), and called it to be in S-form. Naudts [51, Ex. 4] derived the canonical -family with as the ‘free energy’ minimizing distributions with respect to Tsallis entropy. We shall now see some examples of -family.
Example 1 (Student distributions)
Let , be a symmetric, positive-definite matrix of order and . The -dimensional Student distribution with location parameter , scale parameter and degrees of freedom parameter , with when , is given by
[TABLE]
where for a real number , . The support of this distribution is given by
[TABLE]
and the normalizing factor
[TABLE]
It should be noted that Student distributions are not defined for when as (20) is not integrable in this case. While these distributions do not have finite mean for , they do not have finite variance for . For all other values of , the mean and covariance matrix of these distributions are given by $\mu$ and $\Sigma$, respectively. Further, (20) coincides with a normal distribution when .
Let . Then and correspond to from the left and the right respectively. Let . Then (20) can be re-written as
[TABLE]
where , and , the normalizing factor. Notice that the Student distribution with is not considered in (21) as corresponds to an infinite value of . For a matrix , we use the following notations.
[TABLE]
that is, is a column vector of dimension where its -th element is for . With these notations (21) can be re-written, for , as
[TABLE]
where equality (a) follows because is a scalar, (b) follows because , and (c) follows because . Comparing the above expression with (19), we conclude that the Student distributions form a -parameter -family with
[TABLE]
where
[TABLE]
The distributions in (20) for were studied by Johnson and Vignat [37, Definition 1] as the maximizer of Rényi entropy under covariance constraint, where they classified them as Student-t when and Student-r when (see also [3]). For simplicity we just call them Student distributions. Observe that (20) for is the usual -dimensional -distribution.
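The statement that the location and scale parameters coincide with the mean and covariance can be checked numerically in one dimension. The sketch below assumes the covariance-parametrized density proportional to $\big[1+(x-\mu)^2/((\nu-2)\sigma^2)\big]^{-(\nu+1)/2}$ with $\nu>2$, which is the standard form of a $t$-density rescaled to variance $\sigma^2$; whether this matches (20) term by term is an assumption, and the parameter values and integration settings are illustrative.

```python
def student_moments(mu, sigma, nu, half_width=200.0, n=200_000):
    """Mean and variance of the density proportional to
    [1 + (x - mu)^2 / ((nu - 2) sigma^2)]^(-(nu + 1) / 2),
    computed with the trapezoidal rule on [mu - L sigma, mu + L sigma]."""
    def g(x):
        return (1.0 + (x - mu) ** 2 / ((nu - 2.0) * sigma ** 2)) ** (-(nu + 1.0) / 2.0)

    a = mu - half_width * sigma
    step = 2.0 * half_width * sigma / n
    m0 = m1 = m2 = 0.0
    for i in range(n + 1):
        x = a + i * step
        end_weight = 0.5 if i in (0, n) else 1.0
        gx = end_weight * g(x)
        m0 += gx          # normalizing integral (up to the step factor)
        m1 += gx * x      # first moment
        m2 += gx * x * x  # second moment
    mean = m1 / m0
    var = m2 / m0 - mean ** 2
    return mean, var

# Hypothetical parameters: location 1, scale 2, five degrees of freedom.
mean, var = student_moments(mu=1.0, sigma=2.0, nu=5.0)
print(mean, var)
```

Under these assumptions the computed mean is close to `mu` and the computed variance is close to `sigma ** 2`, up to a small truncation error from the finite integration range.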
Theorem 2
The Student distributions for (that is, ) form a regular -family.
Proof 1
Let be the inverse of . The characterizing functions ’s and ’s in (24) are given by
[TABLE]
such that
[TABLE]
and
[TABLE]
where
[TABLE]
Note that the number of ’s and ’s = , which is same as the number of unknown parameters ’s. Also , ’s and ’s are linearly independent on . Hence it remains to show only that , ’s and ’s are linearly independent on . Suppose that
[TABLE]
Dividing both sides by ,
[TABLE]
Taking partial derivative with respect to in (25),
[TABLE]
where is the zero vector in . Since , from (26) we must have Thus (25) becomes
[TABLE]
For , ,
[TABLE]
where $k_{\theta}:=\big[c\,b_{\alpha}^{-1}(\alpha-1)N_{\theta,\alpha}^{1-\alpha}\big]\big/\big[2|\boldsymbol{\Sigma}^{-1}|\big]$ and denotes partial derivative with respect to . Thus differentiating (27) with respect to , for , , . Using these values in (27),
[TABLE]
Since is symmetric,
[TABLE]
Using this in (28),
[TABLE]
Since , then . This implies and thus for all , . Hence , ’s and ’s are linearly independent. This completes the proof.
Remark 1
Student distributions for do not form a regular $\mathbb{B}^{(\alpha)}$-family as their support, in this case, depends on the unknown parameters.
Example 2
Wigner semi-circle distributions [67] form a $\mathbb{B}^{(\alpha)}$-family.
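II-B The $\mathbb{M}^{(\alpha)}$-family
We now define the parametric family associated with the projection theorem of the Jones et al. divergence $\mathscr{I}_\alpha$. Kumar and Sundaresan [43] studied this family in the discrete case.
Definition 3
Let $h$, $w$ and $f$ be as in Definition 1. The family of probability distributions $\{p_\theta:\theta\in\Theta\}$ is said to form a $k$-parameter $\alpha$-power-law family or an $\mathbb{M}^{(\alpha)}$-family characterized by $h$, $w$ and $f$ if
$$p_\theta({\bf x})=Z(\theta)\big[h({\bf x})+w(\theta)^\top f({\bf x})\big]^{1/(\alpha-1)}\quad\text{for }{\bf x}\in\mathbb{S}_\theta \tag{31}$$
for some differentiable function $Z$. Here $Z(\theta)$ is the normalizing factor, which is given by $Z(\theta)=1\big/\int\big[h({\bf x})+w(\theta)^\top f({\bf x})\big]^{1/(\alpha-1)}d{\bf x}$.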
Bashkirov [3] derived a specific form of (31) in connection with Rényi entropy maximization and called it to be in Z-form.
The family is said to be regular if, along with (i)-(iii) of Definition 1, also the functions are linearly independent on . Further, it is said to be canonical if for . The natural parameter space of this family is the set of all such that on and .
Example 3
The Student distributions in (21) can be re-written as
[TABLE]
Let . Note that if . However, when , we consider the restricted parameter space such that . Thus the above expression can be re-written, for , as
[TABLE]
Comparing (33) and (31), we see that Student distributions form a -parameter -family with
[TABLE]
where
[TABLE]
This suggests a close relationship between the $\mathbb{B}^{(\alpha)}$ and $\mathbb{M}^{(\alpha)}$ families. In the following, we elucidate this fact in more detail.
Remark 2
- (a)
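An $\mathbb{M}^{(\alpha)}$-family can be expressed as a $\mathbb{B}^{(\alpha)}$-family: Any $p_\theta$ in an $\mathbb{M}^{(\alpha)}$-family as in (31) can be re-written, for ${\bf x}\in\mathbb{S}$, as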
[TABLE]
with , $\widetilde{w}(\theta)=\big[Z(\theta)^{\alpha-1},Z(\theta)^{\alpha-1}w_{1}(\theta),\ldots,Z(\theta)^{\alpha-1}w_{s}(\theta)\big]^{\top}$ and $\widetilde{f}({\bf x})=\big[h({\bf x}),f_{1}({\bf x}),\ldots,f_{s}({\bf x})\big]^{\top}$. This implies that these also form a -parameter -family but characterized by and .
- (b)
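A $\mathbb{B}^{(\alpha)}$-family can be expressed as an $\mathbb{M}^{(\alpha)}$-family: Any $p_\theta$ in a $\mathbb{B}^{(\alpha)}$-family as in (19) can be re-written, for ${\bf x}\in\mathbb{S}$, as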
[TABLE]
with , and , or
[TABLE]
with , $\widetilde{w}(\theta)=\big[1/F(\theta),w_{1}(\theta)/F(\theta),\ldots,w_{s}(\theta)/F(\theta)\big]^{\top}$ and , , , provided . This implies that forms a -parameter -family as in (36) or in (37). However, as before, the characterizing entities when we view it as a member of are not the same as when we view it as .
- (c)
A regular may not be a regular : Notice that the number of ’s (and ’s) is increased when we expressed any member of a as an . Thus in general, (36) or (37) need not define a regular -family even if (19) defines a regular -family. This can be seen in the following example. Consider the 1-dimensional Student distributions with unit variance and :
[TABLE]
where is the normalizing factor which is independent of the unknown parameter . This can be viewed as a regular -family as
[TABLE]
with , , and . Observe that (38) can be re-written as an -family as
[TABLE]
with , , , , and . However, this does not define a regular as number of ’s (which is two) is not equal to the number of unknown parameters (which is one).
- (d)
The normalizing factor in a -family may take negative values: Unlike the normalizing factor in , in an may take negative values for some (see Example 8 and [43, Ex. 3] for a comparison).
In the following, we find conditions under which a regular can be expressed as a regular .
Proposition 4
A regular -family as in Definition 1 with being a non-zero constant also forms a regular -family characterized by the same functions and ’s, if for and one of the following conditions holds.
- (a)
is identically a constant, or
- (b)
, , are linearly independent.
Proof 2
Consider a regular -family with being identically a constant. Then from Definition 1, for , we have
[TABLE]
(39) can be re-written as
[TABLE]
where $S(\theta):=1+[F(\theta)\big/h]$. Comparing (40) with (31) we see that ’s form an -family characterized by , . This family is regular if are linearly independent. Let
[TABLE]
for some scalars , . Using the value of , we get
[TABLE]
If is identically a constant then , since are linearly independent. Otherwise also , if , are linearly independent.
In the view of above proposition, we now show that Student distributions also form a regular -family.
Corollary 5
Student distributions for (that is, ) form a regular -family.
Proof 3
Recall that, for , Student distributions form a regular -family with (Theorem 2). Hence, in view of Proposition 4, these also form a regular -family if , , ’s and ’s as described in Example 1 are linearly independent. To see this, let
[TABLE]
for some and , where and ’s and ’s are as defined in Theorem 2. Note that and . Hence taking partial derivative with respect to in (41), we get
[TABLE]
where . Since , from (42), we have , which, further upon taking partial derivative with respect to , implies . Thus (41) reduces to (27) of Theorem 2. Hence proceeding as in Theorem 2, we get for , . This completes the proof.
Remark 3
Student distributions for do not form a regular $\mathbb{M}^{(\alpha)}$-family as their support depends on the unknown parameters in this case.
Example 4
Wigner semi-circle distributions also form an $\mathbb{M}^{(\alpha)}$-family.
II-C The $\mathbb{E}^{(\alpha)}$-family
Next we define the parametric family $\mathbb{E}^{(\alpha)}$. This is motivated by the work of Kumar and Sason [41] in connection with the forward $D_\alpha$-projections on $\alpha$-linear families, where they dealt only with discrete distributions.
Definition 6
Let $h$, $w$ and $f$ be as in Definition 1. The family of probability distributions $\{p_\theta:\theta\in\Theta\}$ is said to form a $k$-parameter $\alpha$-exponential family or an $\mathbb{E}^{(\alpha)}$-family characterized by $h$, $w$ and $f$ if
$$p_\theta({\bf x})=Z(\theta)\big[h({\bf x})+w(\theta)^\top f({\bf x})\big]^{1/(1-\alpha)}\quad\text{for }{\bf x}\in\mathbb{S}_\theta \tag{45}$$
for some differentiable function $Z$. Here $Z(\theta)$ is the normalizing factor given by $Z(\theta)=1\big/\int_{\mathbb{S}}\big[h({\bf x})+w(\theta)^{\top}f({\bf x})\big]^{1/(1-\alpha)}d{\bf x}$.
The family is said to be regular if, along with (i)-(iii) of Definition 1, also the functions are linearly independent on . Further, it is said to be canonical if for . The natural parameter space in this case is given by the set of all $\theta$ such that $\big[h({\bf x})+w(\theta)^{\top}f({\bf x})\big]^{1/(1-\alpha)}>0$ on $\mathbb{S}$ and $\int_{\mathbb{S}}\big[h({\bf x})+w(\theta)^{\top}f({\bf x})\big]^{1/(1-\alpha)}d{\bf x}<\infty$.
Observe that, (45) with forms a -exponential family for studied in [51, 52, 48] (see also the references therein). However, if is not identically a constant, these two families are not the same.
Remark 4
Connection between the $\mathbb{M}^{(\alpha)}$ and $\mathbb{E}^{(\alpha)}$ families:
- (a)
Observe that (45) can be re-written, for , as
[TABLE]
Let . Thus an -family can be expressed as an -family characterized by the same entities, and vice-versa. This is also discussed in [64] for the specific case .
- (b)
and families are related through an escort transformation. When , such escort transformations are studied in the context of non-extensive statistical physics [65, 51]. Karthik and Sundaresan [40, Theorem 2] derived this connection for discrete, canonical families. We now extend this to the more general and families as in (31) and (45).
Lemma 7
Let . The map $p\mapsto p^{(\alpha)}$ establishes a one-to-one correspondence between an $\mathbb{E}^{(\alpha)}$-family characterized by $h$, $w$ and $f$, and the $\mathbb{M}^{(1/\alpha)}$-family characterized by the same entities, where $p^{(\alpha)}({\bf x})=p({\bf x})^{\alpha}\big/\int p({\bf y})^{\alpha}d{\bf y}$ is the $\alpha$-scaled measure (or the escort measure) associated with $p$.
Proof 4
For any characterized by and , from (45) we have, for ,
[TABLE]
where and $Z^{\prime}(\theta)=Z(\theta)^{\alpha}\big/\|p_{\theta}\|^{\alpha}$. Hence characterized by the same functions and . So, the mapping is well-defined. The map is one-one, since it is easy to see that if for some then . To verify it is onto, let be arbitrary. Then, for ,
[TABLE]
which implies
[TABLE]
and hence
[TABLE]
Thus and so for some . It is now easy to show that . Thus for any characterized by and , there exists characterized by the same functions such that . Hence the mapping is onto.
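The escort map and its inverse are easy to experiment with on a finite support. The sketch below is a toy check under stated assumptions: it takes a one-parameter family of the form $Z\,[h+wf]^{1/(1-\alpha)}$, applies the escort map with exponent $\alpha$, and compares the result with a family of the form $Z'\,[h+wf]^{1/(\beta-1)}$ for $\beta=1/\alpha$. The particular `h`, `f`, `w` and `alpha` are made-up values, and the power-law form with exponent $1/(\beta-1)$ is assumed to be the one used for the power-law family in the text.

```python
def escort(p, alpha):
    """Escort (alpha-scaled) distribution: p^alpha / sum(p^alpha)."""
    weights = [pi ** alpha for pi in p]
    total = sum(weights)
    return [wi / total for wi in weights]

def exponent_family(h, f, w, exponent):
    """Normalized [h_i + w f_i]^exponent on a finite support."""
    raw = [(hi + w * fi) ** exponent for hi, fi in zip(h, f)]
    z = sum(raw)
    return [r / z for r in raw]

# Hypothetical characterizing functions on a four-point support.
h = [1.0, 1.0, 1.0, 1.0]
f = [0.0, 1.0, 2.0, 3.0]
w = 0.4
alpha = 0.5
beta = 1.0 / alpha

p_exp = exponent_family(h, f, w, 1.0 / (1.0 - alpha))  # alpha-exponential form
p_pow = exponent_family(h, f, w, 1.0 / (beta - 1.0))   # power-law form, beta = 1/alpha

image = escort(p_exp, alpha)          # escort of the alpha-exponential member
round_trip = escort(image, 1.0 / alpha)  # escort with 1/alpha undoes the map
print(image, p_pow, round_trip)
```

The escort of the first family member coincides with the member of the second family that has the same `h`, `f` and `w`, and applying the escort map with exponent `1 / alpha` recovers the original distribution, which is the one-to-one property used in the proof.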
We now find the -scaled Student distributions which form an -family, in view of Lemma 7.
Example 5 (Cauchy distributions)
Let us consider the -dimensional Student distributions as in (21). The -scaled measure of is given by
[TABLE]
where
[TABLE]
is the normalizing factor and . Observe that is a valid density function for and it has full support for . Notice that in (47) can be re-written, for , as
[TABLE]
where . Using the notations , and , for , we have
[TABLE]
where $\beta\in(d/(d-2),0)\cup(0,1)\cup(1,(d+2)/d)$ for and $\beta\in(-\infty,0)\cup(0,1)\cup(1,(d+2)/d)$ for . Comparing (48) with (45), we see that ’s form a -parameter -family with
[TABLE]
where
[TABLE]
Some special cases of (48) include the following:
- (a)
The usual -dimensional Cauchy distributions correspond to .
- (b)
The generalized Cauchy distributions studied in [57] correspond to and .
- (c)
The multivariate truncated generalized Cauchy distributions studied in [1, Eq. (2.3)] correspond to where equals to the in their paper and .
While studying the diffusion problem under Lévy distributions, Prato and Tsallis [55, Eq. (10)-(11)] found (48) as the maximizer of Rényi (or Tsallis) entropy subject to linear constraints on the escort measure of the distribution. In [66, 52, 34, 48], these distributions were studied as $q$-Gaussian distributions. However, we shall call them simply Cauchy distributions with location parameter $\mu$ and scale parameter $\Sigma$.
Observe that the functions and in Cauchy distribution as in (48) are the same as the ones in Student distribution (33). Thus by a similar argument as described in Corollary 5, we can show that Cauchy distributions form a -parameter regular -family for $\beta\in(1,(d+2)/d)$. Note that for $\beta\notin(1,(d+2)/d)$, they do not define a regular family because, in this case, the support depends on the unknown parameters.
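The escort relationship between Student and Cauchy distributions can be seen in the simplest textbook case. The sketch below uses the standard one-dimensional $t$-density with `nu` degrees of freedom and the standard Cauchy density with a scale parameter, not the parametrization of (21) and (48); the choice `nu = 3` and the evaluation points are illustrative. Raising the $t$-density to the power $2/(\nu+1)$ should give something proportional to a Cauchy density with scale $\sqrt{\nu}$.

```python
import math

def student_t(x, nu):
    """Standard Student-t density with nu degrees of freedom."""
    c = math.gamma((nu + 1.0) / 2.0) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2.0))
    return c * (1.0 + x * x / nu) ** (-(nu + 1.0) / 2.0)

def cauchy(x, scale):
    """Cauchy density with location 0 and the given scale."""
    return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))

nu = 3.0
alpha = 2.0 / (nu + 1.0)  # escort exponent that turns the t tail into a Cauchy tail
xs = [-10.0, -2.0, -0.5, 0.0, 1.0, 3.0, 25.0]

# If p^alpha is proportional to the Cauchy density, these ratios are all equal.
ratios = [student_t(x, nu) ** alpha / cauchy(x, math.sqrt(nu)) for x in xs]
print(ratios)
```

A constant ratio means that, after normalization, the escort of the $t$-density is exactly the Cauchy density with scale $\sqrt{\nu}$.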
Example 6
Consider the Student distributions as in (33). In view of Remark 4(a), these form an -family, where (that is, when , and otherwise), characterized by the same functions as in (34). Student distributions as are studied in the literature, for example, in [52, 48]. Observe that, when , they indeed form a regular as this corresponds to in their form.
Remark 5
Consider an exponential family of probability distributions where, for ,
[TABLE]
Then each member of this family can also be re-written in any of the following equivalent forms:
[TABLE]
where . Analogous to (49), (50) and (51), respectively, the probability distributions in , and -families can be expressed, for , as
[TABLE]
where the -exponential function is defined as
[TABLE]
The $\alpha$-exponential function coincides with the usual exponential function as $\alpha\to 1$. Hence the families $\mathbb{B}^{(\alpha)}$, $\mathbb{M}^{(\alpha)}$ and $\mathbb{E}^{(\alpha)}$ coincide with the usual exponential family as $\alpha\to 1$. Thus these three power-law families can be seen as generalizations of the exponential family. These power-law families are sometimes known as deformed exponential families (see, for example, [48]).
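The convergence to the ordinary exponential can be checked directly. The sketch below assumes the commonly used deformed exponential $e_\alpha(u)=[\max\{1+(1-\alpha)u,0\}]^{1/(1-\alpha)}$ for $\alpha\neq 1$; whether this agrees with the displayed definition in every detail is an assumption, as is the convention adopted here for a non-positive base.

```python
import math

def alpha_exp(u, alpha):
    """Deformed exponential [max(1 + (1 - alpha) u, 0)]^(1 / (1 - alpha)).
    For alpha == 1 it is the ordinary exponential."""
    if alpha == 1.0:
        return math.exp(u)
    base = 1.0 + (1.0 - alpha) * u
    if base <= 0.0:
        # Convention assumed for this sketch: 0 when the exponent is positive,
        # +infinity when the exponent is negative.
        return 0.0 if alpha < 1.0 else math.inf
    return base ** (1.0 / (1.0 - alpha))

u = 0.7
for alpha in (0.5, 0.9, 0.999, 1.0001, 1.1, 2.0):
    print(alpha, alpha_exp(u, alpha), math.exp(u))
```

As `alpha` approaches 1 from either side, the printed values approach `math.exp(u)`.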
III Projection theorems for general power-law families
In this section, we extend the projection theorems of , and divergences to the general power-law families by directly solving the associated estimating equations. We also find conditions under which the new projection theorems reduce to the ones as in the canonical case. We shall begin by recalling the projection theorems known in the literature. In the following, assume that the families are canonical and regular with support being finite and the parameter space being the natural parameter space. Let denote the empirical distribution of sample .
- (a)
Projection theorem for -divergence: Consider a -family characterized by and , where is a probability distribution with support . The reverse -projection of on satisfies
[TABLE]
where , for and denotes expected value with respect to . (See Theorem 23 and Remark 12). Ohara and Wada [52, Prop. 3] established (53) for the continuous case with being identically a constant. Csiszár and Matúš [26] studied this for the general Bregman divergences. 2. (b)
Projection theorem of -divergence: Consider an -family characterized by and , where is a probability distribution with support . The reverse -projection of on satisfies
[TABLE]
where . This is due to [43, Theorem 18 and Theorem 21].
- (c)
Projection theorem of -divergence: Consider an -family characterized by and , where is a probability distribution with support . The reverse -projection of on satisfies
[TABLE]
where denotes expectation with respect to , and are respectively averages of and with respect to . This is due to [41, Theorem 6].
Before we turn to the main results of this section we prove the following lemma and corollary that establish a connection between the generalized Hellinger and the Jones et al. estimating equations.
Lemma 8
The estimating equations (7) and (10) are the same up to the transformation when is discrete. In the continuous case the same is true between (8) and (10) provided the empirical distribution is replaced by a continuous estimate and .
Proof 5
We present the proof for the discrete case. The proof for the continuous case follows by a similar argument if we replace by throughout in the proof.
The generalized Hellinger estimating equation (7) can be re-written as
[TABLE]
since . This can further be re-written as
[TABLE]
Observe that
[TABLE]
Hence
[TABLE]
where . Plugging (57) in (56),
[TABLE]
This is the same as the Jones et al. estimating equation (10) with , and , respectively, replaced by , and .
This, together with Lemma 7, establishes the following equivalence between the Jones et al. estimation and generalized Hellinger estimation.
Corollary 9
Suppose that is an -exponential family characterized by where all the distributions have a common support . Then, under the assumptions of Lemma 8, solving the generalized Hellinger estimation problem under -family is equivalent to solving the Jones et al. estimation problem under the -family characterized by the same entities.
The following result extends the already known projection theorems of the divergences , and to the general power-law families as defined in Section II.
Theorem 10
Let be i.i.d. samples. Let be one of the families , , or and assume that support of does not depend on the parameter space . In (c), assume also that for . Then the following hold.
- (a)
Basu et al. estimator under must satisfy
[TABLE]
- (b)
Jones et al. estimator under must satisfy
[TABLE]
- (c)
Generalized Hellinger estimator under must satisfy
[TABLE]
Here $\partial_{r}[w(\theta)]:=\big[\tfrac{\partial}{\partial\theta_{r}}[w_{1}(\theta)],\ldots,\tfrac{\partial}{\partial\theta_{r}}[w_{s}(\theta)]\big]^{\top}$ for . In (c), and , where is the empirical distribution in the discrete case and a suitable continuous estimate of in the continuous case.
Proof 6
(a) If then from Definition 1, for ,
[TABLE]
Taking derivative with respect to for ,
[TABLE]
The Basu et al. estimating equation (9) can be re-written as
[TABLE]
Substituting (61) in (62), we get (58).
(b) If , using Definition 3, for ,
[TABLE]
Taking derivative with respect to for , we get
[TABLE]
Substituting this in the Jones et al. estimating equation (10),
[TABLE]
(c) This follows from (b) and Corollary 9.
Remark 6
- (a)
Jones et al. and generalized Hellinger estimation under : Recall that a -family can be expressed as an -family as in (36) or (37). This implies that the Jones et al. estimator under -family satisfies (59) with , replaced by as defined in (36) or (37). Further, in view of Remark 4(a), (36) or (37) is also an -family where . Thus the -generalized Hellinger estimator under -family must satisfy (60) with , and replaced, respectively, by and .
- (b)
Basu et al. and generalized Hellinger estimation under : An -family can be expressed as a -family as in (35). Thus the Basu et al. estimator under an -family must satisfy (58) with and replaced by as defined in (35). Further, in view of Remark 4(a), the -generalized Hellinger estimator under -family must satisfy (60) with replaced by .
- (c)
Basu et al. and Jones et al. estimation under : An -family can be expressed as an -family as in (46), and hence can be expressed as a -family as in (35) with replaced by . Thus the -Jones et al. estimator under -family satisfies (59). Similarly the -Basu et al. estimator under -family satisfies (58) with and replaced by as in (35).
We now show that, when the families are regular, the projection equations in Theorem 10 reduce to the one as in the canonical case.
Corollary 11
The estimating equations (58), (59) and (60), respectively, reduce to (53), (54) and (55) if the underlying families are regular.
Proof 7
Let us first observe that for a regular family the matrix is non-singular for . To see this, let
[TABLE]
for some scalars and for each . Then
[TABLE]
for some constant . Now linear independence of , implies that . Consider a regular -family. Then from (58), we have
[TABLE]
Since is non-singular, (63) reduces to . Again (59) can be re-written as
[TABLE]
For a regular -family, this reduces to
[TABLE]
This implies
[TABLE]
That is,
[TABLE]
Hence
[TABLE]
Substituting this back in (64),
[TABLE]
In a similar fashion, the result for regular -family can be shown.
Theorem 10 fails if the support of the underlying family depends on the parameters. We show this by an example in Section IV-B.
The Basu et al. estimating equation (9) differs from the Jones et al. estimating equation (10) in that the weights in the latter are normalized. Much research has been done to compare these two methods (for example, see [38]). We saw in Section II that a regular -family can be viewed as a regular -family under some conditions. In the following, we show that the two estimations coincide on a regular -family with being a non-zero constant (or on a regular -family with being a non-zero constant).
Theorem 12
For a regular -family with being identically a non-zero constant, Basu et al. estimating equation (58) and Jones et al. estimating equation (59) are the same.
Proof 8
Consider the -family as in (39). If it is regular, from Corollary 11, the Basu et al. estimating equation is given by
[TABLE]
We now show that the Jones et al. estimating equation (59) for this family is also the same. Recall that (39) can be written as an -family as in (40). The proof is divided into two parts.
- (i)
Suppose that is linearly independent with ’s. Then (40) forms a regular -family by Proposition 4. Therefore using Corollary 11, we see that the Jones et al. estimating equation (59) for (40) is the same as (65), since is identically a constant.
- (ii)
Next let us suppose that is linearly dependent with ’s. Then there exist scalars (not all zero) such that
[TABLE]
Then
[TABLE]
Using Theorem 10(b), the Jones et al. estimating equation (59) for (40) is given, for , by
[TABLE]
Substituting the value of , an easy calculation yields, for ,
[TABLE]
Using (66) and (67) in (68),
[TABLE]
where . Since are linearly independent, is non-singular. Using this, (69) becomes
[TABLE]
Proceeding as in Corollary 11,
[TABLE]
That is,
[TABLE]
Hence .
Corollary 13
For a regular -family as in (31) with being identically a non-zero constant, the Basu et al. and the Jones et al. estimating equations are the same if is linearly independent with ’s.
IV Applications: Generalized estimation under Student and Cauchy distributions
In this section we find Jones et al. estimators [38] for the parameters of Student distribution for and generalized Hellinger estimators [6] of Cauchy distribution for . For the estimation of Cauchy distributions we use the kernel density estimate for the empirical measure. We also find a robust estimator of the mean parameter of Student distribution for the case when .
IV-A Basu et al. [5] and Jones et al. [38] estimation under Student distributions
In Theorem 2 we saw that for $\alpha\in\big((d-2)/d,1\big)$ (that is, for ) Student distributions form a -parameter regular -family with and . Hence to find the Basu et al. estimators of the parameters, its mean and variance should be finite. However, as we saw in Example 1, (1) does not have finite mean and variance for $\alpha\in\big((d-2)/d,d/(d+2)\big]$. Hence we restrict ourselves to Student distributions for $\alpha\in\big(d/(d+2),1\big)$. The mean and the covariance of a Student distribution for $\alpha\in\big(d/(d+2),1\big)$ are given by and respectively. Let be an i.i.d. sample where each for . Suppose also that the true distribution is a Student distribution as in (1). Using Corollary 11, the Basu et al. estimators of and are given, for and , by
[TABLE]
where .
Next consider the Student distributions as in (33) with $\alpha\in\big(d/(d+2),1\big)$. We saw that it forms a -parameter regular -family with . Hence, from Theorem 12, the Jones et al. estimators for and are the same as the Basu et al. estimators as in (70). In [30] and [33, Theorem 5] this was solved directly from the estimating equation (59). We summarize the above results in the following.
Theorem 14
For , the Basu et al. and the Jones et al. estimators of mean and covariance parameters of a -dimensional Student distribution as in (21) are the same and are given by (70).
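As the concluding section notes, these estimators coincide with the MLE of the mean and covariance of a normal distribution. Under that reading, (70) is the sample mean together with the sample covariance normalized by n (not n - 1). The Python sketch below computes both; the function name `mean_and_covariance` and the list-of-lists data layout are illustrative assumptions.

```python
def mean_and_covariance(samples):
    """Sample mean and 1/n-normalized sample covariance of d-dimensional data.

    `samples` is a list of n points, each a list of d floats (assumed layout).
    """
    n = len(samples)
    d = len(samples[0])
    mean = [sum(x[r] for x in samples) / n for r in range(d)]
    cov = [
        [
            sum((x[r] - mean[r]) * (x[s] - mean[s]) for x in samples) / n
            for s in range(d)
        ]
        for r in range(d)
    ]
    return mean, cov


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
    mu_hat, sigma_hat = mean_and_covariance(data)
    print(mu_hat)     # estimated mean vector
    print(sigma_hat)  # estimated covariance matrix
```

These closed forms require no iteration, in contrast to the MLE for Student distributions.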
Remark 7
It can be shown that, as , Student distributions coincide with a normal distribution with mean and covariance matrix [37]. Similarly, as , the Basu et al. estimating equation or the Jones et al. estimating equation becomes the ML estimating equation (5). Thus there is a continuity of the generalized estimators of mean and covariance parameters as . This suggests that when the samples are from a Student distribution with sufficiently large , the MLE of its parameters can be approximated by a generalized estimator of the respective parameters. (Note that the MLE of Student distributions does not have a closed-form solution, and one must resort to numerical methods to compute it [2, 29].) Simulations suggest that, for , generalized estimators (70) are close to the MLE even for small sample size.
For , the support of Student distributions depends on the parameters. Thus Theorem 10 cannot be used to find the estimators. However, in this case, one can find the estimators by maximizing the respective likelihood function as described in the following.
IV-B Jones et al. estimation [38] under Student distributions for
For simplicity we deal only with the one-dimensional case. Suppose that is an i.i.d. sample where . Suppose also that the true distribution is a Student distribution with some known (that is, ) and variance, say :
[TABLE]
where is the normalizing factor. The support of is given by , where $c_{\alpha}:=\sqrt{-1/b_{\alpha}}$. (Recall that for ). Observe that (71) defines an -family whose support depends on the unknown parameter. We now show that the Jones et al. estimator of could be different from . Since the support of depends on , we cannot apply Theorem 10. Hence we resort to the maximization of the associated likelihood function:
[TABLE]
The likelihood function, for (71), becomes
[TABLE]
where denotes the indicator function and , . Observe that the maximizer of is the same as the maximizer of
[TABLE]
(Note that the Basu et al. likelihood function (13) for the model in (71) also reduces to (72); hence both the estimators are the same in this case.) It is clear from (72) that is positive if and only if lies in at least one for . Thus, to find the maximizer of , we only need to consider the cases when lies in one of the ’s.
Consider . If is disjoint from all other for , then equals for . Similarly, if but all other ’s for , are disjoint from , then the value of in is given by:
[TABLE]
In general, if for some satisfy and all other ’s for are disjoint from , then can be divided into disjoint sub-intervals and in each of these sub-intervals the value of is given, for $\mu\in\big(\cap_{i=1}^{j}I_{i}\big)\setminus\big(\cup_{i=j+1}^{k}I_{i}\big),~j\in\{1,\ldots,k\}$, where , by
[TABLE]
Let us define . Then for , we can write
[TABLE]
Let . Then proceeding as above, for , we have
[TABLE]
In general, let $I_{k}^{\prime}:=I_{k}\setminus\big(\cup_{i=1}^{k-1}I_{i}\big)$, . Then for ,
[TABLE]
Hence for , , we can divide into at most sub-intervals $I_{k}^{j}:=\big[\big(I_{k}^{\prime}\cap(\cap_{i=k}^{j}I_{i})\big)\setminus\big(\cup_{i=j+1}^{n}I_{i}\big)\big]$, , such that the indicator functions in (74) will be positive for in any of these sub-intervals. For example, in Figure 1 we considered a case where can be divided into three disjoint sub-intervals, namely , and , and can be divided into two disjoint sub-intervals and , and so on. The maximizer of for in each of these sub-intervals can be found in the following way.
Let and be such that . Then the following possible cases appear:
- (i)
and :
[TABLE]
- (ii)
and :
[TABLE]
- (iii)
and :
[TABLE]
- (iv)
and :
[TABLE]
Also from (74), for , . Since
[TABLE]
is monotone increasing for , and monotone decreasing otherwise. Thus the local maximizer of in any non-empty for is given by
[TABLE]
where and are respectively the left and right ends of the sub-interval as in (i)-(iv).
Figure 2 shows two different cases of maximizer of for ( here) in the sub-interval .
Observe that, in this process we divide into a finite number of non-empty disjoint sub-intervals such that is positive if and only if lies in one of these sub-intervals. In each of these sub-intervals, has a unique maximizer. Thus we have a finite number of local maximizers of in , and hence the global maximizer is one among these local maximizers. This implies that the global maximizer can be different from the sample mean.
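The search over local maximizers can be sketched in Python. The sketch assumes that the reduced objective (72) has the form ℓ(μ) = Σ_i [1 − (x_i − μ)²/h²] 1(μ ∈ I_i), with I_i = [x_i − h, x_i + h] and h standing for the half-width of the support of the model; this form, the names `local_maximizers` and `global_maximizer`, and the sample values are assumptions made for illustration. Instead of the bookkeeping with the sets I_k^j, the sketch cuts the real line at the sorted end points of the intervals I_i, which yields the same kind of pieces: on each piece the set of active samples is fixed, the objective is a concave quadratic, and its maximizer is the mean of the active samples clipped to the piece.

```python
def local_maximizers(xs, half_width):
    """Return one (mu, objective value) candidate per piece with a fixed active set."""

    def objective(mu):
        # Assumed form of the reduced objective: truncated quadratic terms.
        return sum(max(0.0, 1.0 - ((x - mu) / half_width) ** 2) for x in xs)

    # End points of the intervals I_i = [x_i - h, x_i + h] split the line into pieces.
    points = sorted({x - half_width for x in xs} | {x + half_width for x in xs})
    candidates = []
    for left, right in zip(points, points[1:]):
        mid = 0.5 * (left + right)
        active = [x for x in xs if abs(x - mid) < half_width]
        if not active:
            continue  # the objective is identically zero on this piece
        centre = sum(active) / len(active)
        mu = min(max(centre, left), right)  # clip the active-set mean to the piece
        candidates.append((mu, objective(mu)))
    return candidates


def global_maximizer(xs, half_width):
    """Pick the best local maximizer."""
    return max(local_maximizers(xs, half_width), key=lambda c: c[1])[0]


if __name__ == "__main__":
    sample = [0.1, -0.2, 0.3, 0.0, 10.0]   # hypothetical data with one outlier
    print(global_maximizer(sample, 2.0))    # ignores the outlier
    print(global_maximizer(sample, 100.0))  # wide support: the sample mean
```

With a narrow support the outlier falls outside every piece that contains the bulk of the data, so the estimate stays near the mean of the remaining points; with a very wide support all samples are active and the estimate is the sample mean.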
Remark 8
Observe that when . Thus for , the length of the intervals for increases, and hence all the indicator functions in (72) become positive for any . This implies that the maximizer of is the usual sample mean . This is consistent with the case , as the MLE of the mean parameter of a normal distribution is . Recall that the likelihood function coincides with the usual likelihood function and the Student distribution coincides with the normal distribution as .
To demonstrate the algorithm, we generated the following random sample from the mixture , where is the Student distribution with , and :
[TABLE]
Consider . Then for all . Using the formula (79), we have the following local maximizers of in each of these sub-intervals , , of :
[TABLE]
Next consider . Then the indicator functions in (IV-B) are positive only when , that is, only for . Using (79), we get the maximizer of in is . Similarly, we have only one maximizer in each for and they respectively are:
[TABLE]
Next consider . Then for . We then have four local maximizers of in four sub-intervals of , namely for , and they respectively are , , . Similarly for , has only one local maximizer in each and they respectively are .
Comparing the values of at each of the local maximizers, we get as the global maximizer of . Hence the Jones et al. (or Basu et al.) estimator for is , which is different from . Also, the MLE and the MLE after deleting the outliers are, respectively, 2.173 and 0.1721. We repeated this exercise with being a Student distribution with mean zero and (). The results are shown in Table I. We observe that the Jones et al. (or Basu et al.) estimator is close to the true parameter (also to the ML-estimator without outliers) as gets close to from the right.
IV-C Generalized Hellinger estimation under Cauchy distributions
Consider the Cauchy distributions as in (48). Here we find the generalized Hellinger estimators for their location and scale parameters using a kernel density estimate for the sample empirical measure. Let , be an i.i.d. sample. Let be a suitable continuous density estimator of the empirical measure . When $\beta\in\big(1,(d+2)/d\big)$, we saw that Cauchy distributions form a regular -family. Thus in this case we use Corollary 11 to estimate their parameters. But for $\beta\notin\big(1,(d+2)/d\big)$, the support of this distribution depends on the unknown parameters. In this case one can estimate the parameters by maximizing the associated likelihood function as we did for Student distributions in Section IV-B.
Let $\beta\in\big(1,(d+2)/d\big)$. The characterizing entities of Cauchy distributions as a regular are respectively , and (see Example 5). Using (55), we therefore have the following estimating equations
[TABLE]
Let us find and . In Example 5, we saw that . Hence . Thus
[TABLE]
where and . Using this in (81), we get
[TABLE]
for , . Thus generalized Hellinger estimators of the location and scale parameters are
[TABLE]
for and . We summarize these in the following theorem.
Theorem 15
Let $\beta\in\big(1,(d+2)/d\big)$ and be a suitable continuous estimate of the sample empirical measure. Then the generalized Hellinger estimators of the location and scale parameters of a -dimensional Cauchy distribution are given by (84).
Remark 9
In Example 6, we saw that Student distributions form a regular -family for $\alpha^{\prime}\in\big(1,(d+2)/d\big)$. Thus one can do the -generalized Hellinger estimation on Student distributions as well.
Notice that the estimators in (84) involve a continuous estimate of . In the following we present examples where we use ‘kernel density estimation’ to find such and use it to find the estimators. In the literature the commonly used kernel to estimate the -dimensional empirical measure is of the following form (see [61]):
[TABLE]
where is a symmetric distribution on and is a sequence of real numbers with suitable properties, called bandwidth. These properties of the kernel and bandwidth influence the performance of the estimators greatly. There is no general theory in the literature to choose the right continuous estimate for a given problem. However authors like Beran [7], Tamura and Boos [61], and Simpson [58] imposed conditions on and so that the estimators perform better in their specific setting. We shall use the following two kernels, namely uniform kernel and Epanechnikov kernel, with bandwidth for to find the estimators. Both the kernels and satisfy the following conditions which guarantee the convergence of to the true density [61, Lem. 3.1] (see also [6, Sec. 3.3] and the references therein):
- (i)
is symmetric about the origin and has compact support.
- (ii)
and .
We first use the -dimensional uniform kernel which is defined as follows.
[TABLE]
Let us denote by $\big[\textbf{X}_{i}-n^{-1/2d},\textbf{X}_{i}+n^{-1/2d}\big]$ for and call them rectangles. Assume that all these rectangles are disjoint (these are actually disjoint for sufficiently large ). Then from (85) we have
[TABLE]
That is,
[TABLE]
Thus is the uniform distribution on $\bigcup_{i=1}^{n}\big[\textbf{X}_{i}-n^{-1/2d},\textbf{X}_{i}+n^{-1/2d}\big]$. This implies that the -scaled distribution is the same as . Therefore, we have
[TABLE]
[TABLE]
and
[TABLE]
Hence the estimators for the parameters are given by
[TABLE]
for and , where if and equals zero otherwise.
Next we find the estimators using the following -dimensional Epanechnikov kernel
[TABLE]
where denotes the Euclidean norm. We thus have
[TABLE]
This yields the same estimators for and as in (86) except for the correction term, which differs only up to a scale factor. For example when , changes to for , to for , to for , and so on.
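To make the two smoothing choices concrete, the following Python sketch builds one-dimensional kernel density estimates of the standard form g_n(x) = (1/(n h)) Σ_i K((x − X_i)/h) with the uniform and Epanechnikov kernels, using the bandwidth h = n^{-1/(2d)} suggested by the rectangles above with d = 1. The helper names, the sample values and the simple midpoint-rule integrator are illustrative assumptions; the sketch only checks that each estimate is a density whose mean equals the sample mean, and does not compute the estimators in (86).

```python
def uniform_kernel(u):
    """Uniform density on [-1, 1]."""
    return 0.5 if abs(u) <= 1.0 else 0.0


def epanechnikov_kernel(u):
    """Epanechnikov density on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def default_bandwidth(n, d=1):
    """Bandwidth n^(-1/(2d)), as read off from the rectangles in the text."""
    return n ** (-1.0 / (2 * d))


def kernel_density(x, samples, kernel, bandwidth):
    """Kernel density estimate at the point x (one-dimensional case)."""
    n = len(samples)
    return sum(kernel((x - xi) / bandwidth) for xi in samples) / (n * bandwidth)


def midpoint_integral(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width


if __name__ == "__main__":
    data = [-1.3, 0.2, 0.9, 2.5]  # hypothetical sample
    h = default_bandwidth(len(data))
    for kernel in (uniform_kernel, epanechnikov_kernel):
        mass = midpoint_integral(lambda x: kernel_density(x, data, kernel, h), -3.0, 4.0)
        mean = midpoint_integral(lambda x: x * kernel_density(x, data, kernel, h), -3.0, 4.0)
        print(kernel.__name__, mass, mean)
```

Both kernels are symmetric, so smoothing leaves the mean unchanged and affects only higher moments, which is in line with the two kernels giving estimators that differ only in a correction term.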
Observe that, for $\beta\in\big[(d+4)/(d+2),(d+2)/d\big)$ (that is, ), Cauchy distributions do not have finite mean and variance. Hence the estimates (86) oscillate as one increases the sample size . In this case some other smoothing techniques or changing the bandwidth in Theorem 15 may produce a better estimator. A general theory for this is part of our future work. However for $\beta\in\big(1,(d+4)/(d+2)\big)$, these estimators are close to the true parameters. In Table II, we summarize the results of a simulation study where we find the estimators by taking the average over 25 different sets of random samples of size drawn from a standard Cauchy distribution using both the kernels for different degrees of freedom. We observe that, for , the estimators are bounded, and as increases, their performances get better.
V Summary and concluding remarks
Projection theorems of Jones et al. () and Hellinger () divergences tell us that the reverse projection, respectively, on the power-law families and turns out to be a forward projection on a “simpler” (linear or -linear) family which, in turn, reduces to a linear problem on the underlying probability distribution. The applicability of these projection theorems known in the literature was limited as they dealt only with discrete and canonical models. In this work, we first generalized the associated power-law families to a more general set-up including the continuous case. We observed that these two families are related through an escort transformation, apart from the transformation studied in [64]. We then introduced the notion of regularity for these power-law families analogous to the concept of regular exponential family (or full rank family). This distinguishes these families from similar families studied in the literature, namely, the -exponential class defined in [51], the class defined in [26] and so on. We then extended the projection theorems of and to these general forms of the power-law families by solving the respective estimating equations. We observed that, for regular families, the new estimating equations coincide with the respective projection equations, similar to the ones in the canonical case. We also observed that both the estimating equations were characterized by some specific statistics of the samples. Such a characterization is well-known in the literature for the pair MLE and regular exponential family (see, for example, [14, pp. 149–150]).
We finally showed that the Student and Cauchy distributions respectively form a regular and , and they are the escort distributions of each other. We then applied the above projection theorems to find generalized estimators for their parameters. Interestingly, both the Basu et al. and Jones et al. estimators for the mean and covariance parameters of a -dimensional Student distribution for are the same as the MLE of the respective parameters of a normal distribution, where is the degrees of freedom. We also found a more general class of distributions that includes Student distributions for which both the estimations are the same (Theorem 12). In [30, Eq. (38)], it was shown that sample mean and sample variance are still the generalized estimators (Jones et al. or Basu et al.) for compactly supported Student distributions (that is, ), but with the assumption that all samples are from the true distribution. We showed that, in the presence of outliers, the generalized estimator for the mean parameter might differ from the sample mean (Section IV-B). Next we found a class of generalized Hellinger estimators for the location and scale parameters of Cauchy distributions that involve a continuous estimate for the sample empirical measure. In particular, we found the estimators by using the uniform and Epanechnikov kernel density estimates for the empirical measure. We summarized this by a simulation study where we observed that these estimators are close to the true value of the parameters of Cauchy distributions when their mean and variance are finite.
It is well-known that the MLE for Student or Cauchy distributions does not have a closed form solution. To overcome this, standard iterative methods such as Newton-Raphson, Gauss-Newton, EM are used in the literature [2, 29]. However, the sequence of estimators in these iterative methods may converge to a local maximum and the rate of convergence is also slow [2, 45]. Later some generalized iterative methods such as ECM, ECME were proposed, for example, in [45], where the rate of convergence was made faster than the previous methods. But again, they converge only to a local maximum. These difficulties can be overcome by some of the generalized estimators that we studied in this paper as they are not only robust but also have closed form solution.
Acknowledgements
Atin Gayen is supported by an INSPIRE fellowship of the Department of Science and Technology, Govt. of India. Part of this work was carried out when the authors were with the Indian Institute of Technology Indore. The authors would like to thank Professor Arup Bose for his constructive comments. The authors would also like to thank Professor Michel Broniatowski for the discussions they had with him during his visit to India through the VAJRA programme of Govt. of India. The authors also thank the Editor, Associate Editor and the referees for their valuable comments that improved the presentation of the paper.
Appendix: Projection theorem of density power divergence
The Projection theorem and the Pythagorean property of the more general class of Bregman divergences were established by Csiszár and Matúš [26] using tools from convex analysis. The density power divergence is a subclass of the Bregman divergences. However, it is not easy to extract the results for the -divergence from [26]. Ohara and Wada [52] also studied this by considering a specific form of the associated parametric family. In this section we derive the projection results for the -divergence in the discrete case using some elementary tools. We must point out that the geometry of -divergence is quite a natural extension of that of -divergence. Let be a finite alphabet set and be the space of all probability distributions on . Then for any , from (2), the -divergence in the discrete case can be written as
[TABLE]
Let us also recall the definitions of reverse and forward projections given in (15) and (16). For , we shall denote the support of as . For , is defined as the union of support of members of .
We now show the Pythagorean inequality of -divergence in connection with the forward projection on a non-empty closed, convex set (hence compact, since is finite). Thus the existence of the forward projection is always guaranteed, since is lower semi-continuous [26, Lem. 2.12]. In the following we assume that .
Theorem 16
Let be the forward -projection of on a closed and convex set . Then
[TABLE]
Further if , .
Proof 9
Let and define, for and ,
[TABLE]
Since is convex, . By mean-value theorem, for each ,
[TABLE]
From (87), we have
[TABLE]
Therefore (90) implies
[TABLE]
Hence, as , we have
[TABLE]
which implies
[TABLE]
If , then there exists and such that but . Hence if , then the left-hand side of (91) goes to as , which contradicts (91). This proves the claim.
Remark 10
If , in general, . [43, Ex. 2] serves as a counterexample here as well. It follows from the following fact. Since is finite, the -divergence can be written as
[TABLE]
where is the uniform distribution on and denotes the cardinality of . This implies
[TABLE]
where , the Rényi entropy of of order . That is, forward -projection of the uniform distribution on is the same as the maximizer of Rényi entropy on . The same is true when is replaced by or .
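The link with Rényi entropy can be checked numerically. The Python sketch below assumes the density power divergence in the discrete case has the form B_α(p, q) = (α/(1−α)) Σ p q^{α−1} − (1/(1−α)) Σ p^α + Σ q^α; this normalization is an assumption and may differ from (2) by a positive factor. With q uniform the first and last terms do not depend on p, so minimizing B_α(p, u) over p amounts to maximizing the Rényi entropy of order α. The function names and the example distributions are illustrative.

```python
import math


def density_power_divergence(p, q, alpha):
    """Assumed discrete form of the density power divergence, alpha != 1, q > 0."""
    cross = sum(pi * qi ** (alpha - 1.0) for pi, qi in zip(p, q))
    p_pow = sum(pi ** alpha for pi in p)
    q_pow = sum(qi ** alpha for qi in q)
    return alpha / (1.0 - alpha) * cross - p_pow / (1.0 - alpha) + q_pow


def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (alpha != 1), natural logarithm."""
    return math.log(sum(pi ** alpha for pi in p if pi > 0.0)) / (1.0 - alpha)


if __name__ == "__main__":
    u = [1.0 / 3.0] * 3
    p1 = [0.5, 0.3, 0.2]
    p2 = [0.8, 0.1, 0.1]
    for alpha in (0.5, 2.0):
        # The distribution with larger Rényi entropy is closer to the uniform one.
        print(alpha,
              density_power_divergence(p1, u, alpha), renyi_entropy(p1, alpha),
              density_power_divergence(p2, u, alpha), renyi_entropy(p2, alpha))
```

For α = 2 the assumed form reduces to the squared Euclidean distance between p and q, which gives a quick sanity check of the implementation.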
We will now present a situation when the equality holds in (88).
Definition 17
The linear family, determined by real valued functions on and real numbers , is defined as
[TABLE]
Theorem 18
Let be the forward -projection of on . The following hold.
- (a)
If then the Pythagorean equality holds, that is,
[TABLE]
- (b)
If and then the Pythagorean equality (94) holds.
Proof 10
(a) Let be as in (89). Since , there exists such that for . Hence, proceeding as in Theorem 16, for every , there exists such that
[TABLE]
Hence we get (92) with a reversed inequality. Thus we have equality in (92). Hence we have (94).
(b) Similar to (a).
When , equality in (94) does not hold in general. In the following we present an example where the equality in (94) does not hold.
Example 7
Let , and
[TABLE]
In view of Remark 10 and [43, Ex. 2], we see that is the forward -projection of the uniform distribution on . However there exists , say , that satisfies only the strict inequality in (88). The issue here is that .
We now find an explicit expression of the forward -projection in both the cases and separately.
Theorem 19
Let and let be a linear family of probability distributions as in (93).
- (a)
If , the forward -projection of on satisfies
[TABLE]
with , where are some scalars and is a constant.
- (b)
If , the forward -projection of on satisfies
[TABLE]
where , and are as in (a).
Proof 11
- (a)
The proof is similar to that for -divergence in [27, Th. 3.2]. The linear family in Definition 17 can be re-written as
[TABLE]
Let be the subspace of spanned by the vectors . Then every can be thought of as a -dimensional vector in . Hence is a subspace of that contains a vector whose components are strictly positive since and . It follows that is spanned by its probability vectors. From (92) we see that (94) is equivalent to
[TABLE]
This implies that the vector
[TABLE]
Hence
[TABLE]
for some scalars . This implies (95) for appropriate choices of and .
- (b)
The proof of this case is similar to that of -divergence [43, Th. 14(b)]. The optimization problem concerning the forward -projection is
[TABLE]
Hence by [8, Prop. 3.3.7], there exist Lagrange multipliers , and , respectively, associated with the above constraints such that, for ,
[TABLE]
Since
[TABLE]
(103) can be re-written as
[TABLE]
Multiplying both sides by and summing over all , we get
[TABLE]
For , from (105), we must have . Then, from (107), we have
[TABLE]
If , from (107) we get
[TABLE]
Combining (108) and (109) we get (96).
Theorem 19 suggests defining a parametric family of probability distributions that generalizes the usual exponential family. We call it a -family. First we formally define this family and then show an orthogonality relationship between this family and the linear family. As a consequence we will also show that the reverse -projection on a -family is the same as the forward projection on a linear family.
Definition 20
Let where for and where for be real valued functions on . The -parameter canonical family of probability distributions characterized by and is defined by where
[TABLE]
for some and is the subset of for which .
Remark 11
- (a)
Observe that -family is a special case of the family in [26, Eq. (28)] with and .
- (b)
The family depends on the reference measure only in a loose manner in the sense that any other member of the family can play the role of . The change of reference measure only corresponds to a translation of the parameter space. (This fact is true for the -family [43, Prop. 22].)
The following theorem and its corollary together establish an “orthogonality” relationship between the -family and the associated linear family.
Theorem 21
Let . Consider a -family as in Definition 20 and let be the corresponding linear family determined by the same functions and some constants as in (93). If is the forward -projection of on then we have the following:
- (a)
and
[TABLE]
- (b)
Further, if , then .
Proof 12
By Theorem 19, the forward -projection of on is in . This implies that . Hence it suffices to prove the following:
- (i)
Every satisfies (94) with in place of .
- (ii)
is non-empty.
We now proceed to prove both (i) and (ii).
(i) Let . As , this implies that there exists a sequence such that as . Since , we can write
[TABLE]
for some constants and . Now for any we have, from the definition of linear family, . Since , we also have . Multiplying both sides of (111) by and separately, we get
[TABLE]
and
[TABLE]
Combining the above two equations, we get
[TABLE]
As , the above becomes
[TABLE]
which is equivalent to (94).
(ii) Let be the forward -projection of on the linear family
[TABLE]
(see Figure 3).
By construction for any . Hence, since , we have . Since is also characterized by the same functions , we have for every . Hence limit of any convergent sub-sequence of belongs to . Thus is non-empty. This completes the proof.
Corollary 22
Let . Let and be characterized by the same functions . Then and
[TABLE]
Proof 13
By Theorem 21, we have . In view of Remark 11(b), notice that every member of has the same projection on , namely . Hence (94) holds for every . Thus we only need to prove (94) for every . Let . There exists such that . Hence for any , we have
[TABLE]
Since for a fixed , is continuous as a function from to , taking the limit as on both sides of (113), we have
[TABLE]
This completes the proof.
Theorem 21 does not hold, in general, for as shown in the following example.
Example 8
Let and be as in Example 7. Then the associated -family is given by
[TABLE]
where , and . Then we have
[TABLE]
where $p_{\theta}=\big[\frac{1}{4}+\frac{17\theta}{4},\ \frac{1}{4}+\frac{\theta}{4},\ \frac{1}{4}-\frac{7\theta}{4},\ \frac{1}{4}-\frac{11\theta}{4}\big]^{\top}$. If then . This implies , which is outside the range of . Hence .
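The last step of Example 8 rests on a value of the parameter falling outside the set for which $p_{\theta}$ is a probability vector. The sketch below is an illustrative check, not part of the paper: it takes only the four components of $p_{\theta}$ displayed above and computes, in exact rational arithmetic, the interval of $\theta$ on which all of them are nonnegative. The function names are hypothetical, and the assumption that "the range" in the example means this nonnegativity interval is ours.

```python
from fractions import Fraction

# Assumption: the i-th component of p_theta is 1/4 + (c_i / 4) * theta,
# with the slopes c_i read off the vector displayed in Example 8.
SLOPES = [17, 1, -7, -11]


def p_theta(theta):
    """Return the four components of p_theta as exact fractions."""
    theta = Fraction(theta)
    return [Fraction(1, 4) + Fraction(c, 4) * theta for c in SLOPES]


def theta_range():
    """Interval of theta on which every component of p_theta is >= 0.

    Each constraint reads 1 + c * theta >= 0, which gives a lower bound
    -1/c when c > 0 and an upper bound -1/c when c < 0.
    """
    lower = max(Fraction(-1, c) for c in SLOPES if c > 0)
    upper = min(Fraction(-1, c) for c in SLOPES if c < 0)
    return lower, upper


def is_probability_vector(theta):
    """True when p_theta has nonnegative components summing to one."""
    p = p_theta(theta)
    return sum(p) == 1 and all(x >= 0 for x in p)


if __name__ == "__main__":
    print(theta_range())  # (Fraction(-1, 17), Fraction(1, 11))
```

Because the slopes sum to zero, the components add up to one for every $\theta$, so only the sign constraints restrict the parameter. Any candidate value of $\theta$ can be tested against the example's conclusion with `is_probability_vector`.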
The following theorem tells us that a reverse -projection on a -family can be turned into a forward -projection on the associated linear family. We shall refer to this as the projection theorem for the -divergence. This theorem is analogous to the one for -divergence [27, Th. 3.3], -divergence [43, Th. 18], and -divergence [41, Th. 6].
Theorem 23
Let . Let . Let be the empirical probability measure of and let
[TABLE]
where . Let be the forward -projection of on . Then the following hold.
- (i) If , then is the reverse -projection of on .
- (ii) If , then does not have a reverse -projection on . However, is the reverse -projection of on cl.
Proof 14
Let us first observe that is constructed so that . Since the families and are defined by the same functions , , by Corollary 22, we have and
[TABLE]
Hence it is clear that the minimizer of over is the same as the minimizer of over . (Notice that this statement is also true with replaced by .) But over is uniquely minimized by . Hence if , since the minimum value of over is the same as that of over , the latter is not attained on .
Remark 12
Theorems 21, 23, and Corollary 22 continue to hold for as well if attention is restricted to probability measures with strictly positive components and the existence of is guaranteed.
References
- [1] S. F. Ateya and E. A. Madhagi, “On multivariate truncated generalized Cauchy distribution,” Stat. Papers, vol. 54, pp. 879–897, 2013.
- [2] V. D. Barnett, “Evaluation of the maximum-likelihood estimator where the likelihood equation has multiple roots,” Biometrika, vol. 53, pp. 151–165, 1966.
- [3] A. G. Bashkirov, “On maximum entropy principle, superstatistics, power-law distribution and Rényi parameter,” Phys. A, vol. 340, pp. 153–162, 2004.
- [4] A. Basu, S. Basu, and G. Chaudhury, “Robust minimum divergence procedures for count data models,” Sankhya: The Indian Journal of Statistics, vol. 59, pp. 11–27, 1997.
- [5] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, “Robust and efficient estimation by minimizing a density power divergence,” Biometrika, vol. 85, pp. 549–559, 1998.
- [6] A. Basu, H. Shioya, and C. Park, Statistical Inference: The Minimum Distance Approach. Chapman & Hall/CRC Monographs on Statistics and Applied Probability 120, 2011.
- [7] R. Beran, “Minimum Hellinger distance estimates for parametric models,” Ann. Statist., vol. 5, pp. 445–463, 1977.
- [8] D. P. Bertsekas, Nonlinear Programming. 2nd ed. Belmont, MA: Athena Scientific, 2003.
