Projection Theorems and Estimating Equations for Power-Law Models

Atin Gayen; M. Ashok Kumar

arXiv:1905.01434·math.ST·January 27, 2021


TL;DR

This paper extends projection theorems for divergence measures to continuous models, simplifying estimation problems for power-law distributions like Student and Cauchy.

Contribution

It introduces regularity for generalized exponential models and applies projection theorems to solve estimation problems for specific power-law distributions.

Findings

1. Projection theorems are extended to continuous models.
2. Estimation problems for Student and Cauchy distributions are solved.
3. Regularity notion for generalized exponential models is introduced.

Abstract

We extend projection theorems concerning Hellinger and Jones et al. divergences to the continuous case. These projection theorems reduce certain estimation problems on generalized exponential models to linear problems. We introduce the notion of regularity for generalized exponential models and show that the projection theorems in this case are similar to the ones in discrete and canonical case. We also apply these ideas to solve certain estimation problems concerning Student and Cauchy distributions.

Tables

Table 1. Comparison of Jones et al. estimator of mean parameter with its ML-estimator for the mixture $0.8p+0.2\mathcal{N}(10,1)$, where $p$ is the Student distribution with $\alpha=2$, $\mu=0$ and $\sigma=1$.

| Degrees of freedom of the model | MLE | MLE (without outliers) | Jones et al. estimator |
|---|---|---|---|
| $\nu=-3$ ($\alpha=2$) | 2.1733 | 0.1721 | 0.3674 |
| $\nu=-5$ ($\alpha=1.5$) | 2.2730 | 0.2770 | 0.3186 |

Table 2. $\beta$-Hellinger estimators of the location and scale parameters obtained by kernel density estimation method using uniform and Epanechnikov kernels with bandwidth $h_{n}=1/\sqrt{n}$ and sample size $n=50$.

| Parameters ($\mu_{T}=0$, $\sigma_{T}=1$) | $\hat{\mu}$ (uniform kernel) | $\hat{\sigma}$ (uniform kernel) | $\hat{\mu}$ (Epanechnikov kernel) | $\hat{\sigma}$ (Epanechnikov kernel) |
|---|---|---|---|---|
| $\nu=1$ ($\beta=2$) | 0.7689 | 21.1981 | 0.7689 | 21.3493 |
| $\nu=2.1$ ($\beta=51/31$) | 0.1535 | 1.6649 | 0.1535 | 1.6844 |
| $\nu=3$ ($\beta=3/2$) | 0.0480 | 1.4575 | 0.0480 | 1.4567 |
| $\nu=4$ ($\beta=7/5$) | 0.0585 | 1.3061 | 0.0585 | 1.3049 |
| $\nu=5$ ($\beta=4/3$) | 0.0119 | 1.2707 | 0.0119 | 1.2736 |
| $\nu=7$ ($\beta=5/4$) | -0.0383 | 1.1307 | -0.0383 | 1.1312 |
| $\nu=10$ ($\beta=13/11$) | 0.0087 | 1.0961 | 0.0087 | 1.0919 |
| $\nu=15$ ($\beta=9/8$) | 0.0045 | 1.0760 | 0.0045 | 1.0737 |

Equations

$$D_{\alpha}(p,q) := \frac{1}{\alpha-1}\Big(\int p(x)^{\alpha}q(x)^{1-\alpha}\,dx-1\Big).$$

$$B_{\alpha}(p,q) := \frac{\alpha}{1-\alpha}\int p(x)q(x)^{\alpha-1}\,dx-\frac{1}{1-\alpha}\int p(x)^{\alpha}\,dx+\int q(x)^{\alpha}\,dx.$$

$$\mathscr{I}_{\alpha}(p,q) := \frac{\alpha}{1-\alpha}\ln\int p(x)q(x)^{\alpha-1}\,dx-\frac{1}{1-\alpha}\ln\int p(x)^{\alpha}\,dx+\ln\int q(x)^{\alpha}\,dx.$$

$$I(p,q) := \int p(x)\ln\frac{p(x)}{q(x)}\,dx.$$

$$\frac{1}{n}\sum_{j=1}^{n}s(X_{j};\theta)=0,$$

$$\sum_{x\in\mathbb{S}}p_{n}(x)s(x;\theta)=0,$$

$$\sum_{x\in\mathbb{S}}p_{n}(x)^{\alpha}p_{\theta}(x)^{1-\alpha}s(x;\theta)=0,$$

$$\int\widetilde{p}_{n}(x)^{\alpha}p_{\theta}(x)^{1-\alpha}s(x;\theta)\,dx=0,$$

$$\frac{1}{n}\sum_{j=1}^{n}p_{\theta}(X_{j})^{\alpha-1}s(X_{j};\theta)=\int p_{\theta}(x)^{\alpha}s(x;\theta)\,dx,$$

$$\frac{\frac{1}{n}\sum_{j=1}^{n}p_{\theta}(X_{j})^{\alpha-1}s(X_{j};\theta)}{\frac{1}{n}\sum_{j=1}^{n}p_{\theta}(X_{j})^{\alpha-1}}=\frac{\int p_{\theta}(x)^{\alpha}s(x;\theta)\,dx}{\int p_{\theta}(x)^{\alpha}\,dx},$$

$$L(\theta) := \frac{1}{n}\sum_{j=1}^{n}\ln p_{\theta}(X_{j}),$$

$$L_{1}^{(\alpha)}(\theta),\qquad L_{2}^{(\alpha)}(\theta),\qquad L_{3}^{(\alpha)}(\theta)$$

$$\inf_{p_{\theta}\in\Pi}D(\bar{p}_{n},p_{\theta}),$$

$$\inf_{p\in\mathbb{C}}D(p,q)$$

$$p_{\theta}({\bf x})=\begin{cases}\big[h({\bf x})+F(\theta)+w(\theta)^{\top}f({\bf x})\big]^{\frac{1}{\alpha-1}},&{\bf x}\in\mathbb{S},\\ 0,&\text{otherwise},\end{cases}$$

$$p_{\boldsymbol{\mu},\boldsymbol{\Sigma}}({\bf x})=N_{\boldsymbol{\Sigma},\nu}\Big[1+\frac{1}{\nu}({\bf x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}({\bf x}-\boldsymbol{\mu})\Big]_{+}^{-\frac{\nu+d}{2}},$$

$$\mathbb{S}=\begin{cases}\big\{{\bf x}:({\bf x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}({\bf x}-\boldsymbol{\mu})<-\nu\big\},&\text{if }\nu\in(-\infty,\min\{0,2-d\}),\\ \mathbb{R}^{d},&\text{if }\nu\in(0,\infty),\end{cases}$$

$$N_{\boldsymbol{\Sigma},\nu}:=\begin{cases}\dfrac{\Gamma(1-[\nu/2])}{\Gamma(1-[\nu+d]/2)(-\nu\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}},&\text{if }\nu\in(-\infty,\min\{0,2-d\}),\\[2ex] \dfrac{\Gamma([\nu+d]/2)}{\Gamma(\nu/2)(\nu\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}},&\text{if }\nu\in(0,\infty).\end{cases}$$

$$p_{\theta}({\bf x})=N_{\theta,\alpha}\big[1+b_{\alpha}({\bf x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}({\bf x}-\boldsymbol{\mu})\big]_{+}^{\frac{1}{\alpha-1}},$$

$${\rm Tr}(A):=\sum_{i=1}^{d}a_{ii},\quad{\rm vec}(A):=[a_{11},\dots,a_{1d},a_{21},\ldots,a_{2d},\ldots,a_{d1},\ldots,a_{dd}]^{\top},$$

$$\begin{aligned}p_{\theta}({\bf x})&\stackrel{(a)}{=}\big[N_{\theta,\alpha}^{\alpha-1}+b_{\alpha}N_{\theta,\alpha}^{\alpha-1}\{{\rm Tr}({\bf x}^{\top}\boldsymbol{\Sigma}^{-1}{\bf x})-2\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}{\bf x}+\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\}\big]^{\frac{1}{\alpha-1}}\\ &\stackrel{(b)}{=}\big[N_{\theta,\alpha}^{\alpha-1}+b_{\alpha}N_{\theta,\alpha}^{\alpha-1}\{{\rm Tr}(\boldsymbol{\Sigma}^{-1}{\bf x}{\bf x}^{\top})-2\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}{\bf x}+\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\}\big]^{\frac{1}{\alpha-1}}\\ &\stackrel{(c)}{=}\big[N_{\theta,\alpha}^{\alpha-1}+b_{\alpha}N_{\theta,\alpha}^{\alpha-1}\{{\rm vec}(\boldsymbol{\Sigma}^{-1})^{\top}{\rm vec}({\bf x}{\bf x}^{\top})-2\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}{\bf x}+\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\}\big]^{\frac{1}{\alpha-1}}\\ &=\big[1+(N_{\theta,\alpha}^{\alpha-1}+b_{\alpha}N_{\theta,\alpha}^{\alpha-1}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}-1)-2b_{\alpha}N_{\theta,\alpha}^{\alpha-1}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu})^{\top}{\bf x}+b_{\alpha}N_{\theta,\alpha}^{\alpha-1}{\rm vec}(\boldsymbol{\Sigma}^{-1})^{\top}{\rm vec}({\bf x}{\bf x}^{\top})\big]^{\frac{1}{\alpha-1}},\end{aligned}$$

$$\begin{aligned}&\theta=[\mu_{i},\sigma_{ij}]^{\top}_{i,j\in\{1,\ldots,d\},\,i\leq j},\quad F(\theta)=N_{\theta,\alpha}^{\alpha-1}+b_{\alpha}N_{\theta,\alpha}^{\alpha-1}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}-1,\\ &h({\bf x})\equiv 1,\quad w(\theta)=\big[w^{(1)}(\theta),w^{(2)}(\theta)\big]^{\top},\quad f({\bf x})=\big[f^{(1)}({\bf x}),f^{(2)}({\bf x})\big]^{\top},\end{aligned}$$

$$w^{(1)}(\theta)=-2b_{\alpha}N_{\theta,\alpha}^{\alpha-1}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu},\quad f^{(1)}({\bf x})={\bf x},\quad w^{(2)}(\theta)=b_{\alpha}N_{\theta,\alpha}^{\alpha-1}{\rm vec}(\boldsymbol{\Sigma}^{-1}),\quad f^{(2)}({\bf x})={\rm vec}({\bf x}{\bf x}^{\top}).$$

$$w^{(1)}(\theta)=[w_{1}(\theta),\dots,w_{d}(\theta)]^{\top},\quad f^{(1)}({\bf x})=[f_{1}({\bf x}),\dots,f_{d}({\bf x})]^{\top},$$

$$w_{i}(\theta)=-2b_{\alpha}N_{\theta,\alpha}^{\alpha-1}\sum_{j=1}^{d}\sigma^{ij}\mu_{j},\quad f_{i}({\bf x})=x_{i},\quad\text{for }i\in\{1,\dots,d\},$$

$$w^{(2)}(\theta)=[w_{ij}(\theta)]^{\top}_{i,j\in\{1,\dots,d\},\,i\leq j},\quad f^{(2)}({\bf x})=[f_{ij}({\bf x})]^{\top}_{i,j\in\{1,\dots,d\},\,i\leq j},$$

$$w_{ij}(\theta)=b_{\alpha}N_{\theta,\alpha}^{\alpha-1}\sigma^{ij},\ i,j\in\{1,\dots,d\},\ i\leq j,\quad f_{ii}({\bf x})=x_{i}^{2},\quad f_{ij}({\bf x})=2x_{i}x_{j},\ i,j\in\{1,\dots,d\},\ i<j.$$


Full text

Projection Theorems and Estimating Equations for Power-Law Models

Atin Gayen and M. Ashok Kumar

Discipline of Mathematics

Indian Institute of Technology Palakkad

Kerala 678557, India



Index Terms:

Cauchy distribution, Divergence, Estimating equation, Power-law family, Projection theorem, Student distribution.

I Introduction

A divergence is a non-negative extended real-valued function $D$ defined for any pair of probability distributions $(p,q)$ and satisfying $D(p,q)=0$ if and only if $p=q$. The minimum divergence (or distance) method is popular in statistical inference because of its many desirable properties, including robustness and efficiency [6, 53]. Minimization of information divergence ($I$-divergence), or relative entropy, is closely related to maximum likelihood estimation (MLE) [27, Lem. 3.1]. MLE is not a preferred method when the data are contaminated by outliers. However, $I$-divergence can be extended by replacing the logarithmic function with a power function to produce divergences that are robust to outliers [5, 38, 18]. In this paper, we consider three such families of divergences that are well known in the context of robust statistics. They are defined as follows.

Let $p$ and $q$ be probability distributions having a common support $\mathbb{S}\subseteq\mathbb{R}^{d}$ . Let $\alpha>0,\alpha\neq 1$ .

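(a) The Hellinger divergence $D_{\alpha}$ (also known as Cressie-Read power divergence [19] or power divergence [54] and, up to a monotone function, the same as Rényi divergence [56]):

$$D_{\alpha}(p,q) := \frac{1}{\alpha-1}\Big(\int p({\bf x})^{\alpha}q({\bf x})^{1-\alpha}\,d{\bf x}-1\Big). \tag{1}$$

(b) The Basu et al. divergence $B_{\alpha}$ (also known as power pseudo-distance [12, 13], density power divergence [5, 51, 39], $\beta$-divergence [49]):

$$B_{\alpha}(p,q) := \frac{\alpha}{1-\alpha}\int p({\bf x})q({\bf x})^{\alpha-1}\,d{\bf x}-\frac{1}{1-\alpha}\int p({\bf x})^{\alpha}\,d{\bf x}+\int q({\bf x})^{\alpha}\,d{\bf x}. \tag{2}$$

(c) The Jones et al. divergence $\mathscr{I}_{\alpha}$ [59, 46, 60, 32] (also known as relative $\alpha$-entropy [42, 43], Rényi pseudo-distance [13, 12], logarithmic density power divergence [47], projective power divergence [30], $\gamma$-divergence [32, 18]):

$$\mathscr{I}_{\alpha}(p,q) := \frac{\alpha}{1-\alpha}\ln\int p({\bf x})q({\bf x})^{\alpha-1}\,d{\bf x}-\frac{1}{1-\alpha}\ln\int p({\bf x})^{\alpha}\,d{\bf x}+\ln\int q({\bf x})^{\alpha}\,d{\bf x}. \tag{3}$$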

Throughout the paper we assume that all the integrals are well defined over $\mathbb{S}$ . The integrals are with respect to the Lebesgue measure on $\mathbb{R}^{d}$ in the continuous case and with respect to the counting measure in the discrete case. Many well-known divergences fall in the above classes of divergences. For example, Chi-square divergence, Bhattacharyya distance [9] and Hellinger distance [7] fall in the $D_{\alpha}$ -divergence class; Cauchy-Schwarz divergence [16, Eq. (2.90)] falls in the $\mathscr{I}_{\alpha}$ -divergence class; squared Euclidean distance falls in the $B_{\alpha}$ -divergence class [5]. All three classes of divergences coincide with the $I$ -divergence as $\alpha\rightarrow 1$ [18], where

$$I(p,q) := \int p({\bf x})\ln\frac{p({\bf x})}{q({\bf x})}\,d{\bf x}. \tag{4}$$

In this sense, each of these three classes of divergences can be regarded as a generalization of $I$ -divergence.
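The limiting behaviour can be checked numerically. The following is a minimal sketch in the discrete case (sums replace integrals); the two distributions `p` and `q` and the function names are illustrative assumptions, not taken from the paper. It evaluates the three divergences at $\alpha$ close to 1 and compares them with the $I$-divergence.

```python
import numpy as np

def hellinger_divergence(p, q, alpha):
    """D_alpha(p, q) as in (1), with a sum in place of the integral."""
    return (np.sum(p**alpha * q**(1.0 - alpha)) - 1.0) / (alpha - 1.0)

def basu_divergence(p, q, alpha):
    """B_alpha(p, q) as in (2)."""
    return (alpha / (1.0 - alpha) * np.sum(p * q**(alpha - 1.0))
            - np.sum(p**alpha) / (1.0 - alpha)
            + np.sum(q**alpha))

def jones_divergence(p, q, alpha):
    """I_alpha(p, q) (Jones et al.) as in (3)."""
    return (alpha / (1.0 - alpha) * np.log(np.sum(p * q**(alpha - 1.0)))
            - np.log(np.sum(p**alpha)) / (1.0 - alpha)
            + np.log(np.sum(q**alpha)))

def i_divergence(p, q):
    """I(p, q) as in (4): relative entropy."""
    return np.sum(p * np.log(p / q))

# Illustrative distributions on a three-point support (assumed values).
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

alpha = 1.0 + 1e-4  # close to 1, where all three should approach I(p, q)
print("I(p, q)       :", i_divergence(p, q))
print("D_alpha(p, q) :", hellinger_divergence(p, q, alpha))
print("B_alpha(p, q) :", basu_divergence(p, q, alpha))
print("I_alpha(p, q) :", jones_divergence(p, q, alpha))
```

At $\alpha=2$ the same `basu_divergence` reduces to $\sum_{x}(p(x)-q(x))^{2}$, which is the squared Euclidean distance mentioned above.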

$D_{\alpha}$ -divergences also arise as generalized cut-off rates in information theory [23]. $B_{\alpha}$ -divergences belong to the Bregman class which is characterized by transitive projection rules [22, Eq. (3.2), Theorem 3], [39, Example 3]. $\mathscr{I}_{\alpha}$ -divergences (for $\alpha<1$ ) arise in information theory as a redundancy measure in the mismatched cases of guessing [59], source coding [43] and encoding of tasks [15]. The three classes of divergences are closely related to robust estimation, for $\alpha>1$ in case of $B_{\alpha}$ and $\mathscr{I}_{\alpha}$ , and $\alpha<1$ in case of $D_{\alpha}$ , as we shall see now.

Let $\textbf{X}_{1},\ldots,\textbf{X}_{n}$ be an independent and identically distributed (i.i.d.) sample drawn from an unknown distribution $p$ . Let us suppose that $p$ is a member of a parametric family of probability distributions $\Pi=\{p_{\theta}:\theta\in\Theta\}$ , where $\Theta$ is an open subset of $\mathbb{R}^{k}$ and all $p_{\theta}$ have a common support $\mathbb{S}\subseteq\mathbb{R}^{d}$ . MLE picks the distribution $p_{\theta^{*}}\in\Pi$ that would have most likely caused the sample. MLE solves the so-called score equation or estimating equation for $\theta$ , given by

$$\frac{1}{n}\sum_{j=1}^{n}s({\bf X}_{j};\theta)=0, \tag{5}$$

where $s(\textbf{x};\theta):=\nabla\ln p_{\theta}(\textbf{x})$ is called the score function and $\nabla$ stands for the gradient with respect to $\theta$. In the discrete case, the above equation can be re-written as

$$\sum_{{\bf x}\in\mathbb{S}}p_{n}({\bf x})s({\bf x};\theta)=0, \tag{6}$$

where $p_{n}$ is the empirical measure of the sample $\textbf{X}_{1},\dots,\textbf{X}_{n}$ .

Let us now suppose that the sample $\textbf{X}_{1},\ldots,\textbf{X}_{n}$ is from a mixture distribution of the form $p_{\epsilon}=(1-\epsilon)p+\epsilon\delta$, $\epsilon\in[0,1)$, where $p$ is supposed to be a member of $\Pi=\{p_{\theta}:\theta\in\Theta\}$; $p$ is regarded as the distribution of “true” samples and $\delta$ as that of outliers. Assume that the support of $\delta$ is a subset of $\mathbb{S}$. While the usual MLE tries to fit a distribution for $p_{\epsilon}$, robust estimation tries to fit for $p_{\theta}$. Throughout the paper, the above will be the setup in all the estimation problems, unless otherwise stated. Thus for robust estimation, one needs to modify the estimating equation so that the effect of outliers is down-weighted. The following modified estimating equation, referred to as the generalized Hellinger estimating equation, was proposed in [4], where the score function is weighted by $p_{n}(\textbf{x})^{\alpha}p_{\theta}(\textbf{x})^{1-\alpha}$ instead of $p_{n}(\textbf{x})$ in (6):

$$\sum_{{\bf x}\in\mathbb{S}}p_{n}({\bf x})^{\alpha}p_{\theta}({\bf x})^{1-\alpha}s({\bf x};\theta)=0, \tag{7}$$

where $\alpha\in(0,1)$. This was proposed based on the following intuition. If $\textbf{x}$ is an outlier, then $p_{n}(\textbf{x})^{\alpha}p_{\theta}(\textbf{x})^{1-\alpha}$ will be smaller than $p_{n}(\textbf{x})$ for sufficiently small values of $\alpha$. Hence the terms corresponding to outliers in (7) are down-weighted (cf. [6, Section 4.3] and the references therein).
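A short worked instance of this intuition, with illustrative numbers that are not from the paper: the weight can be factored as

$$p_{n}({\bf x})^{\alpha}p_{\theta}({\bf x})^{1-\alpha}=p_{n}({\bf x})\Big(\frac{p_{\theta}({\bf x})}{p_{n}({\bf x})}\Big)^{1-\alpha},$$

so for $\alpha\in(0,1)$ it falls below $p_{n}({\bf x})$ exactly when the model assigns less mass to ${\bf x}$ than the sample does, which is the typical situation at an outlier. For example, assuming $p_{n}({\bf x})=0.1$, $p_{\theta}({\bf x})=0.001$ and $\alpha=1/2$, the weight is $\sqrt{0.1\times 0.001}=0.01$, one tenth of the weight $p_{n}({\bf x})=0.1$ used in (6).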

Notice that (7) does not extend to the continuous case due to the appearance of $p_{n}^{\alpha}$. To avoid this technical difficulty, the literature uses smoothing techniques such as kernel density estimation [7, Section 3], [6, Section 3.1, 3.2.1], the Basu-Lindsay approach [6, Section 3.5], the modified approach of Cao et al. [17], and so on, to obtain a continuous estimate of $p_{n}$. The resulting estimating equation is of the form

$$\int\widetilde{p}_{n}({\bf x})^{\alpha}p_{\theta}({\bf x})^{1-\alpha}s({\bf x};\theta)\,d{\bf x}=0, \tag{8}$$

where $\widetilde{p}_{n}$ is some continuous estimate of $p_{n}$ . To avoid this smoothing, Broniatowski et al. derived a duality technique where one first finds a dual representation for the Hellinger distance and then minimizes the empirical estimate of this dual representation to find the estimator. The empirical estimate of this dual representation does not require any smoothing. See [11, 62, 13, 12, 10, 50] for details.

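The following estimating equation, where the score function is weighted by a power of the model density and equated to its hypothetical counterpart, was proposed by Basu et al. [5]:

$$\frac{1}{n}\sum_{j=1}^{n}p_{\theta}({\bf X}_{j})^{\alpha-1}s({\bf X}_{j};\theta)=\int p_{\theta}({\bf x})^{\alpha}s({\bf x};\theta)\,d{\bf x}, \tag{9}$$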

where $\alpha>1$ . Motivated by the works of Field and Smith [31] and Windham [68], an alternate estimating equation, where the weights are further normalized, was proposed by Jones et al. [38]:

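$$\frac{\frac{1}{n}\sum_{j=1}^{n}p_{\theta}({\bf X}_{j})^{\alpha-1}s({\bf X}_{j};\theta)}{\frac{1}{n}\sum_{j=1}^{n}p_{\theta}({\bf X}_{j})^{\alpha-1}}=\frac{\int p_{\theta}({\bf x})^{\alpha}s({\bf x};\theta)\,d{\bf x}}{\int p_{\theta}({\bf x})^{\alpha}\,d{\bf x}}, \tag{10}$$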

where $\alpha>1$. Notice that (9) and (10) do not require the use of the empirical distribution. Hence no smoothing is required in these cases. The estimators of (8), (9) and (10) are consistent and asymptotically normal [5, Theorem 2], [38, Section 3], [7, Theorem 3]. They also satisfy two invariance properties, one when the underlying model is re-parameterized by a one-to-one function of the parameter [5, Section 3.4], and the other when the samples are replaced by a linear transformation of them [61, Theorem 3.1], [5, Section 3.4]. They coincide with the ML-estimating equation (5) when $\alpha=1$ under the condition that $\int p_{\theta}(\textbf{x})s(\textbf{x};\theta)d\textbf{x}=0$. The estimating equations (5), (8), (9) and (10) are, respectively, associated with the divergences in (4), (1), (2), and (3) in a sense that will be made clear in the following.
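To illustrate how (10) down-weights outliers, here is a minimal sketch under an assumed model that is not the one studied in this paper: a normal location family $\mathcal{N}(\theta,\sigma^{2})$ with known $\sigma$. For this model $s(x;\theta)=(x-\theta)/\sigma^{2}$ and the right-hand side of (10) vanishes by symmetry, so (10) reduces to a weighted-mean fixed point with weights $p_{\theta}(X_{j})^{\alpha-1}$. The function name, the starting point and the contaminated sample are all illustrative choices.

```python
import numpy as np

def jones_location_estimate(x, alpha=2.0, sigma=1.0, n_iter=200, tol=1e-10):
    """Solve (10) for the mean of N(theta, sigma^2), sigma known (assumed model).

    With s(x; theta) = (x - theta) / sigma^2 the equation becomes
    sum_j w_j (X_j - theta) = 0 with w_j = p_theta(X_j)^(alpha - 1),
    which is iterated here as theta <- sum_j w_j X_j / sum_j w_j.
    """
    theta = np.median(x)  # starting point (a choice made for this sketch)
    for _ in range(n_iter):
        # p_theta(X_j)^(alpha - 1) up to a constant that cancels in the ratio
        w = np.exp(-(alpha - 1.0) * (x - theta) ** 2 / (2.0 * sigma**2))
        new_theta = np.sum(w * x) / np.sum(w)
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta

# Contaminated sample: 80% from N(0, 1), 20% outliers from N(10, 1).
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.0, 1.0, size=80),
                         rng.normal(10.0, 1.0, size=20)])

mle = sample.mean()                                  # solves (5) for this model
robust = jones_location_estimate(sample, alpha=2.0)  # solves (10)
print(f"MLE (sample mean): {mle:.3f}")
print(f"Jones et al. estimate: {robust:.3f}")
```

For this particular model the right-hand side of (9) is also zero, so (9) and (10) give the same location estimate; the two equations differ in general.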

Observe that the estimating equations (5), (8), (9), and (10) are implications of the first order optimality condition of maximizing, respectively, the usual log-likelihood function

$$L(\theta) := \frac{1}{n}\sum_{j=1}^{n}\ln p_{\theta}({\bf X}_{j}), \tag{11}$$

and the following generalized likelihood functions

[TABLE]

The above likelihood functions (12), (13) and (14) are not defined for $\alpha=1$ . However it can be shown that they all coincide with $L(\theta)$ as $\alpha\to 1$ .

It is easy to see that the probability distribution $p_{\theta}$ that maximizes (12), (11), (13) or (14) is the same as, respectively, the one that minimizes $D_{\alpha}(\widetilde{p}_{n},p_{\theta})$ or the empirical estimates of $I(p,p_{\theta})$, $B_{\alpha}(p_{\epsilon},p_{\theta})$ or $\mathscr{I}_{\alpha}(p_{\epsilon},p_{\theta})$. Thus for MLE or “robustified MLE,” one needs to solve

$$\inf_{p_{\theta}\in\Pi}D(\bar{p}_{n},p_{\theta}), \tag{15}$$

where $D$ is either $I$ , $D_{\alpha}$ , $B_{\alpha}$ or $\mathscr{I}_{\alpha}$ ; $\bar{p}_{n}=p_{n}$ when $D$ is $I,B_{\alpha}$ or $\mathscr{I}_{\alpha}$ and $\bar{p}_{n}=\widetilde{p}_{n}$ when $D$ is $D_{\alpha}$ . Notice that (8) for $\alpha>1$ , (9) and (10) for $\alpha<1$ , do not make sense in terms of robustness. However, they still serve as first order optimality condition for the divergence minimization problem (15). A probability distribution that attains the infimum is known as a reverse $D$ -projection of $\bar{p}_{n}$ on $\Pi$ .
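As a worked instance of this correspondence for $D=I$ in the discrete case, write $H(p_{n}):=-\sum_{{\bf x}}p_{n}({\bf x})\ln p_{n}({\bf x})$ (a notation introduced only for this remark). Since $p_{n}$ is the empirical measure of the sample,

$$I(p_{n},p_{\theta})=\sum_{{\bf x}\in\mathbb{S}}p_{n}({\bf x})\ln p_{n}({\bf x})-\sum_{{\bf x}\in\mathbb{S}}p_{n}({\bf x})\ln p_{\theta}({\bf x})=-H(p_{n})-\frac{1}{n}\sum_{j=1}^{n}\ln p_{\theta}({\bf X}_{j})=-H(p_{n})-L(\theta).$$

The term $H(p_{n})$ does not depend on $\theta$, so minimizing $I(p_{n},p_{\theta})$ over $\Pi$ is the same as maximizing the log-likelihood $L(\theta)$.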

A “dual” minimization problem is the so-called forward projection problem, where the minimization is over the first argument of the divergence function. Given a set $\mathbb{C}$ of probability distributions with support $\mathbb{S}$ and a probability distribution $q$ with the same support, any $p^{*}\in\mathbb{C}$ that attains

$$\inf_{p\in\mathbb{C}}D(p,q)$$

is called a forward $D$ -projection of $q$ on $\mathbb{C}$ . Forward projection is usually on a convex set or on an $\alpha$ -convex set of probability distributions. Forward projection on a convex set is motivated by the well-known maximum entropy principle of statistical physics [36]. Motivation for forward projection on $\alpha$ -convex set comes from the so-called non-extensive statistical physics [63, 65, 64, 42]. Forward $I$ -projection on convex set was extensively studied by Csiszár [20, 21, 24], Csiszár and Matúš [26, 25], Csiszár and Shields [27], and Csiszár and Tusnády [28].

The forward projections of any of the divergences in (1)-(4) on convex (or $\alpha$-convex) sets of probability distributions yield a parametric family of probability distributions. A reverse projection on this parametric family turns into a forward projection on the convex (or $\alpha$-convex) set, which further reduces to solving a system of linear equations. We call such a result a projection theorem of the divergence. These projection theorems are mainly due to an “orthogonal” relationship between the convex (or the $\alpha$-convex) family and the associated parametric family. The Pythagorean theorem of the associated divergence plays a key role in this context.

Projection theorem of the $I$ -divergence is due to Csiszár and Shields [27, pp. 24] where the convex family is a linear family and the associated parametric family is an exponential family. Projection theorem for $\mathscr{I}_{\alpha}$ -divergence was established by Kumar and Sundaresan [43, Theorem 18 and Theorem 21], where the so-called $\alpha$ -power-law family ( $\mathbb{M}^{(\alpha)}$ -family) plays the role of the exponential family. Projection theorem for $D_{\alpha}$ -divergence was established by Kumar and Sason [41, Theorem 6], where a variant of the $\alpha$ -power-law family, called $\alpha$ -exponential family ( $\mathscr{E}^{(\alpha)}$ -family), plays the role of the exponential family and the so-called $\alpha$ -linear family plays the role of the linear family. Projection theorem for more general class of Bregman divergences, in which $B_{\alpha}$ is a subclass, was established by Csiszár and Matúš [26] using techniques from convex analysis. (See also [52].) We observe that the parametric family associated with the projection theorem of $B_{\alpha}$ -divergence is closely related to the $\alpha$ -power-law family, which we call a $\mathbb{B}^{(\alpha)}$ -family.

Thus projection theorems enable us to find the estimator (MLE or any of the generalized estimators) as a forward projection if the estimation is done under a specific parametric family. While for MLE the required family is exponential, for the generalized estimations, it is one of the power-law families.

Our main contributions in this paper are the following.

1. The projection theorem for $\mathscr{I}_{\alpha}$-divergence is known in the literature only for the discrete, canonical case. We first define the associated power-law family $\mathbb{M}^{(\alpha)}$ in a more general setup and establish the projection theorem for $\mathscr{I}_{\alpha}$ on $\mathbb{M}^{(\alpha)}$.

2. We derive the projection theorem for $D_{\alpha}$-divergence on $\mathscr{E}^{(\alpha)}$-family in more generality by establishing a one-to-one correspondence between this problem and the projection problem concerning $\mathscr{I}_{\alpha}$-divergence on $\mathbb{M}^{(\alpha)}$-family.

3. We introduce the concept of regularity (full-rank family) for the power-law families $\mathbb{B}^{(\alpha)}$, $\mathbb{M}^{(\alpha)}$ and $\mathscr{E}^{(\alpha)}$. We also establish a close relationship among them.

4. We show that the Cauchy distributions (also known as $q$-Gaussian distributions [55, 66, 52, 34, 48]) are the escort distributions of the Student distributions [37], [30]. Also, Cauchy and Student distributions, respectively, form regular $\mathscr{E}^{(\alpha)}$ and regular $\mathbb{M}^{(\alpha)}$ (and $\mathbb{B}^{(\alpha)}$) families.

5. We find some generalized estimators for the location and scale parameters of the Student and Cauchy distributions using the projection theorems of the Jones et al. and Hellinger divergences. We also observe that these projection theorems cannot be applied when the distributions are compactly supported. In this case the estimators should be found on a case-by-case basis. We find estimators in one such case and compare them with MLE.

The rest of the paper is organized as follows. In Section II, we first generalize the power-law families to the continuous case and show that the Student and Cauchy distributions belong to this class. We also introduce the notion of regularity for these power-law families and establish the relationship among them in this section. In Section III, we establish projection theorems for the general power-law families. In Section IV, we apply projection theorems to Student and Cauchy distributions to find generalized estimators for their parameters. We also perform some simulations to analyze the efficacy of such estimators. We end the paper with a summary and concluding remarks in Section V. In the Appendix, we establish the projection theorem of $B_{\alpha}$-divergence in the discrete case using elementary tools and identify the parametric family associated with this divergence.

II The power-law families: definition and examples

In this section, we define the power-law families associated with the projection theorems of the divergences $B_{\alpha}$, $\mathscr{I}_{\alpha}$ and $D_{\alpha}$ in a more general set-up than the one studied in the literature. We also introduce the concept of regularity for these families. In the literature such a notion has been studied for the exponential family, where it is sometimes referred to as a full-rank family (see [44, 35]). We then make a comparison among these families. We also show that the well-known Student and Cauchy distributions can be expressed as regular power-law families.

II-A The $\mathbb{B}^{(\alpha)}$ -family

Motivation for $\mathbb{B}^{(\alpha)}$ -family comes from the forward projection of $B_{\alpha}$ -divergence on a linear family (See (93)). Csiszár and Matúš [26] studied a more general form of this family in connection with the projection problems of Bregman divergences.

Definition 1

Consider a family of probability distributions $\{p_{\theta}:\theta\in\Theta\}$ on $\mathbb{R}^{d}$, where $\Theta$ is an open subset of $\mathbb{R}^{k}$. Let $\mathbb{S}$ be the support of $p_{\theta}$ (which may depend on $\theta$). Let $w=[w_{1},\ldots,w_{s}]^{\top}$ and $f=[f_{1},\ldots,f_{s}]^{\top}$, where $w_{i}:\Theta\to\mathbb{R}$ is differentiable for $i\in\{1,\ldots,s\}$, $f_{i}:\mathbb{R}^{d}\to\mathbb{R}$ for $i\in\{1,\ldots,s\}$ and $h:\mathbb{R}^{d}\to\mathbb{R}$. The family is said to form a $k$-parameter $\mathbb{B}^{(\alpha)}$-family characterized by $h,w,f,\Theta$ and $\mathbb{S}$ if

$$p_{\theta}({\bf x})=\begin{cases}\big[h({\bf x})+F(\theta)+w(\theta)^{\top}f({\bf x})\big]^{\frac{1}{\alpha-1}},&{\bf x}\in\mathbb{S},\\ 0,&\text{otherwise},\end{cases} \tag{19}$$

for some differentiable function $F:\Theta\to\mathbb{R}$. Here $F(\theta)$ is the normalizing factor that can be determined from $\int_{\mathbb{S}}[h({\bf{x}})+F(\theta)+w(\theta)^{\top}f({\bf{x}})]^{1/(\alpha-1)}d{\bf{x}}=1$.

The family is said to be regular if, in addition, the following conditions are satisfied.

(i) support $\mathbb{S}$ does not depend on the parameter $\theta$,

(ii) number of $\theta_{i}$’s equals the number of $w_{i}$’s, that is, $s=k$,

(iii) the functions $1,w_{1},\ldots,w_{s}$ are linearly independent on $\Theta$,

(iv) the functions $1,f_{1},\ldots,f_{s}$ are linearly independent on $\mathbb{S}$.

Further, it is said to be in canonical form if $w_{i}(\theta)=\theta_{i}$ for $i\in\{1,\ldots,k\}$ . The natural parameter space in this case is given by the set of all $\theta\in\mathbb{R}^{k}$ such that $[h({\bf{x}})+F(\theta)+w(\theta)^{\top}f({\bf{x}})]^{1/(\alpha-1)}>0$ on $\mathbb{S}$ and $\int_{\mathbb{S}}[h({\bf{x}})+F(\theta)+w(\theta)^{\top}f({\bf{x}})]^{1/(\alpha-1)}d{\bf{x}}=1$ .

Observe that $\mathbb{B}^{(\alpha)}$ -family is a special case of the family $\mathcal{F}_{[\beta h]}$ in [26, Eq. (28)] with $h=q$ and $\beta(\cdot,t)=\frac{1}{\alpha-1}[t^{\alpha}-\alpha t+\alpha-t]$ . Bashkirov [3, Eq. (15)] derived maximum Rényi entropy distribution subject to linear constraints on underlying probability distribution, as in (19), and called it to be in S-form. Naudts [51, Ex. 4] derived the canonical $\mathbb{B}^{(\alpha)}$ -family with $h\equiv 1$ as the ‘free energy’ minimizing distributions with respect to Tsallis entropy. We shall now see some examples of $\mathbb{B}^{(\alpha)}$ -family.

Example 1 (Student distributions)

Let ${\boldsymbol{\mu}}:=[\mu_{1},\ldots,\mu_{d}]^{\top}\in\mathbb{R}^{d}$ , $\boldsymbol{\Sigma}:=(\sigma_{ij})$ be a symmetric, positive-definite matrix of order $d$ and $\nu\in\mathbb{R}\setminus\{0\}$ . The $d$ -dimensional Student distribution with location parameter ${\boldsymbol{\mu}}$ , scale parameter $\boldsymbol{\Sigma}$ and degrees of freedom parameter $\nu$ , with $\nu\notin[2-d,0]$ when $d\geq 3$ , is given by

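$$p_{\boldsymbol{\mu},\boldsymbol{\Sigma}}({\bf x})=N_{\boldsymbol{\Sigma},\nu}\Big[1+\frac{1}{\nu}({\bf x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}({\bf x}-\boldsymbol{\mu})\Big]_{+}^{-\frac{\nu+d}{2}}, \tag{20}$$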

where for a real number $r$ , $[r]_{+}:=\max\{r,0\}$ . The support of this distribution is given by

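$$\mathbb{S}=\begin{cases}\big\{{\bf x}:({\bf x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}({\bf x}-\boldsymbol{\mu})<-\nu\big\},&\text{if }\nu\in(-\infty,\min\{0,2-d\}),\\ \mathbb{R}^{d},&\text{if }\nu\in(0,\infty),\end{cases}$$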

and the normalizing factor

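$$N_{\boldsymbol{\Sigma},\nu}:=\begin{cases}\dfrac{\Gamma(1-[\nu/2])}{\Gamma(1-[\nu+d]/2)(-\nu\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}},&\text{if }\nu\in(-\infty,\min\{0,2-d\}),\\[2ex] \dfrac{\Gamma([\nu+d]/2)}{\Gamma(\nu/2)(\nu\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}},&\text{if }\nu\in(0,\infty).\end{cases}$$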

It should be noted that Student distributions are not defined for $\nu\in[2-d,0]$ when $d\geq 3$, as (20) is not integrable in this case. These distributions do not have a finite mean for $\nu\in(0,1]$ and do not have a finite variance for $\nu\in(0,2]$. For all other values of $\nu$, the mean and covariance matrix of these distributions are given by $\boldsymbol{\mu}$ and $[\nu/(\nu-2)]\cdot\boldsymbol{\Sigma}$ respectively. Further, (20) coincides with a normal distribution as $|\nu|\to\infty$.
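The density, its support and the normalizing factor can be sanity-checked numerically. The sketch below is restricted to $d=1$ (so that $\boldsymbol{\Sigma}=\sigma^{2}$ and $|\boldsymbol{\Sigma}|^{1/2}=\sigma$); the function names, the integration grid and the chosen values of $\nu$ are assumptions made for illustration. It checks that the density integrates to one for both signs of $\nu$ and that the variance equals $\nu/(\nu-2)$ in a compactly supported case.

```python
import math
import numpy as np

def student_pdf_1d(x, nu, mu=0.0, sigma=1.0):
    """Density (20) for d = 1, with Sigma = sigma^2 so |Sigma|^(1/2) = sigma."""
    d = 1
    if nu > 0:
        norm = math.gamma((nu + d) / 2) / (
            math.gamma(nu / 2) * math.sqrt(nu * math.pi) * sigma)
    elif nu < min(0, 2 - d):
        norm = math.gamma(1 - nu / 2) / (
            math.gamma(1 - (nu + d) / 2) * math.sqrt(-nu * math.pi) * sigma)
    else:
        raise ValueError("nu must be positive or below min(0, 2 - d)")
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = 1.0 + ((x - mu) / sigma) ** 2 / nu
    pdf = np.zeros_like(x)
    inside = z > 0  # [r]_+ = max(r, 0): the density vanishes off the support
    pdf[inside] = norm * z[inside] ** (-(nu + d) / 2)
    return pdf

def trapezoid(y, x):
    """Plain trapezoidal rule, written out to avoid version-specific helpers."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

grid = np.linspace(-200.0, 200.0, 400001)
for nu in (3.0, -3.0, -5.0):
    mass = trapezoid(student_pdf_1d(grid, nu), grid)
    print(f"nu = {nu:+.1f}: total mass ~ {mass:.6f}")

# Compactly supported case nu = -3: variance should be nu / (nu - 2) = 0.6.
variance = trapezoid(grid**2 * student_pdf_1d(grid, -3.0), grid)
print(f"nu = -3.0: variance ~ {variance:.6f}")
```

For $\nu=-3$ the support is $|x|<\sqrt{3}$ and the density is proportional to $1-x^{2}/3$ there, which makes the compact-support case easy to verify by hand as well.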

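Let $\alpha:=1-\frac{2}{\nu+d}$. Then $\nu\to+\infty$ and $\nu\to-\infty$ correspond to $\alpha\to 1$ from the left and the right respectively. Let $\theta=[\mu_{i},\sigma_{ij}]^{\top}_{i,j\in\{1,\ldots,d\},i\leq j}$. Then (20) can be re-written as

$$p_{\theta}({\bf x})=N_{\theta,\alpha}\big[1+b_{\alpha}({\bf x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}({\bf x}-\boldsymbol{\mu})\big]_{+}^{\frac{1}{\alpha-1}}, \tag{21}$$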

where $\alpha\in(-\infty,\min\{0,(d-2)/d\})\cup((d-2)/d,1)\cup(1,\infty)$ , $b_{\alpha}=1/\nu=(1-\alpha)/[2-d(1-\alpha)]$ and $N_{\theta,\alpha}=N_{\boldsymbol{\Sigma},\nu}$ , the normalizing factor. Notice that the Student distribution with $\nu=-d$ is not considered in (21) as $\nu=-d$ corresponds to an infinite value of $\alpha$ . For a matrix $A=(a_{ij})_{d\times d}$ , we use the following notations.

[TABLE]

that is, $\mathrm{vec}(A)$ is a column vector of dimension $d^{2}$ whose $[(i-1)d+j]$ -th element is $a_{ij}$ for $i,j\in\{1,\dots,d\}$ . With this notation, (21) can be re-written, for ${\bf{x}}\in\mathbb{S}$ , as

[TABLE]

where equality (a) follows because ${\bf{x}}^{\top}\boldsymbol{\Sigma}^{-1}{\bf{x}}$ is a scalar (and hence equals its own trace), (b) follows because $\mathrm{Tr}(AB)=\mathrm{Tr}(BA)$ , and (c) follows because $\mathrm{Tr}(AB)=\mathrm{vec}(A)^{\top}\mathrm{vec}(B^{\top})$ . Comparing the last expression with (19), we conclude that the Student distributions form a $d(d+3)/2$ -parameter $\mathbb{B}^{(\alpha)}$ -family with

[TABLE]

where

[TABLE]

The distributions in (20) for $\nu\in(-\infty,-d)\cup(2,\infty)$ were studied by Johnson and Vignat [37, Definition 1] as the maximizer of Rényi entropy under covariance constraint, where they classified them as Student-t when $\nu>2$ and Student-r when $\nu<-d$ (see also [3]). For simplicity we just call them Student distributions. Observe that (20) for $\nu>0$ is the usual $d$ -dimensional $t$ -distribution.
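The trace and vectorization identities used in steps (a)-(c) of Example 1 can be checked numerically. The following is a minimal sketch in plain Python for $d=2$ ; the matrix `sigma_inv`, the point `x` and the helper names are illustrative choices, not taken from the paper.

```python
def vec(a):
    """Row-major vectorization: entry a[i][j] lands at index i*d + j."""
    return [entry for row in a for entry in row]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Illustrative symmetric positive-definite inverse scale matrix and a point x.
sigma_inv = [[2.0, -0.5], [-0.5, 1.0]]
x = [1.5, -2.0]
outer = [[xi * xj for xj in x] for xi in x]           # x x^T

quad_form = dot(x, [dot(row, x) for row in sigma_inv])   # x^T Sigma^{-1} x
via_trace = trace(matmul(sigma_inv, outer))              # steps (a) and (b)
via_vec = dot(vec(sigma_inv), vec(transpose(outer)))     # step (c)

# The identity in (c) also holds for non-symmetric matrices.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.0, 5.0], [6.0, 7.0]]
lhs = trace(matmul(a, b))
rhs = dot(vec(a), vec(transpose(b)))
```

All three quantities `quad_form`, `via_trace` and `via_vec` agree, which is what allows the quadratic form to be written as a linear function $w(\theta)^{\top}f({\bf{x}})$ of the statistics ${\bf{x}}$ and $\mathrm{vec}({\bf{x}}{\bf{x}}^{\top})$ .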

Theorem 2

The Student distributions for $\nu>0$ (that is, $\alpha\in({(d-2)}/{d},1)$ ) form a regular $\mathbb{B}^{(\alpha)}$ -family.

Proof 1

Let $\boldsymbol{\Sigma}^{-1}:=(\sigma^{ij})_{d\times d}$ be the inverse of $\boldsymbol{\Sigma}$ . The characterizing functions $w^{(i)}$ ’s and $f^{(i)}$ ’s in (24) are given by

[TABLE]

such that

[TABLE]

and

[TABLE]

where

[TABLE]

Note that the number of $w_{i}$ ’s and $w_{ij}$ ’s is $d+d+(d-1)+(d-2)+\cdots+1=d(d+3)/2$ , which is the same as the number of unknown parameters $\theta_{i}$ ’s. Also $1$ , $f_{i}$ ’s and $f_{ij}$ ’s are linearly independent on $\mathbb{S}$ . Hence it only remains to show that $1$ , $w_{i}$ ’s and $w_{ij}$ ’s are linearly independent on $\Theta$ . Suppose that

[TABLE]

Dividing both sides by $b_{\alpha}N_{\theta,\alpha}^{\alpha-1}$ ,

[TABLE]

Taking partial derivative with respect to $\boldsymbol{\mu}$ in (25),

[TABLE]

where $\mathbf{0}$ is the zero vector in $\mathbb{R}^{d}$ . Since $|\boldsymbol{\Sigma}^{-1}|\neq 0$ , from (26) we must have $c_{1}=\cdots=c_{d}=0$ . Thus (25) becomes

[TABLE]

For $i,j\in\{1,\ldots,d\}$ , $i\leq j$ ,

[TABLE]

where $k_{\theta}:=[cb_{\alpha}^{-1}(\alpha-1)N_{\theta,\alpha}^{1-\alpha}]\big{/}[2|\boldsymbol{\Sigma}^{-1}|]$ and $\partial_{\sigma^{ij}}$ denotes partial derivative with respect to $\sigma^{ij}$ . Thus differentiating (27) with respect to $\sigma^{ij}$ , for $i,j\in\{1,\ldots,d\}$ , $i\leq j$ , $c_{ij}=k_{\theta}\partial_{\sigma^{ij}}(|\boldsymbol{\Sigma}^{-1}|)$ . Using these values in (27),

[TABLE]

Since $\boldsymbol{\Sigma}^{-1}$ is symmetric,

[TABLE]

Using this in (28),

[TABLE]

Since $\alpha>{(d-2)}/{d}$ , we get $c=0$ . This implies $k_{\theta}=0$ and thus $c_{ij}=0$ for all $i,j\in\{1,\ldots,d\}$ , $i\leq j$ . Hence $1$ , $w_{i}$ ’s and $w_{ij}$ ’s are linearly independent. This completes the proof.

Remark 1

Student distributions for $\nu<0$ do not form a regular $\mathbb{B}^{(\alpha)}$ -family as their support, in this case, depends on the unknown parameters.

Example 2

Wigner semi-circle distributions [67] form a $\mathbb{B}^{(\alpha)}$ -family.

II-B The $\mathbb{M}^{(\alpha)}$ -family

We now define the parametric family $\mathbb{M}^{(\alpha)}$ associated with the projection theorem of $\mathscr{I}_{\alpha}$ . Kumar and Sundaresan [43] studied this family in the discrete case.

Definition 3

Let $h,w,f,\Theta$ and $\mathbb{S}$ be as in Definition 1. The family of probability distributions $\{p_{\theta}:\theta\in\Theta\}$ is said to form a $k$ -parameter $\alpha$ -power-law family or an $\mathbb{M}^{(\alpha)}$ -family characterized by $h,w,f,\Theta$ and $\mathbb{S}$ if

[TABLE]

for some differentiable function $Z:\Theta\to\mathbb{R}$ . Here $Z(\theta)$ is the normalizing factor, given by $Z(\theta)=1/\int_{\mathbb{S}}[h({\bf{x}})+w(\theta)^{\top}f({\bf{x}})]^{1/(\alpha-1)}d{\bf{x}}$ .

Bashkirov [3] derived a specific form of (31) in connection with Rényi entropy maximization and called it the Z-form.

The family is said to be regular if, along with (i)-(iii) of Definition 1, also the functions $f_{1},\dots,f_{s}$ are linearly independent on $\mathbb{S}$ . Further, it is said to be canonical if $w_{i}(\theta)=\theta_{i}$ for $i\in\{1,\ldots,k\}$ . The natural parameter space of this family is the set of all $\theta\in\mathbb{R}^{k}$ such that $[h({\bf{x}})+w(\theta)^{\top}f({\bf{x}})]^{1/(\alpha-1)}>0$ on $\mathbb{S}$ and $\int_{\mathbb{S}}[h({\bf{x}})+w(\theta)^{\top}f({\bf{x}})]^{1/(\alpha-1)}d{\bf{x}}<\infty$ .

Example 3

The Student distributions in (21) can be re-written as

[TABLE]

Let $S(\theta):=1+b_{\alpha}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ . Note that $S(\theta)>0$ if $\alpha\in((d-2)/d,1)$ . However, when $\alpha\notin((d-2)/d,1)$ , we consider the restricted parameter space such that $S(\theta)>0$ . Thus (32) can be re-written, for ${\bf{x}}\in\mathbb{S}$ , as

[TABLE]

Comparing (33) and (31), we see that Student distributions form a $d(d+3)/2$ -parameter $\mathbb{M}^{(\alpha)}$ -family with

[TABLE]

where

[TABLE]

This suggests a close relationship between the $\mathbb{M}^{(\alpha)}$ and $\mathbb{B}^{(\alpha)}$ families. In the following, we elucidate this relationship in more detail.

Remark 2

(a)

$\mathbb{M}^{(\alpha)}$ can be expressed as a $\mathbb{B}^{(\alpha)}$ : Any $p_{\theta}\in\mathbb{M}^{(\alpha)}$ as in (31) can be re-written, for ${\bf{x}}\in\mathbb{S}$ , as

[TABLE]

with $F(\theta)\equiv-1$ , $\widetilde{w}(\theta)=\big[Z(\theta)^{\alpha-1},Z(\theta)^{\alpha-1}w_{1}(\theta),\ldots,Z(\theta)^{\alpha-1}w_{s}(\theta)\big]^{\top}$ and $\widetilde{f}({\bf{x}})=\big[h({\bf{x}}),f_{1}({\bf{x}}),\ldots,f_{s}({\bf{x}})\big]^{\top}$ . This implies that these $p_{\theta}$ also form a $k$ -parameter $\mathbb{B}^{(\alpha)}$ -family but characterized by $1,\widetilde{f}$ and $\widetilde{w}$ .

(b)

$\mathbb{B}^{(\alpha)}$ can be expressed as an $\mathbb{M}^{(\alpha)}$ : Any $p_{\theta}\in\mathbb{B}^{(\alpha)}$ as in (19) can be re-written, for ${\bf{x}}\in\mathbb{S}$ , as

[TABLE]

with $Z(\theta)\equiv 1$ , $\widetilde{w}(\theta)=[F(\theta),w_{1}(\theta),\ldots,w_{s}(\theta)]^{\top}$ and $\widetilde{f}({\bf{x}})=[1,f_{1}({\bf{x}}),\ldots,f_{s}({\bf{x}})]^{\top}$ , or

[TABLE]

with $Z(\theta)=F(\theta)^{\frac{1}{\alpha-1}}$ , $\widetilde{w}(\theta)=\big[1/F(\theta),w_{1}(\theta)/F(\theta),\ldots,w_{s}(\theta)/F(\theta)\big]^{\top}$ and $\widetilde{f}({\bf{x}})=[h({\bf{x}}),f_{1}({\bf{x}}),\ldots,f_{s}({\bf{x}})]^{\top}$ , provided $F(\theta)>0$ . This implies that $p_{\theta}$ forms a $k$ -parameter $\mathbb{M}^{(\alpha)}$ -family as in (36) or in (37). However, as before, the characterizing entities when we view it as a member of $\mathbb{M}^{(\alpha)}$ are not the same as when we view it as a member of $\mathbb{B}^{(\alpha)}$ .

(c)

A regular $\mathbb{B}^{(\alpha)}$ may not be a regular $\mathbb{M}^{(\alpha)}$ : Notice that the number of $w_{i}$ ’s (and $f_{i}$ ’s) increases when we express a member of a $\mathbb{B}^{(\alpha)}$ as an $\mathbb{M}^{(\alpha)}$ . Thus, in general, (36) or (37) need not define a regular $\mathbb{M}^{(\alpha)}$ -family even if (19) defines a regular $\mathbb{B}^{(\alpha)}$ -family. This can be seen in the following example. Consider the 1-dimensional Student distributions with unit variance and $1/3<\alpha<1$ :

[TABLE]

where $N_{\alpha}$ is the normalizing factor which is independent of the unknown parameter $\mu$ . This can be viewed as a regular $\mathbb{B}^{(\alpha)}$ -family as

[TABLE]

with $h(x)=N_{\alpha}^{\alpha-1}+N_{\alpha}^{\alpha-1}{b}_{\alpha}x^{2}$ , $F(\mu)=N_{\alpha}^{\alpha-1}{b}_{\alpha}\mu^{2}$ , $w_{1}(\mu)=-2N_{\alpha}^{\alpha-1}{b}_{\alpha}\mu$ and $f_{1}(x)=x$ . Observe that (38) can be re-written as an $\mathbb{M}^{(\alpha)}$ -family as

[TABLE]

with $h(x)=1+{b}_{\alpha}x^{2}$ , $Z(\mu)=N_{\alpha}$ , $w_{1}(\mu)={b}_{\alpha}\mu^{2}$ , $f_{1}(x)=1$ , $w_{2}(\mu)=-2{b}_{\alpha}\mu$ and $f_{2}(x)=x$ . However, this does not define a regular $\mathbb{M}^{(\alpha)}$ , as the number of $w_{i}$ ’s (which is two) is not equal to the number of unknown parameters (which is one).

(d)

The normalizing factor in a $\mathbb{B}^{(\alpha)}$ -family may take negative values: Unlike the normalizing factor $Z(\theta)$ in an $\mathbb{M}^{(\alpha)}$ , $F(\theta)$ in a $\mathbb{B}^{(\alpha)}$ may take negative values for some $\theta$ (see Example 8 and [43, Ex. 3] for a comparison).
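The two representations in Remark 2(c) can be verified numerically. The sketch below assumes that the density (38) has the form $p_{\mu}(x)=N_{\alpha}[1+b_{\alpha}(x-\mu)^{2}]^{1/(\alpha-1)}$ , which is consistent with the functions $h$ , $F$ , $w_{i}$ and $f_{i}$ listed there, and that $N_{\alpha}$ is the standard Student-$t$ normalizer with $\nu=2/(1-\alpha)-1$ degrees of freedom. The function names are illustrative.

```python
import math

def student_1d_constants(alpha):
    """Return (nu, b_alpha, N_alpha) for d = 1 and unit scale."""
    d = 1
    nu = 2.0 / (1.0 - alpha) - d                  # inverts alpha = 1 - 2/(nu + d)
    b = (1.0 - alpha) / (2.0 - d * (1.0 - alpha))
    # Standard Student-t normalizer (assumed form of N_alpha).
    n_const = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return nu, b, n_const

def density(x, mu, alpha):
    _, b, n_const = student_1d_constants(alpha)
    return n_const * (1.0 + b * (x - mu) ** 2) ** (1.0 / (alpha - 1.0))

def b_form_rhs(x, mu, alpha):
    """h(x) + F(mu) + w_1(mu) f_1(x), which should equal p(x)^(alpha - 1)."""
    _, b, n_const = student_1d_constants(alpha)
    scale = n_const ** (alpha - 1.0)
    h = scale + scale * b * x ** 2
    big_f = scale * b * mu ** 2
    w1 = -2.0 * scale * b * mu
    return h + big_f + w1 * x

def m_form(x, mu, alpha):
    """Z(mu) [h(x) + w_1 f_1(x) + w_2 f_2(x)]^(1/(alpha-1)), f_1 = 1, f_2 = x."""
    _, b, n_const = student_1d_constants(alpha)
    h = 1.0 + b * x ** 2
    w1, w2 = b * mu ** 2, -2.0 * b * mu
    return n_const * (h + w1 * 1.0 + w2 * x) ** (1.0 / (alpha - 1.0))

alpha, mu = 0.5, 0.7            # alpha = 0.5 lies in (1/3, 1); here nu = 3
nu, b, _ = student_1d_constants(alpha)
checks = [(x, density(x, mu, alpha)) for x in (-3.0, 0.0, 0.7, 2.5)]
```

At every test point, `density(x, mu, alpha) ** (alpha - 1)` matches `b_form_rhs` and `density` matches `m_form`. The $\mathbb{B}^{(\alpha)}$ form needs one parameter function $w_{1}$ , whereas the $\mathbb{M}^{(\alpha)}$ form needs two for a single unknown $\mu$ , which is exactly why regularity is lost.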

In the following, we find conditions under which a regular $\mathbb{B}^{(\alpha)}$ can be expressed as a regular $\mathbb{M}^{(\alpha)}$ .

Proposition 4

A regular $\mathbb{B}^{(\alpha)}$ -family as in Definition 1 with $h$ being a non-zero constant also forms a regular $\mathbb{M}^{(\alpha)}$ -family characterized by the same functions $h$ and $f_{i}$ ’s, if $1+[F(\theta)/h]>0$ for $\theta\in\Theta$ and one of the following conditions holds.

(a)

$F(\theta)$ is identically a constant, or

(b)

$1$ , $F(\theta)$ , $w_{1}(\theta),\dots,w_{k}(\theta)$ are linearly independent.

Proof 2

Consider a regular $\mathbb{B}^{(\alpha)}$ -family with $h$ being identically a constant. Then from Definition 1, for $\textbf{x}\in\mathbb{S}$ , we have

[TABLE]

(39) can be re-written as

[TABLE]

where $S(\theta):=1+[F(\theta)\big{/}h]$ . Comparing (40) with (31) we see that $p_{\theta}$ ’s form an $\mathbb{M}^{(\alpha)}$ -family characterized by $h$ , $f_{1},\ldots,f_{k}$ . This family is regular if $1,w_{1}(\theta)/S(\theta),\ldots,w_{k}(\theta)/S(\theta)$ are linearly independent. Let

[TABLE]

for some scalars $c_{i}$ , $i\in\{0,\ldots,k\}$ . Using the value of $S(\theta)$ , we get

[TABLE]

If $F(\theta)$ is identically a constant then $c_{0}=c_{1}=\cdots=c_{k}=0$ , since $1,w_{1},\dots,w_{k}$ are linearly independent. Otherwise also $c_{0}=c_{1}=\cdots=c_{k}=0$ , if $1$ , $F(\theta),w_{1}(\theta),\ldots,w_{k}(\theta)$ are linearly independent.

In view of the above proposition, we now show that Student distributions also form a regular $\mathbb{M}^{(\alpha)}$ -family.

Corollary 5

Student distributions for $\nu>0$ (that is, $\alpha\in({(d-2)}/d,1)$ ) form a regular $\mathbb{M}^{(\alpha)}$ -family.

Proof 3

Recall that, for $\alpha\in({(d-2)}/d,1)$ , Student distributions form a regular $\mathbb{B}^{(\alpha)}$ -family with $h({\bf{x}})\equiv 1$ (Theorem 2). Hence, in view of Proposition 4, these also form a regular $\mathbb{M}^{(\alpha)}$ -family if $1$ , $F(\theta)$ , $w_{i}(\theta)$ ’s and $w_{ij}(\theta)$ ’s as described in Example 1 are linearly independent. To see this, let

[TABLE]

for some $c,c_{0},c_{i}$ and $c_{ij}\in\mathbb{R}$ , where $F(\theta)=N^{\alpha-1}_{\theta,\alpha}+b_{\alpha}N^{\alpha-1}_{\theta,\alpha}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}-1$ and $w_{i}$ ’s and $w_{ij}$ ’s are as defined in the proof of Theorem 2. Note that $\partial_{\boldsymbol{\mu}}[\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}]=2(\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu})$ and $\partial_{\boldsymbol{\mu}}[(\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu})^{\top}]=\boldsymbol{\Sigma}^{-1}$ . Hence taking partial derivative with respect to $\boldsymbol{\mu}$ in (41), we get

[TABLE]

where $\tilde{c}=[c_{1},\ldots,c_{d}]^{\top}$ . Since $|\boldsymbol{\Sigma}^{-1}|\neq 0$ , from (42), we have $c_{0}\boldsymbol{\mu}-\tilde{c}=\bf{0}$ , which, further upon taking partial derivative with respect to $\boldsymbol{\mu}$ , implies $c_{0}=c_{1}=\cdots=c_{d}=0$ . Thus (41) reduces to (27) of Theorem 2. Hence proceeding as in Theorem 2, we get $c_{ij}=0$ for $i,j\in\{1,\ldots,d\}$ , $i\leq j$ . This completes the proof.

Remark 3

Student distributions for $\nu\notin(0,\infty)$ do not form a regular $\mathbb{M}^{(\alpha)}$ as their support depends on the unknown parameters in this case.

Example 4

Wigner semi-circle distributions also form an $\mathbb{M}^{(\alpha)}$ -family.

II-C The $\mathscr{E}^{(\alpha)}$ -family

Next we define the parametric family $\mathscr{E}^{(\alpha)}$ . This is motivated by the work of Kumar and Sason [41] in connection with the forward $D_{\alpha}$ -projections on $\alpha$ -linear families, where they dealt only with discrete distributions.

Definition 6

Let $h,w,f,\Theta$ and $\mathbb{S}$ be as in Definition 1. The family of probability distributions $\{p_{\theta}:\theta\in\Theta\}$ is said to form a $k$ -parameter $\alpha$ -exponential family or an $\mathscr{E}^{(\alpha)}$ -family characterized by $h,w,f,\Theta$ and $\mathbb{S}$ if

[TABLE]

for some differentiable function $Z:\Theta\to\mathbb{R}$ . Here $Z(\theta)$ is the normalizing factor given by $Z(\theta)=1/\int_{\mathbb{S}}[h({\bf{x}})+w(\theta)^{\top}f({\bf{x}})]^{1/{(1-\alpha)}}d{\bf{x}}$ .

The family is said to be regular if, along with (i)-(iii) of Definition 1, also the functions $f_{1},\ldots,f_{s}$ are linearly independent on $\mathbb{S}$ . Further, it is said to be canonical if $w_{i}(\theta)=\theta_{i}$ for $i\in\{1,\ldots,k\}$ . The natural parameter space in this case is given by the set of all $\theta\in\mathbb{R}^{k}$ such that $[h({\bf{x}})+w(\theta)^{\top}f({\bf{x}})]^{1/{(1-\alpha)}}>0$ on $\mathbb{S}$ and $\int_{\mathbb{S}}[h({\bf{x}})+w(\theta)^{\top}f({\bf{x}})]^{1/{(1-\alpha)}}d{\bf{x}}<\infty$ .

Observe that, (45) with $h\equiv 1$ forms a $\phi$ -exponential family for $\phi(x)=x^{\alpha}$ studied in [51, 52, 48] (see also the references therein). However, if $h$ is not identically a constant, these two families are not the same.

Remark 4

Connection between $\mathscr{E}^{(\alpha)}$ and $\mathbb{M}^{(\alpha)}$ families:

(a)

Observe that (45) can be re-written, for $\bf{x}\in\mathbb{S}$ , as

[TABLE]

Let $\alpha^{\prime}=2-\alpha$ . Thus an $\mathscr{E}^{(\alpha)}$ -family can be expressed as an $\mathbb{M}^{(\alpha^{\prime})}$ -family characterized by the same entities, and vice-versa. This is also discussed in [64] for the specific case $h\equiv 1$ .

(b)

$\mathbb{M}^{(\alpha)}$ and $\mathscr{E}^{(\alpha)}$ families are related through an escort transformation. When $h\equiv 1$ , such escort transformations are studied in the context of non-extensive statistical physics [65, 51]. Karthik and Sundaresan [40, Theorem 2] derived this connection for discrete, canonical families. We now extend this to the more general $\mathbb{M}^{(\alpha)}$ and $\mathscr{E}^{(\alpha)}$ families as in (31) and (45).

Lemma 7

Let $\alpha\neq 0$ . The map $p\mapsto p^{(\alpha)}$ establishes a one-to-one correspondence between an $\mathscr{E}^{(\alpha)}$ -family characterized by $h,f,w,\Theta$ and $\mathbb{S}$ , and the $\mathbb{M}^{(1/\alpha)}$ -family characterized by the same entities, where $p^{(\alpha)}({\bf{x}})={p({\bf{x}})^{\alpha}}\big{/}{\int p({\bf{y}})^{\alpha}d{\bf{y}}}$ is the $\alpha$ -scaled measure (or the escort measure) associated with $p$ .

Proof 4

For any $p_{\theta}\in\mathscr{E}^{(\alpha)}$ characterized by $h,f$ and $w$ , from (45) we have, for ${\bf{x}}\in\mathbb{S}$ ,

[TABLE]

where $\|p_{\theta}\|^{\alpha}=\int p_{\theta}({\bf{x}})^{\alpha}d{\bf{x}}$ and $Z^{\prime}(\theta)=Z(\theta)^{\alpha}\big/\|p_{\theta}\|^{\alpha}$ . Hence $p_{\theta}^{(\alpha)}\in\mathbb{M}^{(1/\alpha)}$ characterized by the same functions $h,f$ and $w$ . So, the mapping is well-defined. The map is one-to-one, since it is easy to see that if $p_{\theta}^{(\alpha)}=p_{\eta}^{(\alpha)}$ for some $\theta,\eta\in\Theta$ then $p_{\theta}=p_{\eta}$ . To verify that it is onto, let $p\in\mathbb{M}^{(1/\alpha)}$ be arbitrary. Then, for ${\bf{x}}\in\mathbb{S}$ ,

[TABLE]

which implies

[TABLE]

and hence

[TABLE]

Thus $p^{(1/\alpha)}\in\mathscr{E}^{(\alpha)}$ and so $p^{(1/\alpha)}=p_{\theta}$ for some $\theta\in\Theta$ . It is now easy to show that $p_{\theta}^{(\alpha)}=p$ . Thus for any $p\in\mathbb{M}^{(1/\alpha)}$ characterized by $h,f$ and $w$ , there exists $p_{\theta}\in\mathscr{E}^{(\alpha)}$ characterized by the same functions such that $p^{(\alpha)}_{\theta}=p$ . Hence the mapping is onto.
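The correspondence in Lemma 7 can be illustrated on a small finite support. The following is a minimal sketch, assuming a one-parameter family with $h\equiv 1$ and $f(x)=x$ on $\{0,1,2,3\}$ ; the support, the values of $\alpha$ and $w$ , and the function names are illustrative choices.

```python
def normalize(weights):
    total = sum(weights)
    return [wt / total for wt in weights]

def e_family(alpha, h, w, f):
    """p proportional to [h(x) + w f(x)]^(1/(1 - alpha)) on a finite support."""
    return normalize([(hx + w * fx) ** (1.0 / (1.0 - alpha)) for hx, fx in zip(h, f)])

def m_family(alpha, h, w, f):
    """p proportional to [h(x) + w f(x)]^(1/(alpha - 1)) on a finite support."""
    return normalize([(hx + w * fx) ** (1.0 / (alpha - 1.0)) for hx, fx in zip(h, f)])

def escort(p, alpha):
    """alpha-scaled (escort) measure: p^alpha renormalized."""
    return normalize([px ** alpha for px in p])

support = [0, 1, 2, 3]
h = [1.0] * len(support)
f = [float(x) for x in support]
alpha, w = 0.5, 0.5

p = e_family(alpha, h, w, f)                 # member of E^(alpha)
p_escort = escort(p, alpha)                  # its image under p -> p^(alpha)
target = m_family(1.0 / alpha, h, w, f)      # member of M^(1/alpha), same h, w, f
recovered = escort(p_escort, 1.0 / alpha)    # inverse map used in the onto step
```

Here `p_escort` coincides with `target`, and applying the $1/\alpha$ -scaling to `p_escort` returns `p`, mirroring the well-definedness and onto steps of the proof.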

We now find the $\alpha$ -scaled Student distributions which form an $\mathscr{E}^{(1/\alpha)}$ -family, in view of Lemma 7.

Example 5 (Cauchy distributions)

Let us consider the $d$ -dimensional Student distributions $p_{\theta}$ as in (21). The $\alpha$ -scaled measure of $p_{\theta}$ is given by

[TABLE]

where

[TABLE]

is the normalizing factor and $\eta=[\mu_{i},\sigma_{ij}]^{\top}_{i,j\in\{1,\ldots,d\},i\leq j}$ . Observe that $q_{\eta}$ is a valid density function for $\alpha\in(-\infty,\min\{0,(d-2)/d\})\cup(d/(d+2),1)\cup(1,\infty)$ and it has full support for $\alpha\in(d/(d+2),1)$ . Notice that $q_{\eta}$ in (47) can be re-written, for $\bf{x}\in\mathbb{S}$ , as

[TABLE]

where $S(\eta):=1+b_{\alpha}\boldsymbol{\mu}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ . Using the notations $\beta=1/\alpha$ , $c_{\beta}=b_{1/\beta}$ and $M_{\eta,\beta}=\widetilde{N}_{\eta,{1/\beta}}$ , for ${\bf{x}}\in\mathbb{S}$ , we have

[TABLE]

where $\beta\in(d/(d-2),0)\cup(0,1)\cup(1,{(d+2)}\big/{d})$ for $d\leq 2$ (with $d/(d-2)$ read as $-\infty$ when $d=2$ ) and $\beta\in(-\infty,0)\cup(0,1)\cup(1,{(d+2)}\big/{d})$ for $d\geq 3$ . Comparing (48) with (45), we see that $q_{\eta}$ ’s form a $d(d+3)/2$ -parameter $\mathscr{E}^{(\beta)}$ -family with

[TABLE]

where

[TABLE]

Some special cases of (48) include the following:

(a)

The usual $d$ -dimensional Cauchy distributions correspond to $\beta=(d+3)/(d+1)$ .

(b)

The generalized Cauchy distributions studied in [57] correspond to $\beta=(1+\omega)/\omega$ and $\beta\in(1,3)$ .

(c)

The multivariate truncated generalized Cauchy distributions studied in [1, Eq. (2.3)] correspond to $\beta=1+2/(2\kappa+d)$ , where $\kappa$ equals the $\alpha$ in their paper, and $\beta\in(1,(d+2)/d)$ .

While studying the diffusion problem under Lévy distributions, Prato and Tsallis [55, Eq. (10)-(11)] found (48) as the maximizer of Rényi (or Tsallis) entropy subject to linear constraints on the $\alpha$ -scaled measure of the distribution. In [66, 52, 34, 48], these distributions were studied as $q$ -Gaussian distributions. However, we shall call them simply Cauchy distributions with location parameter $\boldsymbol{\mu}$ and scale parameter $\boldsymbol{\Sigma}$ .

Observe that the functions $w$ and $f$ in Cauchy distribution as in (48) are the same as the ones in Student distribution (33). Thus by a similar argument as described in Corollary 5, we can show that Cauchy distributions form a $d(d+3)/2$ -parameter regular $\mathscr{E}^{(\beta)}$ -family for $\beta\in(1,{(d+2)}\big{/}{d})$ . Note that for $\beta\notin(1,{(d+2)}\big{/}{d})$ , they do not define a regular family because, in this case, the support depends on the unknown parameters.
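The value $\beta=(d+3)/(d+1)$ quoted for the usual Cauchy distribution in Example 5 can be checked with exact rational arithmetic. The sketch below assumes that (48) carries the power $1/(1-\beta)$ of the $\mathscr{E}^{(\beta)}$ form in (45), and uses the fact that the $d$ -dimensional Cauchy density decays with power $-(d+1)/2$ . The function names are illustrative.

```python
from fractions import Fraction

def b_alpha(alpha, d):
    """b_alpha = (1 - alpha) / [2 - d (1 - alpha)], as defined after (21)."""
    return (1 - alpha) / (2 - d * (1 - alpha))

def cauchy_beta(d):
    """Value of beta quoted for the usual d-dimensional Cauchy distribution."""
    return Fraction(d + 3, d + 1)

rows = []
for d in (1, 2, 3):
    beta = cauchy_beta(d)
    exponent = 1 / (1 - beta)        # power of the bracket in the E^(beta) form
    c_beta = b_alpha(1 / beta, d)    # c_beta = b_{1/beta}
    rows.append((d, beta, exponent, c_beta))
```

For each $d$ the exponent equals $-(d+1)/2$ , $\beta$ lies in the regular range $(1,(d+2)/d)$ , and $c_{\beta}=1/3$ . The last value says that the usual Cauchy distribution arises here as the $\alpha$ -scaled measure of the Student distribution with $\nu=3$ , with the factor $1/3$ absorbed into the scale matrix.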

Example 6

Consider the Student distributions as in (33). In view of Remark 4(a), these form an $\mathscr{E}^{(\alpha^{\prime})}$ -family, where $\alpha^{\prime}=2-\alpha$ (that is, $\alpha^{\prime}\in(-\infty,1)\cup(1,(d+2)/d)\cup((d+2)/d,\infty)$ when $d\leq 2$ , and $\alpha^{\prime}\in(-\infty,1)\cup(1,(d+2)/d)\cup(2,\infty)$ otherwise), characterized by the same functions as in (34). Student distributions as an $\mathscr{E}^{(\alpha^{\prime})}$ -family have been studied in the literature, for example, in [52, 48]. Observe that, when $\alpha^{\prime}\in(1,(d+2)/d)$ , they indeed form a regular $\mathscr{E}^{(\alpha^{\prime})}$ -family, as this corresponds to $\alpha\in((d-2)/d,1)$ in their $\mathbb{M}^{(\alpha)}$ form.

Remark 5

Consider an exponential family of probability distributions where, for ${\bf{x}}\in\mathbb{S}$ ,

[TABLE]

Then each member of this family can also be re-written in any of the following equivalent forms:

[TABLE]

where $Z^{\prime}(\theta)=\ln Z(\theta)$ . Analogous to (49), (50) and (51), respectively, the probability distributions in $\mathscr{E}^{(\alpha)}$ , $\mathbb{M}^{(\alpha)}$ and $\mathbb{B}^{(\alpha)}$ -families can be expressed, for ${\bf{x}}\in\mathbb{S}$ , as

[TABLE]

where the $\alpha$ -exponential function $e_{\alpha}:[-\infty,\infty]\to(-\infty,\infty]$ is defined as

[TABLE]

The $\alpha$ -exponential function coincides with the usual exponential function as $\alpha\to 1$ . Hence the families $\mathscr{E}^{(\alpha)}$ , $\mathbb{M}^{(\alpha)}$ and $\mathbb{B}^{(\alpha)}$ coincide with the usual exponential family as $\alpha\to 1$ . Thus these three power-law families can be seen as generalizations of the exponential family. These power-law families are sometimes known as deformed exponential families (see, for example, [48]).

III Projection theorems for general power-law families

In this section, we extend the projection theorems of $B_{\alpha}$ , $\mathscr{I}_{\alpha}$ and $D_{\alpha}$ divergences to the general power-law families by directly solving the associated estimating equations. We also find conditions under which the new projection theorems reduce to the ones as in the canonical case. We shall begin by recalling the projection theorems known in the literature. In the following, assume that the families are canonical and regular with support $\mathbb{S}$ being finite and the parameter space $\Theta$ being the natural parameter space. Let $p_{n}$ denote the empirical distribution of sample ${X}_{1},\ldots,{X}_{n}$ .

(a)

Projection theorem for $B_{\alpha}$ -divergence: Consider a $\mathbb{B}^{(\alpha)}$ -family characterized by $f$ and $h=q^{\alpha-1}$ , where $q$ is a probability distribution with support $\mathbb{S}$ . The reverse $B_{\alpha}$ -projection of $p_{n}$ on $\mathbb{B}^{(\alpha)}$ satisfies

[TABLE]

where $\bar{f}:=[\bar{f_{1}},\ldots,\bar{f_{k}}]^{\top}$ , $\bar{f_{i}}:=\frac{1}{n}\sum_{j=1}^{n}f_{i}(X_{j})$ for $i\in\{1,\ldots,k\}$ and $\mathbb{E}_{\theta}[\cdots]$ denotes expected value with respect to $p_{\theta}$ (see Theorem 23 and Remark 12). Ohara and Wada [52, Prop. 3] established (53) for the continuous case with $h$ being identically a constant. Csiszár and Matúš [26] studied this for the general Bregman divergences.

(b)

Projection theorem of $\mathscr{I}_{\alpha}$ -divergence: Consider an $\mathbb{M}^{(\alpha)}$ -family characterized by $f$ and $h=q^{\alpha-1}$ , where $q$ is a probability distribution with support $\mathbb{S}$ . The reverse $\mathscr{I}_{\alpha}$ -projection of $p_{n}$ on $\mathbb{M}^{(\alpha)}$ satisfies

[TABLE]

where $\overline{h}:=\frac{1}{n}\sum_{j=1}^{n}h(X_{j})$ . This is due to [43, Theorem 18 and Theorem 21].

(c)

Projection theorem of $D_{\alpha}$ -divergence: Consider an $\mathscr{E}^{(\alpha)}$ -family characterized by $f$ and $h=q^{1-\alpha}$ , where $q$ is a probability distribution with support $\mathbb{S}$ . The reverse $D_{\alpha}$ -projection of $p_{n}$ on $\mathscr{E}^{(\alpha)}$ satisfies

[TABLE]

where $\mathbb{E}_{\theta^{(\alpha)}}[\cdots]$ denotes expectation with respect to $p_{\theta}^{(\alpha)}$ , $\overline{h}^{(\alpha)}$ and $\overline{f_{i}}^{(\alpha)}$ are respectively averages of $h$ and $f_{i}$ with respect to $p_{n}^{(\alpha)}$ . This is due to [41, Theorem 6].

Before we turn to the main results of this section we prove the following lemma and corollary that establish a connection between the generalized Hellinger and the Jones et al. estimating equations.

Lemma 8

The estimating equations (7) and (10) are the same up to the transformation $p\mapsto p^{(\alpha)}$ when $\mathbb{S}$ is discrete. In the continuous case the same is true between (8) and (10) provided the empirical distribution $p_{n}$ is replaced by a continuous estimate $\widetilde{p}_{n}$ and $\int\nabla[p_{\theta}({\bf{x}})]d{\bf{x}}=0$ .

Proof 5

We present the proof for the discrete case. The proof for the continuous case follows by a similar argument if we replace $p_{n}$ by $\widetilde{p}_{n}$ throughout in the proof.

The generalized Hellinger estimating equation (7) can be re-written as

[TABLE]

since $\sum\limits_{{\bf{x}}\in\mathbb{S}}p_{\theta}({\bf{x}})s({\bf{x}};\theta)=0$ . This can further be re-written as

[TABLE]

Observe that

[TABLE]

Hence

[TABLE]

where $A(\theta)=\nabla\ln\|p_{\theta}\|$ . Plugging (57) in (56),

[TABLE]

This is the same as the Jones et al. estimating equation (10) with $p_{n}$ , $p_{\theta}$ and $\alpha$ , respectively, replaced by $p_{n}^{(\alpha)}$ , $p_{\theta}^{(\alpha)}$ and $1/\alpha$ .

This, together with Lemma 7, establishes the following equivalence between the Jones et al. estimation and generalized Hellinger estimation.

Corollary 9

Suppose that $\mathscr{E}^{(\alpha)}$ is an $\alpha$ -exponential family characterized by $h,w,f,\Theta$ where all the distributions have a common support $\mathbb{S}$ . Then, under the assumptions of Lemma 8, solving the generalized Hellinger estimation problem under $\mathscr{E}^{(\alpha)}$ -family is equivalent to solving the Jones et al. estimation problem under the $\mathbb{M}^{(1/\alpha)}$ -family characterized by the same entities.

The following result extends the already known projection theorems of the divergences $B_{\alpha}$ , $\mathscr{I}_{\alpha}$ and $D_{\alpha}$ to the general power-law families as defined in Section II.

Theorem 10

Let ${\bf{X}}_{1},\ldots,{\bf{X}}_{n}$ be $n$ i.i.d. samples. Let $\Pi$ be one of the families $\mathbb{B}^{(\alpha)}$ , $\mathbb{M}^{(\alpha)}$ , or $\mathscr{E}^{(\alpha)}$ and assume that the support of $\Pi$ does not depend on the parameter $\theta\in\Theta$ . In (c), assume also that $\int{\partial_{r}}[p_{\theta}({\bf{x}})]d{\bf{x}}=0$ for $r\in\{1,\ldots,k\}$ . Then the following hold.

(a)

Basu et al. estimator under $\mathbb{B}^{(\alpha)}$ must satisfy

[TABLE]

(b)

Jones et al. estimator under $\mathbb{M}^{(\alpha)}$ must satisfy

[TABLE]

(c)

Generalized Hellinger estimator under $\mathscr{E}^{(\alpha)}$ must satisfy

[TABLE]

Here $\partial_{r}[w(\theta)]:=\big[\tfrac{\partial}{\partial\theta_{r}}[w_{1}(\theta)],\ldots,\tfrac{\partial}{\partial\theta_{r}}[w_{s}(\theta)]\big]^{\top}$ for $r\in\{1,\ldots,k\}$ . In (c), $\overline{h}^{(\alpha)}:=\mathbb{E}_{\widetilde{p}_{n}^{(\alpha)}}[h(\textbf{X})]$ and $\overline{f}^{(\alpha)}:=\mathbb{E}_{\widetilde{p}_{n}^{(\alpha)}}[f(\textbf{X})]$ , where $\widetilde{p}_{n}$ is the empirical distribution $p_{n}$ in the discrete case and a suitable continuous estimate of ${p}_{n}$ in the continuous case.

Proof 6

(a) If $p_{\theta}\in\mathbb{B}^{(\alpha)}$ then from Definition 1, for ${\bf{x}}\in\mathbb{S}$ ,

[TABLE]

Taking derivative with respect to $\theta_{r}$ for $r\in\{1,\ldots,k\}$ ,

[TABLE]

The Basu et al. estimating equation (9) can be re-written as

[TABLE]

Substituting (61) in (62), we get (58).

(b) If $p_{\theta}\in\mathbb{M}^{(\alpha)}$ , using Definition 3, for ${\bf{x}}\in\mathbb{S}$ ,

[TABLE]

Taking derivative with respect to $\theta_{r}$ for $r\in\{1,\ldots,k\}$ , we get

[TABLE]

Substituting this in the Jones et al. estimating equation (10),

[TABLE]

(c) This follows from (b) and Corollary 9.

Remark 6

(a)

Jones et al. and generalized Hellinger estimation under $\mathbb{B}^{(\alpha)}$ : Recall that a $\mathbb{B}^{(\alpha)}$ -family can be expressed as an $\mathbb{M}^{(\alpha)}$ -family as in (36) or (37). This implies that the Jones et al. estimator under $\mathbb{B}^{(\alpha)}$ -family satisfies (59) with $w$ , $f$ replaced by $\widetilde{w},\widetilde{f}$ as defined in (36) or (37). Further, in view of Remark 4(a), (36) or (37) is also an $\mathscr{E}^{(\alpha^{\prime})}$ -family where $\alpha^{\prime}=2-\alpha$ . Thus the $\alpha^{\prime}$ -generalized Hellinger estimator under $\mathbb{B}^{(\alpha)}$ -family must satisfy (60) with $w$ , $f$ and $\alpha$ replaced, respectively, by $\widetilde{w},\widetilde{f}$ and $\alpha^{\prime}$ .

(b)

Basu et al. and generalized Hellinger estimation under $\mathbb{M}^{(\alpha)}$ : An $\mathbb{M}^{(\alpha)}$ -family can be expressed as a $\mathbb{B}^{(\alpha)}$ -family as in (35). Thus the Basu et al. estimator under an $\mathbb{M}^{(\alpha)}$ -family must satisfy (58) with $w$ and $f$ replaced by $\widetilde{w},\widetilde{f}$ as defined in (35). Further, in view of Remark 4(a), the $\alpha^{\prime}$ -generalized Hellinger estimator under $\mathbb{M}^{(\alpha)}$ -family must satisfy (60) with $\alpha$ replaced by $\alpha^{\prime}$ .

(c)

Basu et al. and Jones et al. estimation under $\mathscr{E}^{(\alpha)}$ : An $\mathscr{E}^{(\alpha)}$ -family can be expressed as an $\mathbb{M}^{(\alpha^{\prime})}$ -family as in (46), and hence can be expressed as a $\mathbb{B}^{(\alpha^{\prime})}$ -family as in (35) with $\alpha$ replaced by $\alpha^{\prime}$ . Thus the $\alpha^{\prime}$ -Jones et al. estimator under $\mathscr{E}^{(\alpha)}$ -family satisfies (59). Similarly the $\alpha^{\prime}$ -Basu et al. estimator under $\mathscr{E}^{(\alpha)}$ -family satisfies (58) with $w$ and $f$ replaced by $\widetilde{w},\widetilde{f}$ as in (35).

We now show that, when the families are regular, the projection equations in Theorem 10 reduce to the ones in the canonical case.

Corollary 11

The estimating equations (58), (59) and (60), respectively, reduce to (53), (54) and (55) if the underlying families are regular.

Proof 7

Let us first observe that for a regular family the matrix $[\partial_{i}(w_{j}(\theta))]_{k\times k}$ is non-singular for $\theta\in\Theta$ . To see this, let

[TABLE]

for some scalars $c_{1},\ldots,c_{k}$ and for each $r\in\{1,\dots,k\}$ . Then

[TABLE]

for some constant $c$ . Now linear independence of $1$ , $w_{1},\dots,w_{k}$ implies that $c=c_{1}=\cdots=c_{k}=0$ . Consider a regular $\mathbb{B}^{(\alpha)}$ -family. Then from (58), we have

[TABLE]

Since $[\partial_{i}(w_{j}(\theta))]_{k\times k}$ is non-singular, (63) reduces to $\mathbb{E}_{\theta}[f(\textbf{X})]=\bar{f}$ . Again (59) can be re-written as

[TABLE]

For a regular $\mathbb{M}^{(\alpha)}$ -family, this reduces to

[TABLE]

This implies

[TABLE]

That is,

[TABLE]

Hence

[TABLE]

Substituting this back in (64),

[TABLE]

In a similar fashion, the result for regular $\mathscr{E}^{(\alpha)}$ -family can be shown.

Theorem 10 fails if the support of the underlying family depends on the parameters. We show this by an example in Section IV-B.

The Basu et al. estimating equation (9) differs from the Jones et al. estimating equation (10) in that the weights in the latter are normalized. Much research has been done to compare these two methods (for example, see [38]). We saw in Section II that a regular $\mathbb{B}^{(\alpha)}$ -family can be viewed as a regular $\mathbb{M}^{(\alpha)}$ -family under some conditions. In the following, we show that the two estimations coincide on a regular $\mathbb{B}^{(\alpha)}$ -family with $h$ being a non-zero constant (or on a regular $\mathbb{M}^{(\alpha)}$ -family with $h$ being a non-zero constant).

Theorem 12

For a regular $\mathbb{B}^{(\alpha)}$ -family with $h$ being identically a non-zero constant, Basu et al. estimating equation (58) and Jones et al. estimating equation (59) are the same.

Proof 8

Consider the $\mathbb{B}^{(\alpha)}$ -family as in (39). If it is regular, from Corollary 11, the Basu et al. estimating equation is given by

[TABLE]

We now show that the Jones et al. estimating equation (59) for this family is also the same. Recall that (39) can be written as an $\mathbb{M}^{(\alpha)}$ -family as in (40). The proof is divided into two parts.

(i)

Suppose that $F(\theta)$ is linearly independent with $1,w_{i}(\theta)$ ’s. Then (40) forms a regular $\mathbb{M}^{(\alpha)}$ -family by Proposition 4. Therefore, using Corollary 11, we see that the Jones et al. estimating equation (59) for (40) is the same as (65), since $h$ is identically a constant.

(ii)

Next, suppose that $F(\theta)$ is linearly dependent on $1,w_{i}(\theta)$ ’s. Then there exist scalars $c_{0},c_{1},\ldots,c_{k}$ (not all zero) such that

[TABLE]

Then

[TABLE]

Using Theorem 10(b), the Jones et al. estimating equation (59) for (40) is given, for $r\in\{1,\ldots,k\}$ , by

[TABLE]

Substituting the value of $S(\theta)$ , an easy calculation yields, for $r\in\{1,\ldots,k\}$ ,

[TABLE]

Using (66) and (67) in (68),

[TABLE]

where $c=[c_{1},\ldots,c_{k}]^{\top}$ . Since $1,w_{1},\ldots,w_{k}$ are linearly independent, $[\partial_{i}(w_{j}(\theta))]_{k\times k}$ is non-singular. Using this, (69) becomes

[TABLE]

Proceeding as in Corollary 11,

[TABLE]

That is,

[TABLE]

Hence $\mathbb{E}_{\theta}[f(\textbf{X})+c]=\bar{f}+c$ , and thus $\mathbb{E}_{\theta}[f(\textbf{X})]=\bar{f}$ .

Corollary 13

For a regular $\mathbb{M}^{(\alpha)}$ -family as in (31) with $h$ being identically a non-zero constant, the Basu et al. and the Jones et al. estimating equations are the same if $Z^{1-\alpha}$ is linearly independent with $w_{i}$ ’s.

IV Applications: Generalized estimation under Student and Cauchy distributions

In this section we find Jones et al. estimators [38] for the parameters of Student distributions for $\nu\in(0,\infty)$ and generalized Hellinger estimators [6] for Cauchy distributions for $\beta\in(1,(d+2)/d)$ . For the estimation of Cauchy distributions we use a kernel density estimate of the empirical measure. We also find a robust estimator of the mean parameter of Student distributions for the case when $\nu\notin(0,\infty)$ .

IV-A Basu et al. [5] and Jones et al. [38] estimation under Student distributions

In Theorem 2 we saw that for $\alpha\in\big{(}(d-2)/d,1\big{)}$ (that is, for $\nu\in(0,\infty)$ ) Student distributions form a $d(d+3)/2$ -parameter regular $\mathbb{B}^{(\alpha)}$ -family with $f^{(1)}({\bf{x}})={\bf{x}}$ and $f^{(2)}({\bf{x}})=\rm{vec}({\bf{x}}{\bf{x}}^{\top})$ . Hence, to find the Basu et al. estimators of the parameters, the mean and variance of the distribution must be finite. However, as we saw in Example 1, (1) does not have finite mean and variance for $\alpha\in\big{(}(d-2)/d,d/(d+2)\big{]}$ . Hence we restrict ourselves to Student distributions for $\alpha\in\big{(}d/(d+2),1\big{)}$ . The mean and the covariance of a Student distribution for $\alpha\in\big{(}d/(d+2),1\big{)}$ are given by $\boldsymbol{\mu}$ and $\boldsymbol{K}:=(k_{ij})_{d\times d}=[\nu/(\nu-2)]\cdot\boldsymbol{\Sigma}$ respectively. Let $\textbf{X}_{1},\ldots,\textbf{X}_{n}$ be an i.i.d. sample where each $\textbf{X}_{i}=[X_{1i},\ldots,X_{di}]^{\top}$ for $i\in\{1,\ldots,n\}$ . Suppose also that the true distribution $p$ is a Student distribution as in (1). Using Corollary 11, the Basu et al. estimators of $\boldsymbol{\mu}$ and $\boldsymbol{K}$ are given, for $i,j\in\{1,\ldots,d\}$ and $i\leq j$ , by

[TABLE]

where $\overline{\textbf{X}}=\frac{1}{n}\sum_{i=1}^{n}\textbf{X}_{i}$ .

Next consider the Student distributions as in (33) with $\alpha\in\big{(}d/(d+2),1\big{)}$ . We saw that it forms a $d(d+3)/2$ -parameter regular $\mathbb{M}^{(\alpha)}$ -family with $h\equiv 1$ . Hence, from Theorem 12, the Jones et al. estimators for $\boldsymbol{\mu}$ and $\boldsymbol{K}$ are the same as the Basu et al. estimators as in (70). In [30] and [33, Theorem 5] this was solved directly from the estimating equation (59). We summarize the above results in the following.

Theorem 14

For $\alpha\in(d/(d+2),1)$ , the Basu et al. and the Jones et al. estimators of mean and covariance parameters of a $d$ -dimensional Student distribution as in (21) are the same and are given by (70).
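Theorem 14 can be illustrated with a short sketch. It assumes that (70) is the sample mean together with the $1/n$-normalized sample covariance, which is how Section V describes these estimators (they coincide with the normal-distribution MLE); the function name and the extra return value $\widehat{\boldsymbol{\Sigma}}=[(\nu-2)/\nu]\widehat{\boldsymbol{K}}$ are illustrative choices, not notation from the paper.

```python
def student_generalized_estimates(samples, nu):
    """Sketch of the Basu et al. / Jones et al. estimates for a Student model.

    samples : list of d-dimensional points (each a list of floats)
    nu      : degrees of freedom, assumed known and > 2 so that K is finite

    Assumption: (70) is the sample mean and the 1/n sample covariance.
    """
    if nu <= 2:
        raise ValueError("the covariance K is finite only for nu > 2")
    n, d = len(samples), len(samples[0])
    # Estimate of the mean parameter mu: the sample mean.
    mu = [sum(x[i] for x in samples) / n for i in range(d)]
    # Estimate of the covariance K: biased (1/n) sample covariance.
    K = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in samples) / n
          for j in range(d)] for i in range(d)]
    # Scale matrix recovered from K = [nu / (nu - 2)] * Sigma.
    scale = (nu - 2.0) / nu
    Sigma = [[scale * K[i][j] for j in range(d)] for i in range(d)]
    return mu, K, Sigma


demo_mu, demo_K, demo_Sigma = student_generalized_estimates(
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]], nu=4.0)
```

No iteration is needed here, in contrast with the MLE of a Student distribution, which the text notes has no closed form.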

Remark 7

It can be shown that, as $\nu\to+\infty$ , Student distributions coincide with a normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{K}$ [37]. Similarly, as $\alpha\to 1^{-}$ , the Basu et al. estimating equation or the Jones et al. estimating equation becomes the ML estimating equation (5). Thus there is a continuity of the generalized estimators of mean and covariance parameters as $\nu\to+\infty$ . This suggests that when the samples are from a Student distribution with sufficiently large $\nu$ , the MLE of its parameters can be approximated by a generalized estimator of the respective parameters. (Note that the MLE of a Student distribution does not have a closed-form solution and numerical methods must be used to compute it [2, 29].) Simulations suggest that, for $\nu\geq 40$ , the generalized estimators (70) are close to the MLE even for small sample sizes.

For $\alpha\notin((d-2)/d,1)$ , the support of Student distributions depends on the parameters. Thus Theorem 10 cannot be used to find the estimators. However, in this case, one can find the estimators by maximizing the respective likelihood function as described in the following.

IV-B Jones et al. estimation [38] under Student distributions for $\nu\notin(0,\infty)$

For simplicity we deal only with the one-dimensional case. Suppose that $X_{1},\ldots,X_{n}$ is an i.i.d. sample where $X_{1}\leq\cdots\leq X_{n}$ . Suppose also that the true distribution $p$ is a Student distribution with some known $\alpha>1$ (that is, $\nu<-1$ ) and variance, say $\sigma^{2}=1$ :

[TABLE]

where $N_{\alpha}$ is the normalizing factor. The support of $p_{\mu}$ is given by $\mathbb{S}=\{x:\mu-c_{\alpha}\leq x\leq\mu+c_{\alpha}\}$ , where $c_{\alpha}:=\sqrt{-1\big{/}b_{\alpha}}$ . (Recall that $b_{\alpha}<0$ for $\alpha>1$ ). Observe that (71) defines an $\mathbb{M}^{(\alpha)}$ -family whose support depends on the unknown parameter. We now show that the Jones et al. estimator of $\mu$ could be different from $\overline{X}$ . Since the support of $p_{\mu}$ depends on $\mu$ , we cannot apply Theorem 10. Hence we resort to the maximization of the associated likelihood function:

[TABLE]

The likelihood function, for (71), becomes

[TABLE]

where ${\bf 1}(\cdot)$ denotes the indicator function and $I_{i}=[X_{i}-c_{\alpha},X_{i}+c_{\alpha}]$ , $i\in\{1,\ldots,n\}$ . Observe that the maximizer of $L_{3}^{(\alpha)}(\mu)$ is the same as the maximizer of

[TABLE]

(Note that the Basu et al. likelihood function (13) for the model in (71) also reduces to (72); hence both the estimators are the same in this case.) It is clear from (72) that $\ell^{(\alpha)}(\mu)$ is positive if and only if $\mu$ lies in at least one $I_{i}$ for $i\in\{1,\ldots,n\}$ . Thus, to find the maximizer $\widehat{\mu}$ of $\ell^{(\alpha)}(\mu)$ , we only need to consider the cases when $\mu$ lies in one of the $I_{i}$ ’s.

Consider $I_{1}$ . If $I_{1}$ is disjoint from all other $I_{j}$ for $j\neq 1$ , then $\ell^{(\alpha)}(\mu)$ equals $[1+b_{\alpha}(X_{1}-\mu)^{2}]$ for $\mu\in I_{1}$ . Similarly, if $I_{1}\cap I_{2}\neq\emptyset$ but all other $I_{j}$ ’s, for $j\notin\{1,2\}$ , are disjoint from $I_{1}$ , then the value of $\ell^{(\alpha)}(\mu)$ in $I_{1}$ is given by:

[TABLE]

In general, if $I_{2},I_{3},\ldots,I_{k}$ for some $k\in\{2,\dots,n\}$ satisfy $\cap_{i=1}^{k}I_{i}\neq\emptyset$ and all other $I_{j}$ ’s for $j\notin\{1,\dots,k\}$ are disjoint from $I_{1}$ , then $I_{1}$ can be divided into $k$ disjoint sub-intervals and in each of these sub-intervals the value of $\ell^{(\alpha)}(\mu)$ is given, for $\mu\in\big{(}\cap_{i=1}^{j}I_{i}\big{)}\setminus\big{(}\cup_{i=j+1}^{k}I_{i}\big{)},~{}j\in\{1,\ldots,k\}$ , where $\cup_{k+1}^{k}I_{i}:=\emptyset$ , by

[TABLE]

Let us define ${\bf{1}}(\mu\in\emptyset)=0$ . Then for $\mu\in I_{1}$ , we can write

[TABLE]

Let $I_{2}^{\prime}:=I_{2}\setminus I_{1}$ . Then proceeding as above, for $\mu\in I_{2}^{\prime}$ , we have

[TABLE]

In general, let $I_{k}^{\prime}:=I_{k}\setminus\big{(}\cup_{i=1}^{k-1}I_{i}\big{)}$ , $k\in\{1,\ldots,n\}$ . Then for $\mu\in I_{k}^{\prime}$ ,

[TABLE]

Hence for $\mu\in I_{k}^{\prime}$ , $k\in\{1,\dots,n\}$ , we can divide $I_{k}^{\prime}$ into at most $(n-k+1)$ sub-intervals $I_{k}^{j}:=\big{[}\big{(}I_{k}^{\prime}\cap(\cap_{i=k}^{j}I_{i})\big{)}\setminus\big{(}\cup_{i=j+1}^{n}I_{i}\big{)}\big{]}$ , $j\in\{k,\dots,n\}$ , such that the indicator functions in (74) will be positive for $\mu$ in any of these sub-intervals. For example, in Figure 1 we considered a case where $I_{1}^{\prime}$ can be divided into three disjoint sub-intervals, namely $I_{1}^{1}$ , $I_{1}^{2}$ and $I_{1}^{3}$ , and $I_{2}^{\prime}$ can be divided into two disjoint sub-intervals $I_{2}^{3}$ and $I_{2}^{4}$ , and so on. The maximizer of $\ell^{(\alpha)}(\mu)$ for $\mu$ in each of these sub-intervals $I_{k}^{j}$ can be found in the following way.

Let $k$ and $j$ be such that $I_{k}^{j}\neq\emptyset$ . Then the following possible cases appear:

(i)

$j=k$ and $I_{k}^{j+1}\neq\emptyset$ :

[TABLE]

(ii)

$j=k$ and $I_{k}^{j+1}=\emptyset$ :

[TABLE]

(iii)

$j>k$ and $I_{k}^{j+1}\neq\emptyset$ :

[TABLE]

(iv)

$j>k$ and $I_{k}^{j+1}=\emptyset$ :

[TABLE]

Also from (74), for $\mu\in I_{k}^{j}$ , $\ell^{(\alpha)}(\mu)=\sum\limits_{i=k}^{j}[1+b_{\alpha}(X_{i}-\mu)^{2}]$ . Since

[TABLE]

$\ell^{(\alpha)}(\mu)$ is monotone increasing for $\mu\leq\frac{1}{j-k+1}\sum_{i=k}^{j}X_{i}$ , and monotone decreasing otherwise. Thus the local maximizer of $\ell^{(\alpha)}(\mu)$ in any non-empty $I_{k}^{j}$ for $j\in\{k,k+1,\dots,n\}$ is given by

[TABLE]

where $I_{L}$ and $I_{R}$ are respectively the left and right ends of the sub-interval $I_{k}^{j}$ as in (i)-(iv).
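The monotonicity claim can be verified by differentiating the sum directly. The display below is a reconstruction from the expression for $\ell^{(\alpha)}(\mu)$ on $I_{k}^{j}$ given above, not a quotation of the paper's own display; the shorthand $\overline{X}_{k:j}$ is notation assumed here.

```latex
\frac{d}{d\mu}\,\ell^{(\alpha)}(\mu)
  = -2b_{\alpha}\sum_{i=k}^{j}(X_{i}-\mu)
  = -2b_{\alpha}\,(j-k+1)\,\big(\overline{X}_{k:j}-\mu\big),
\qquad
\overline{X}_{k:j} := \frac{1}{j-k+1}\sum_{i=k}^{j}X_{i}.
```

Since $b_{\alpha}<0$ for $\alpha>1$ , the derivative has the sign of $\overline{X}_{k:j}-\mu$ . On $[I_{L},I_{R}]$ the maximizer is therefore $\overline{X}_{k:j}$ when it lies in the sub-interval and the nearer endpoint otherwise, that is, $\min\{\max\{\overline{X}_{k:j},I_{L}\},I_{R}\}$ .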

Figure 2 shows two different cases of maximizer of $\ell^{(\alpha)}(\mu)$ for $\alpha=2$ ( $c_{\alpha}=\sqrt{5}$ here) in the sub-interval $[X_{2}-\sqrt{5},X_{3}-\sqrt{5}]$ .

Observe that, in this process we divide $\cup_{i=1}^{n}I_{i}$ into a finite number of non-empty disjoint sub-intervals such that $\ell^{(\alpha)}(\mu)$ is positive if and only if $\mu$ lies in one of these sub-intervals. In each of these sub-intervals, $\ell^{(\alpha)}(\mu)$ has a unique maximizer. Thus we have a finite number of local maximizers of $\ell^{(\alpha)}(\mu)$ in $\cup_{i=1}^{n}I_{i}$ , and hence the global maximizer is one among these local maximizers. This implies that the global maximizer can be different from the sample mean.
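The search over local maximizers can be sketched in a few lines. This is a minimal illustration rather than the procedure exactly as indexed above: instead of the sub-intervals $I_{k}^{j}$ , it cuts the real line at all endpoints $X_{i}\pm c_{\alpha}$ , which yields pieces on which the set of active observations is constant. The function name and the toy data are hypothetical, and $b_{\alpha}=-1/c_{\alpha}^{2}$ is obtained from $c_{\alpha}=\sqrt{-1/b_{\alpha}}$ .

```python
import math


def jones_location_estimate(xs, c):
    """Maximise l(mu) = sum_i [1 + b (x_i - mu)^2] 1(|x_i - mu| <= c), b = -1/c^2.

    Returns (mu_hat, l(mu_hat)).  Sketch only: the line is cut at every
    endpoint x_i - c and x_i + c, so each piece has a fixed active set.
    """
    b = -1.0 / c ** 2

    def ell(mu):
        return sum(1.0 + b * (x - mu) ** 2 for x in xs if abs(x - mu) <= c)

    cuts = sorted({x - c for x in xs} | {x + c for x in xs})
    best_mu, best_val = None, -math.inf
    for left, right in zip(cuts, cuts[1:]):
        mid = 0.5 * (left + right)
        active = [x for x in xs if abs(x - mid) <= c]
        if not active:
            continue  # l(mu) = 0 on this piece
        # Concave quadratic on the piece: clip the active mean to [left, right].
        candidate = min(max(sum(active) / len(active), left), right)
        value = ell(candidate)
        if value > best_val:
            best_mu, best_val = candidate, value
    return best_mu, best_val


# Hypothetical toy sample with one outlier; c = sqrt(5) corresponds to alpha = 2.
toy = [0.0, 0.5, 1.0, 10.0]
mu_hat, ell_hat = jones_location_estimate(toy, math.sqrt(5.0))
sample_mean = sum(toy) / len(toy)
```

On this toy sample the three clustered points share a piece whose active mean is $0.5$ , while the outlier at $10$ contributes only an isolated local maximum, so the estimate stays near the cluster and the sample mean does not. Each summand vanishes at the endpoints of its own interval, so $\ell^{(\alpha)}$ is continuous and comparing the per-piece maximizers is enough.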

Remark 8

Observe that $c_{\alpha}\to\infty$ when $\alpha\to 1$ . Thus for $\alpha\to 1$ , the length of the intervals $I_{i}=[X_{i}-c_{\alpha},X_{i}+c_{\alpha}]$ for $i\in\{1,\dots,n\}$ increases, and hence all the indicator functions in (72) become positive for any $\mu\in\mathbb{R}$ . This implies that the maximizer of $\ell^{(\alpha)}(\mu)$ is the usual sample mean $\overline{X}$ . This agrees with the case $\alpha=1$ , as the MLE of the mean parameter of a normal distribution is $\overline{X}$ . Recall that the likelihood function $L_{3}^{(\alpha)}$ coincides with the usual $\log$ likelihood function and the Student distribution coincides with the normal distribution as $\alpha\to 1$ .

To demonstrate the algorithm, we generated the following random sample from the mixture $0.8p+0.2\mathcal{N}(10,1)$ , where $p$ is the Student distribution with $\alpha=2$ , $\mu=0$ and $\sigma=1$ :

[TABLE]

Consider $\mu\in I^{\prime}_{1}=[-3.9648,0.5074]$ . Then $I_{1}^{j}\neq\emptyset$ for all $j\in\{1,\dots,16\}$ . Using the formula (79), we have the following local maximizers of $\ell^{(\alpha)}(\mu)$ in each of these sub-intervals $I_{1}^{j}$ , $j\in\{1,\dots,16\}$ , of $I_{1}^{\prime}$ :

[TABLE]

Next consider $\mu\in I_{2}^{\prime}=[0.5074,1.06]$ . Then the indicator functions in (IV-B) are positive only when $j=16$ , that is, $I_{2}^{j}\neq\emptyset$ only for $j=16$ . Using (79), we get the maximizer of $\ell^{(2)}(\mu)$ in $I_{2}^{16}(=I_{2}^{\prime})$ is $0.5074$ . Similarly, we have only one maximizer in each $I_{i}^{\prime}$ for $i\in\{3,\dots,16\}$ and they respectively are:

[TABLE]

Next consider $\mu\in I_{17}^{\prime}=[6.4140,10.8862]$ . Then $I_{17}^{j}\neq\emptyset$ for $j\in\{17,\dots,20\}$ . We then have four local maximizers of $\ell^{(2)}(\mu)$ in four sub-intervals of $I_{17}^{\prime}$ , namely $I_{17}^{j}$ for $j\in\{17,\dots,20\}$ , and they respectively are $8.4893$ , $9.6878$ , $10.7150,10.8862$ . Similarly for $\mu\in I_{i}^{\prime},i\in\{18,19,20\}$ , $\ell^{(2)}(\mu)$ has only one local maximizer in each $I_{i}^{\prime}$ and they respectively are $12.1766,12.9615,15.0055$ .

Comparing the values of $\ell^{(2)}(\mu)$ at each of the local maximizers, we get $0.3674$ as the global maximizer of $\ell^{(2)}(\mu)$ . Hence the Jones et al. (or Basu et al.) estimator for $\mu$ is $\widehat{\mu}=0.3674$ , which is different from $\overline{X}=1.2940$ . Also, the MLE and the MLE after deleting the outliers are, respectively, 2.1733 and 0.1721. We repeated this exercise with $p$ being a Student distribution with mean zero and $\nu=-5$ ( $\alpha=1.5$ ). The results are shown in Table I. We observe that the Jones et al. (or Basu et al.) estimator gets closer to the true parameter (and to the ML-estimator without outliers) as $\alpha$ approaches $1$ from the right.

IV-C Generalized Hellinger estimation under Cauchy distributions

Consider the Cauchy distributions as in (48). Here we find the generalized Hellinger estimators for their location and scale parameters using a kernel density estimate for the sample empirical measure. Let $\textbf{X}_{i}=[X_{1i},\dots,X_{di}]^{\top}$ , $i\in\{1,\dots,n\}$ be an i.i.d. sample. Let $\widetilde{p}_{n}$ be a suitable continuous density estimator of the empirical measure $p_{n}$ . When $\beta\in\big{(}1,(d+2)/d\big{)}$ , we saw that Cauchy distributions form a regular $\mathscr{E}^{(\beta)}$ -family. Thus in this case we use Corollary 11 to estimate its parameters. But for $\beta\notin\big{(}1,(d+2)/d\big{)}$ , the support of this distribution depends on the unknown parameters. In this case one can estimate the parameters by maximizing the associated likelihood function as we did for Student distributions in Section IV-B.

Let $\beta\in\big{(}1,(d+2)/d\big{)}$ . The characterizing entities of Cauchy distributions as a regular $\mathscr{E}^{(\beta)}$ are respectively $h({\bf{x}})\equiv 1$ , $f^{(1)}({\bf{x}})={\bf{x}}$ and $f^{(2)}({\bf{x}})=\text{vec}({\bf{x}}{\bf{x}}^{T})$ (see Example 5). Using (55), we therefore have the following estimating equations

[TABLE]

Let us find $\mathbb{E}_{\eta^{(\beta)}}[\textbf{X}]$ and $\mathbb{E}_{\eta^{(\beta)}}[\text{vec}(\textbf{X}\textbf{X}^{T})]$ . In Example 5, we saw that $p_{\theta}^{(\alpha)}=q_{\eta}$ , where $\alpha=1/\beta$ . Hence $q_{\eta}^{(\beta)}=p_{\theta}$ . Thus

[TABLE]

where ${\bf{X}}=[X_{1},\dots,X_{d}]^{T}$ and $k_{ij}=[\nu/(\nu-2)]\cdot\sigma_{ij}$ . Using this in (81), we get

[TABLE]

for $i,j\in\{1,\dots,d\}$ , $i\leq j$ . Thus generalized Hellinger estimators of the location and scale parameters are

[TABLE]

for $i,j\in\{1,\dots,d\}$ and $i\leq j$ . We summarize these in the following theorem.

Theorem 15

Let $\beta\in\big{(}1,(d+2)/d\big{)}$ and $\widetilde{p}_{n}$ be a suitable continuous estimate of the sample empirical measure. Then the generalized Hellinger estimators of the location and scale parameters of a $d$ -dimensional Cauchy distribution are given by (84).

Remark 9

In Example 6, we saw that Student distributions form a regular $\mathscr{E}^{(\alpha^{\prime})}$ -family for $\alpha^{\prime}\in\big{(}1,(d+2)/d\big{)}$ . Thus one can do the $\alpha^{\prime}$ -generalized Hellinger estimation on Student distributions as well.

Notice that the estimators in (84) involve a continuous estimate $\widetilde{p}_{n}$ of $p_{n}$ . In the following we present examples where we use ‘kernel density estimation’ to find such $\widetilde{p}_{n}$ and use it to find the estimators. In the literature the commonly used kernel to estimate the $d$ -dimensional empirical measure is of the following form (see [61]):

[TABLE]

where $\xi$ is a symmetric distribution on $\mathbb{R}^{d}$ and $\{h_{n}\}$ is a sequence of real numbers with suitable properties, called the bandwidth. The properties of the kernel $\xi$ and the bandwidth $h_{n}$ greatly influence the performance of the estimators. There is no general theory in the literature for choosing the right continuous estimate for a given problem. However, authors like Beran [7], Tamura and Boos [61], and Simpson [58] imposed conditions on $\xi$ and $h_{n}$ so that the estimators perform better in their specific setting. We shall use the following two kernels, namely the uniform kernel and the Epanechnikov kernel, with bandwidth $h_{n}=n^{-1/2d}$ (that is, $n^{-1/(2d)}$ ) for $n\geq 1$ to find the estimators. Both the kernels and $h_{n}$ satisfy the following conditions which guarantee the $L_{1}$ convergence of $\widetilde{p}_{n}$ to the true density [61, Lem. 3.1] (see also [6, Sec. 3.3] and the references therein):

(i)

$\xi$ is symmetric about the origin and has compact support.

(ii)

$\lim\limits_{n\to\infty}h_{n}=0$ and $\lim\limits_{n\to\infty}[h_{n}+(nh_{n}^{d})^{-1}]=0$ .
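For the bandwidth used here, condition (ii) can be checked directly. The computation below is an added verification, reading the exponent of $h_{n}$ as $-1/(2d)$ , which agrees with $h_{n}=1/\sqrt{n}$ for $d=1$ in Table II.

```latex
h_{n} = n^{-1/(2d)} \xrightarrow[n\to\infty]{} 0,
\qquad
n\,h_{n}^{d} = n\cdot n^{-1/2} = n^{1/2},
\qquad
h_{n} + \big(n\,h_{n}^{d}\big)^{-1} = n^{-1/(2d)} + n^{-1/2} \xrightarrow[n\to\infty]{} 0 .
```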

We first use the $d$ -dimensional uniform kernel which is defined as follows.

[TABLE]

Let us denote $[X_{1i}-n^{-1/2d},X_{1i}+n^{-1/2d}]\times\dots\times[X_{di}-n^{-1/2d},X_{di}+n^{-1/2d}]$ by $\big{[}\textbf{X}_{i}-n^{-1/2d},\textbf{X}_{i}+n^{-1/2d}\big{]}$ for $i\in\{1,\dots,n\}$ and call them rectangles. Assume that all these rectangles are disjoint (these are actually disjoint for sufficiently large $n$ ). Then from (85) we have

[TABLE]

That is,

[TABLE]

Thus $\widetilde{p}_{n}$ is the uniform distribution on $\bigcup_{i=1}^{n}\big{[}\textbf{X}_{i}-n^{-1/2d},\textbf{X}_{i}+n^{-1/2d}\big{]}.$ This implies that the $\beta$ -scaled distribution $\widetilde{p}_{n}^{(\beta)}$ is the same as $\widetilde{p}_{n}$ . Therefore, we have

[TABLE]

and

[TABLE]

Hence the estimators for the parameters are given by

[TABLE]

for $i,j=1,\dots,d$ and $i\leq j$ , where $\epsilon_{n}=\frac{1}{3n^{1/d}}$ if $i=j$ and equals zero otherwise.
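The correction term $\epsilon_{n}$ is the second moment of a uniform distribution on $[-h_{n},h_{n}]$ , namely $h_{n}^{2}/3=1/(3n^{1/d})$ . The sketch below checks this for $d=1$ by comparing a closed form with a midpoint-rule integration of the uniform kernel density estimate. The function names are hypothetical, and the closed form (sample moments plus $\epsilon_{n}$ ) is an assumption about the shape of (86), not a quotation of it.

```python
def uniform_kde_moments(xs):
    """Mean and variance of the uniform-kernel density estimate, d = 1.

    Assumed closed form: each box [x - h, x + h] adds x to the first moment
    and x^2 + h^2/3 to the second, with h = n^(-1/2).
    """
    n = len(xs)
    h = n ** -0.5
    mean = sum(xs) / n
    second = sum(x * x for x in xs) / n + h * h / 3.0  # eps_n = 1/(3n)
    return mean, second - mean ** 2


def uniform_kde_moments_numeric(xs, steps=2000):
    """Same moments by midpoint-rule integration over each box."""
    n = len(xs)
    h = n ** -0.5
    dt = 2.0 * h / steps
    m1 = m2 = 0.0
    for x in xs:
        for s in range(steps):
            t = x - h + (s + 0.5) * dt
            w = dt / (2.0 * h * n)  # kernel density 1/(2h), weight 1/n per box
            m1 += t * w
            m2 += t * t * w
    return m1, m2 - m1 ** 2


points = [0.0, 3.0, 6.0, 9.0]  # boxes of half-width 0.5, pairwise disjoint
closed = uniform_kde_moments(points)
numeric = uniform_kde_moments_numeric(points)
```

The moment formulas hold for any mixture of such boxes; disjointness is what the text needs for the separate step that $\widetilde{p}_{n}^{(\beta)}$ equals $\widetilde{p}_{n}$ .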

Next we find the estimators using the following $d$ -dimensional Epanechnikov kernel

[TABLE]

where $\|\cdot\|$ denotes the Euclidean norm. We thus have

[TABLE]

This yields the same estimators for $\mu$ and $k_{ij}$ as in (86), except that the correction term $\epsilon_{n}$ differs by a scale factor. For example, when $\beta=2$ , $\epsilon_{n}$ changes to $\frac{1}{7n}$ for $d=1$ , to $\frac{17}{91n^{1/2}}$ for $d=2$ , to $\frac{25}{63n^{1/3}}$ for $d=3$ , and so on.

Observe that, for $\beta\in\big{[}(d+4)/(d+2),(d+2)/d\big{)}$ (that is, $\nu\leq 2$ ), Cauchy distributions do not have finite mean and variance. Hence the estimates (86) oscillate as one increases the sample size $n$ . In this case some other smoothing technique, or a different bandwidth in Theorem 15, may produce a better estimator. A general theory for this is part of our future work. However, for $\beta\in\big{(}1,(d+4)/(d+2)\big{)}$ , these estimators are close to the true parameters. In Table II, we summarize the results of a simulation study where we find the estimators by averaging over 25 different sets of random samples of size $50$ drawn from a standard Cauchy distribution, using both kernels and different degrees of freedom. We observe that, for $\nu>2$ , the estimators are bounded, and as $\nu$ increases, their performance improves.

V Summary and concluding remarks

Projection theorems of Jones et al. ( $\mathscr{I}_{\alpha}$ ) and Hellinger ( $D_{\alpha}$ ) divergences tell us that the reverse projection, respectively, on the power-law families $\mathbb{M}^{(\alpha)}$ and $\mathscr{E}^{(\alpha)}$ turns out to be a forward projection on a “simpler” (linear or $\alpha$ -linear) family which, in turn, reduces to a linear problem on the underlying probability distribution. The applicability of the projection theorems known in the literature was limited, as they dealt only with discrete and canonical models. In this work, we first generalized the associated power-law families to a more general set-up including the continuous case. We observed that these two families are related through an escort transformation, apart from the $\alpha\leftrightarrow(2-\alpha)$ transformation studied in [64]. We then introduced the notion of regularity for these power-law families analogous to the concept of regular exponential family (or full rank family). This distinguishes these families from similar families studied in the literature, namely, the $\phi$ -exponential class defined in [51], the $\mathcal{F}_{[\beta h]}$ class defined in [26] and so on. We then extended the projection theorems of $\mathscr{I}_{\alpha}$ and $D_{\alpha}$ to these general forms of the power-law families by solving the respective estimating equations. We observed that, for regular families, the new estimating equations coincide with the respective projection equations, similar to the ones in the canonical case. We also observed that both the estimating equations were characterized by some specific statistics of the samples. Such a characterization is well-known in the literature for the pair MLE and regular exponential family (see, for example, [14, pp. 149–150]). We finally showed that the Student and Cauchy distributions respectively form regular $\mathbb{M}^{(\alpha)}$ and $\mathscr{E}^{(\alpha)}$ families, and that they are the escort distributions of each other.
We then applied the above projection theorems to find generalized estimators for their parameters. Interestingly, both the Basu et al. and Jones et al. estimators for the mean and covariance parameters of a $d$ -dimensional Student distribution for $\nu>2$ are the same as the MLE of the respective parameters of a normal distribution, where $\nu$ is the degrees of freedom. We also found a more general class of distributions that includes Student distributions for which both the estimations are the same (Theorem 12). In [30, Eq. (38)], it was shown that the sample mean and sample variance are still the generalized estimators (Jones et al. or Basu et al.) for compactly supported Student distributions (that is, $\nu<-d$ ), but with the assumption that all samples are from the true distribution. We showed that, in the presence of outliers, the generalized estimator for the mean parameter might differ from the sample mean (Section IV-B). Next we found a class of generalized Hellinger estimators for the location and scale parameters of Cauchy distributions that involve a continuous estimate for the sample empirical measure. In particular, we found the estimators by using the uniform and Epanechnikov kernel density estimates for the empirical measure. We summarized this by a simulation study where we observed that these estimators are close to the true value of the parameters of Cauchy distributions when their mean and variance are finite. It is well-known that the MLE for Student or Cauchy distributions does not have a closed-form solution. To overcome this, standard iterative methods such as Newton-Raphson, Gauss-Newton, EM are used in the literature [2, 29]. However, the sequence of estimators in these iterative methods may converge to a local maximum and the rate of convergence is also slow [2, 45]. Later some generalized iterative methods such as ECM, ECME were proposed, for example, in [45], where the rate of convergence was made faster than the previous methods. But again, they converge only to a local maximum. These difficulties can be overcome by some of the generalized estimators that we studied in this paper, as they are not only robust but also have closed-form solutions.

Acknowledgements

Atin Gayen is supported by an INSPIRE fellowship of the Department of Science and Technology, Govt. of India. Part of this work was carried out when the authors were with the Indian Institute of Technology Indore. The authors would like to thank Professor Arup Bose for his constructive comments. The authors would also like to thank Professor Michel Broniatowski for the discussions they had with him during his visit to India through the VAJRA programme of Govt. of India. The authors also thank the Editor, Associate Editor and the referees for their valuable comments that improved the presentation of the paper.

Appendix: Projection theorem of density power divergence

The Projection theorem and the Pythagorean property of the more general class of Bregman divergences were established by Csiszár and Matúš [26] using tools from convex analysis. The density power divergence $B_{\alpha}$ is a subclass of the Bregman divergences. However, it is not easy to extract the results for the $B_{\alpha}$ -divergence from [26]. Ohara and Wada [52] also studied this by considering a specific form of the associated parametric family. In this section we derive the projection results for the $B_{\alpha}$ -divergence in the discrete case using some elementary tools. We must point out that the geometry of $B_{\alpha}$ -divergence is quite a natural extension of that of $I$ -divergence. Let $\mathbb{S}$ be a finite alphabet set and $\mathcal{P}:=\mathcal{P}(\mathbb{S})$ be the space of all probability distributions on $\mathbb{S}$ . Then for any $p,q\in\mathcal{P}$ , from (2), the $B_{\alpha}$ -divergence in the discrete case can be written as

[TABLE]

Let us also recall the definitions of reverse and forward projections given in (15) and (16). For $p\in\mathcal{P}$ , we shall denote the support of $p$ as $\text{Supp}(p)$ . For $\mathbb{C}\subset\mathcal{P}$ , $\text{Supp}(\mathbb{C})$ is defined as the union of support of members of $\mathbb{C}$ .

We now show the Pythagorean inequality of $B_{\alpha}$ -divergence in connection with the forward projection on a non-empty closed, convex set (hence compact, since $\mathbb{S}$ is finite). Thus the existence of the forward projection is always guaranteed, since $B_{\alpha}$ is lower semi-continuous [26, Lem. 2.12]. In the following we assume that $\text{Supp}(q)=\mathbb{S}$ .

Theorem 16

Let $p^{*}$ be the forward $B_{\alpha}$ -projection of $q$ on a closed and convex set $\mathbb{C}$ . Then

[TABLE]

Further if $\alpha<1$ , $\text{Supp}(\mathbb{C})=\text{Supp}(p^{*})$ .

Proof 9

Let $p\in\mathbb{C}$ and define, for $t\in[0,1]$ and $x\in\mathbb{S}$ ,

[TABLE]

Since $\mathbb{C}$ is convex, $p_{t}\in\mathbb{C}$ . By mean-value theorem, for each $t\in(0,1)$ ,

[TABLE]

From (87), we have

[TABLE]

Therefore (90) implies

[TABLE]

Hence, as $t\downarrow 0$ , we have

[TABLE]

which implies

[TABLE]

If $\text{Supp}(p^{*})\neq\text{Supp}(\mathbb{C})$ , then there exist $p\in\mathbb{C}$ and $x\in\mathbb{S}$ such that $p(x)>0$ but $p^{*}(x)=0$ . Hence if $\alpha<1$ , then the left-hand side of (91) goes to $-\infty$ as $t\downarrow 0$ , which contradicts (91). This proves the claim.

Remark 10

If $\alpha>1$ , in general, $\text{Supp}(p^{*})\neq\text{Supp}(\mathbb{C})$ . [43, Ex. 2] serves as a counterexample here as well. It follows from the following fact. Since $\mathbb{S}$ is finite, the $B_{\alpha}$ -divergence can be written as

[TABLE]

where $u$ is the uniform distribution on $\mathbb{S}$ and $|~{}\mathbb{S}~{}|$ denotes the cardinality of $\mathbb{S}$ . This implies

[TABLE]

where $H_{\alpha}(p):=\frac{1}{1-\alpha}\ln\sum_{x\in\mathbb{S}}p(x)^{\alpha}$ is the Rényi entropy of $p$ of order $\alpha$ . That is, the forward $B_{\alpha}$ -projection of the uniform distribution on $\mathbb{C}$ is the same as the maximizer of Rényi entropy on $\mathbb{C}$ . The same is true when $B_{\alpha}$ is replaced by $\mathscr{I}_{\alpha}$ or $D_{\alpha}$ .

We will now present a situation when the equality holds in (88).

Definition 17

The linear family, determined by $k$ real valued functions $f_{i},i\in\{1,\ldots,k\}$ on $\mathbb{S}$ and $k$ real numbers $a_{i},i\in\{1,\ldots,k\}$ , is defined as

[TABLE]

Theorem 18

Let $p^{*}$ be the forward $B_{\alpha}$ -projection of $q$ on $\mathbb{L}$ . The following hold.

(a)

If $\alpha<1$ then the Pythagorean equality holds, that is,

[TABLE]

(b)

If $\alpha>1$ and $\text{Supp}(p^{*})=\text{Supp}(\mathbb{L})$ then the Pythagorean equality (94) holds.

Proof 10

(a) Let $p_{t}$ be as in (89). Since $\text{Supp}(p^{*})=\text{Supp}(\mathbb{L})$ , there exists $t^{\prime}<0$ such that $p_{t}=(1-t)p^{*}+tp\in\mathbb{L}$ for $t\in(t^{\prime},0)$ . Hence, proceeding as in Theorem 16, for every $t\in(t^{\prime},0)$ , there exists $\tilde{t}\in(t,0)$ such that

[TABLE]

Hence we get (92) with a reversed inequality. Thus we have equality in (92). Hence we have (94).

(b) Similar to (a).

When $\alpha>1$ , equality in (94) does not hold in general, as the following example shows.

Example 7

Let $\alpha=2$ , $\mathbb{S}=\{1,2,3,4\}$ and

[TABLE]

In view of Remark 10 and [43, Ex. 2], we see that $p^{*}=[3/4,1/4,0,0]^{\top}$ is the forward $B_{\alpha}$ -projection of the uniform distribution $u$ on $\mathbb{L}$ . However there exists $p\in\mathbb{L}$ , say $p=[157/200,97/600,1/50,1/30]^{\top}$ , that satisfies only the strict inequality in (88). The issue here is that $\text{Supp}(p^{*})$ $\subsetneq\text{Supp}(\mathbb{L})$ .
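The strict inequality in this example can be checked numerically. The sketch below assumes the standard form of the density power divergence, $B_{\alpha}(p,q)=\frac{\alpha}{1-\alpha}\sum_{x}p(x)q(x)^{\alpha-1}-\frac{1}{1-\alpha}\sum_{x}p(x)^{\alpha}+\sum_{x}q(x)^{\alpha}$ , and reads (88) as $B_{\alpha}(p,q)\geq B_{\alpha}(p,p^{*})+B_{\alpha}(p^{*},q)$ ; both are assumptions, since the displays (87) and (88) are not reproduced here. For $\alpha=2$ this form reduces to the squared Euclidean distance, so a positive rescaling of the divergence would not change the sign of the gap.

```python
def density_power_divergence(p, q, alpha):
    """B_alpha(p, q) on a finite alphabet (assumed standard form, alpha != 1).

    Note: for alpha < 1 this sketch requires q to have full support.
    """
    cross = sum(pi * qi ** (alpha - 1) for pi, qi in zip(p, q))
    self_p = sum(pi ** alpha for pi in p)
    self_q = sum(qi ** alpha for qi in q)
    return alpha / (1 - alpha) * cross - self_p / (1 - alpha) + self_q


alpha = 2
u = [1 / 4] * 4                              # uniform distribution on S
p_star = [3 / 4, 1 / 4, 0.0, 0.0]            # forward projection from the example
p = [157 / 200, 97 / 600, 1 / 50, 1 / 30]    # member of L from the example

lhs = density_power_divergence(p, u, alpha)
rhs = (density_power_divergence(p, p_star, alpha)
       + density_power_divergence(p_star, u, alpha))
gap = lhs - rhs  # positive gap: the Pythagorean relation holds only strictly
```

Under these assumptions the gap equals $2\sum_{x}(p(x)-p^{*}(x))(p^{*}(x)-u(x))$ , which is positive for this $p$ .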

We now find an explicit expression of the forward $B_{\alpha}$ -projection in both the cases $\alpha<1$ and $\alpha>1$ separately.

Theorem 19

Let $q\in\mathcal{P}$ and let $\mathbb{L}$ be a linear family of probability distributions as in (93).

(a)

If $\alpha<1$ , the forward $B_{\alpha}$ -projection $p^{*}$ of $q$ on $\mathbb{L}$ satisfies

[TABLE]

with $\theta:=[\theta_{1},\dots,\theta_{k}]^{\top}$ , $f:=[f_{1},\dots,f_{k}]^{\top}$ where $\theta_{1},\dots,\theta_{k}$ are some scalars and $F$ is a constant.

(b)

If $\alpha>1$ , the forward $B_{\alpha}$ -projection $p^{*}$ of $q$ on $\mathbb{L}$ satisfies

[TABLE]

where $\theta$ , $f$ and $F$ are as in (a).

Proof 11

(a)

The proof is similar to that for $I$ -divergence in [27, Th. 3.2]. The linear family in Definition 17 can be re-written as

[TABLE]

Let $\mathbb{H}$ be the subspace of $\mathbb{R}^{|\text{Supp}(\mathbb{L})|}$ spanned by the $k$ vectors $f_{1}(\cdot)-a_{1},\dots,f_{k}(\cdot)-a_{k}$ . Then every $p\in\mathbb{L}$ can be thought of as a $|\text{Supp}(\mathbb{L})|$ -dimensional vector in $\mathbb{H}^{\perp}$ . Hence $\mathbb{H}^{\perp}$ is a subspace of $\mathbb{R}^{|\text{Supp}(\mathbb{L})|}$ that contains a vector whose components are strictly positive, since $p^{*}\in\mathbb{L}$ and $\text{Supp}(p^{*})=\text{Supp}(\mathbb{L})$ . It follows that $\mathbb{H}^{\perp}$ is spanned by its probability vectors. From (92) we see that (94) is equivalent to

[TABLE]

This implies that the vector

[TABLE]

Hence

[TABLE]

for some scalars $c_{1},\ldots,c_{k}$ . This implies (95) for appropriate choices of $F$ and $\theta_{1},\ldots,\theta_{k}$ .

(b)

The proof of this case is similar to that of $\mathscr{I}_{\alpha}$ -divergence [43, Th. 14(b)]. The optimization problem concerning the forward $B_{\alpha}$ -projection is

[TABLE]

Hence by [8, Prop. 3.3.7], there exist Lagrange multipliers $\lambda_{1},\dots,\lambda_{k}$ , $\nu$ and $(\mu(x),$ $x\in\mathbb{S})$ , respectively, associated with the above constraints such that, for $x\in\mathbb{S}$ ,

[TABLE]

Since

[TABLE]

(103) can be re-written as

[TABLE]

Multiplying both sides by $p^{*}(x)$ and summing over all $x\in\mathbb{S}$ , we get

[TABLE]

For $x\in\text{Supp}(p^{*})$ , from (105), we must have $\mu(x)=0$ . Then, from (107), we have

[TABLE]

If $p^{*}(x)=0$ , from (107) we get

[TABLE]

Combining (108) and (109) we get (96).

Theorem 19 suggests defining a parametric family of probability distributions that generalizes the usual exponential family. We call it a $\mathbb{B}^{(\alpha)}$ -family. First we formally define this family and then show an orthogonality relationship between this family and the linear family. As a consequence we will also show that the reverse $B_{\alpha}$ -projection on a $\mathbb{B}^{(\alpha)}$ -family is the same as the forward projection on a linear family.

Definition 20

Let $q\in\mathcal{P}$ where $\text{Supp}(q)=\mathbb{S}$ for $\alpha>1$ and $f=[f_{1},\dots,f_{k}]^{\top}$ where $f_{i}$ , for $i\in\{1,\dots,k\}$ , are real-valued functions on $\mathbb{S}$ . The $k$ -parameter canonical $\mathbb{B}^{(\alpha)}:=\mathbb{B}^{(\alpha)}(q,f)$ family of probability distributions characterized by $q$ and $f$ is defined by $\mathbb{B}^{(\alpha)}=\{p_{\theta}:\theta\in\Theta\}\subset\mathcal{P}$ where

[TABLE]

for some $F:\Theta\to\mathbb{R}$ and $\Theta$ is the subset of $\mathbb{R}^{k}$ for which $p_{\theta}\in\mathcal{P}$ .

Remark 11

(a)

Observe that the $\mathbb{B}^{(\alpha)}$-family is a special case of the family $\mathcal{F}_{[\beta h]}$ in **[26, Eq. (28)]** with $h=q$ and $\beta(\cdot,t)=\frac{1}{\alpha-1}[t^{\alpha}-\alpha t+\alpha-t]$.

(b)

The family depends on the reference measure $q$ only in a loose manner in the sense that any other member of the family can play the role of $q$ . The change of reference measure only corresponds to a translation of the parameter space. (This fact is true for the $\mathbb{M}^{(\alpha)}$ -family **[43, Prop. 22]**.)

The following theorem and its corollary together establish an “orthogonality” relationship between the $\mathbb{B}^{(\alpha)}$ -family and the associated linear family.

Theorem 21

Let $\alpha\in(0,1)$ . Consider a $\mathbb{B}^{(\alpha)}$ -family as in Definition 20 and let $\mathbb{L}$ be the corresponding linear family determined by the same functions $f_{i},i\in\{1,\ldots,k\}$ and some constants $a_{i},i\in\{1,\ldots,k\}$ as in (93). If $p^{*}$ is the forward $B_{\alpha}$ -projection of $q$ on $\mathbb{L}$ then we have the following:

(a)

$\mathbb{L}\cap\text{cl}(\mathbb{B}^{(\alpha)})=\{p^{*}\}$ and

[TABLE]

(b)

Further, if $\text{Supp}(\mathbb{L})=\mathbb{S}$ , then $\mathbb{L}\cap\mathbb{B}^{(\alpha)}=\{p^{*}\}$ .

Proof 12

By Theorem 19, the forward $B_{\alpha}$ -projection $p^{*}$ of $q$ on $\mathbb{L}$ is in $\mathbb{B}^{(\alpha)}$ . This implies that $p^{*}\in\mathbb{L}\cap\mathbb{B}^{(\alpha)}$ . Hence it suffices to prove the following:

(i)

Every $\tilde{p}\in\mathbb{L}\cap\text{cl}(\mathbb{B}^{(\alpha)})$ satisfies (94) with $\tilde{p}$ in place of $p^{*}$.

(ii)

$\mathbb{L}\cap\text{cl}(\mathbb{B}^{(\alpha)})$ is non-empty.

We now proceed to prove both (i) and (ii).

(i) Let $\tilde{p}\in\mathbb{L}\cap\text{cl}(\mathbb{B}^{(\alpha)})$ . As $\tilde{p}\in\text{cl}(\mathbb{B}^{(\alpha)})$ , this implies that there exists a sequence $\{p_{n}\}\subset\mathbb{B}^{(\alpha)}$ such that $p_{n}\rightarrow\tilde{p}$ as $n\to\infty$ . Since $p_{n}\in\mathbb{B}^{(\alpha)}$ , we can write

[TABLE]

for some constants $\theta_{n}:=[\theta_{n}^{(1)},\dots,\theta_{n}^{(k)}]^{\top}\in\mathbb{R}^{k}$ and $F_{n}$. Now for any $p\in\mathbb{L}$ we have, from the definition of a linear family, $\sum_{x\in\mathbb{S}}p(x)f_{i}(x)=a_{i}$, $i\in\{1,\ldots,k\}$. Since $\tilde{p}\in\mathbb{L}$, we also have $\sum_{x\in\mathbb{S}}\tilde{p}(x)f_{i}(x)=a_{i}$, $i\in\{1,\ldots,k\}$. Multiplying both sides of (111) by $p(x)$ and $\tilde{p}(x)$ separately and summing over $x\in\mathbb{S}$, we get

[TABLE]

and

[TABLE]

Combining the above two equations, we get

[TABLE]

As $n\to\infty$ , the above becomes

[TABLE]

which is equivalent to (94).

(ii) Let $p_{n}^{*}$ be the forward $B_{\alpha}$ -projection of $q$ on the linear family

[TABLE]

(see Figure 3).

By construction, $\left(1-\frac{1}{n}\right)p+\frac{1}{n}q\in\mathbb{L}_{n}$ for any $p\in\mathbb{L}$. Hence, since $\text{Supp}(q)=\mathbb{S}$, we have $\text{Supp}(\mathbb{L}_{n})=\mathbb{S}$. Since $\mathbb{L}_{n}$ is also characterized by the same functions $f_{i},i\in\{1,\ldots,k\}$, we have $p_{n}^{*}\in\mathbb{B}^{(\alpha)}$ for every $n\in\mathbb{N}$. Hence the limit of any convergent subsequence of $\{p_{n}^{*}\}$ belongs to $\text{cl}(\mathbb{B}^{(\alpha)})\cap\mathbb{L}$. Thus $\text{cl}(\mathbb{B}^{(\alpha)})\cap\mathbb{L}$ is non-empty. This completes the proof.
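The membership claim at the start of part (ii) is a one-line computation. The sketch below assumes, since the displayed definition is not reproduced here, that $\mathbb{L}_{n}$ is the linear family whose $i$-th constraint value is $\left(1-\frac{1}{n}\right)a_{i}+\frac{1}{n}\sum_{x}q(x)f_{i}(x)$. Under that assumption, for any $p\in\mathbb{L}$:

```latex
\begin{align*}
\sum_{x\in\mathbb{S}}\Big[\Big(1-\tfrac{1}{n}\Big)p(x)+\tfrac{1}{n}q(x)\Big]f_{i}(x)
  &= \Big(1-\tfrac{1}{n}\Big)\sum_{x\in\mathbb{S}}p(x)f_{i}(x)
     +\tfrac{1}{n}\sum_{x\in\mathbb{S}}q(x)f_{i}(x) \\
  &= \Big(1-\tfrac{1}{n}\Big)a_{i}
     +\tfrac{1}{n}\sum_{x\in\mathbb{S}}q(x)f_{i}(x),
  \qquad i\in\{1,\ldots,k\}.
\end{align*}
```

The mixture puts mass at least $q(x)/n>0$ on every $x\in\mathbb{S}$, which is why $\text{Supp}(\mathbb{L}_{n})=\mathbb{S}$. Under the same assumption the constraint values tend to $a_{i}$ as $n\to\infty$, so a limit of members of $\mathbb{L}_{n}$ satisfies the constraints of $\mathbb{L}$.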

Corollary 22

Let $\alpha\in(0,1)$ . Let $\mathbb{L}$ and $\mathbb{B}^{(\alpha)}$ be characterized by the same functions $f_{i},i\in\{1,\ldots,k\}$ . Then $\mathbb{L}\cap\text{cl}(\mathbb{B}^{(\alpha)})=\{p^{*}\}$ and

[TABLE]

Proof 13

By Theorem 21, we have $\mathbb{L}\cap\text{cl}(\mathbb{B}^{(\alpha)})=\{p^{*}\}$ . In view of Remark 11(b), notice that every member of $\mathbb{B}^{(\alpha)}$ has the same projection on $\mathbb{L}$ , namely $p^{*}$ . Hence (94) holds for every $q\in\mathbb{B}^{(\alpha)}$ . Thus we only need to prove (94) for every $q\in\text{cl}(\mathbb{B}^{(\alpha)})\setminus\mathbb{B}^{(\alpha)}$ . Let $q\in\text{cl}(\mathbb{B}^{(\alpha)})\setminus\mathbb{B}^{(\alpha)}$ . There exists $\{q_{n}\}\subset\mathbb{B}^{(\alpha)}$ such that $q_{n}\rightarrow q$ . Hence for any $p\in\mathbb{L}$ , we have

[TABLE]

Since, for a fixed $p$, the map $q\mapsto B_{\alpha}(p,q)$ is continuous as a function from $\mathcal{P}$ to $[0,\infty]$, taking the limit as $n\to\infty$ on both sides of (113), we have

[TABLE]

This completes the proof.

Theorem 21 does not hold, in general, for $\alpha>1$ as shown in the following example.

Example 8

Let $\alpha,\mathbb{S},\mathbb{L}$ and $u$ be as in Example 7. Then the associated $\mathbb{B}^{(\alpha)}$ -family is given by

[TABLE]

where $f=[1,-3,-5,-6]^{\top}$, $F(\theta)=\frac{13\theta}{4}$ and $\theta\in(-\frac{1}{17},\frac{1}{11})$. Then we have

[TABLE]

where $p_{\theta}=\big[(\frac{1}{4}+\frac{17\theta}{4}),(\frac{1}{4}+\frac{\theta}{4}),(\frac{1}{4}-\frac{7\theta}{4}),(\frac{1}{4}-\frac{11\theta}{4})\big]^{\top}$. If $p_{\theta}\in\text{cl}(\mathbb{B}^{(\alpha)})\cap\mathbb{L}$, then $\sum_{x\in\mathbb{S}}p_{\theta}(x)f(x)=0$. This implies $\theta=\frac{13}{115}$, which lies outside the closed interval $[-\frac{1}{17},\frac{1}{11}]$ since $\frac{13}{115}>\frac{1}{11}$. Hence $\text{cl}(\mathbb{B}^{(\alpha)})\cap\mathbb{L}=\emptyset$.
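The arithmetic in Example 8 can be checked with exact rational numbers. The sketch below is not from the paper: it takes the components of $p_{\theta}$ and the vector $f$ exactly as stated above, assumes (as the example implies) that membership in $\mathbb{L}$ means $\sum_{x}p(x)f(x)=0$, and uses hypothetical helper names.

```python
from fractions import Fraction as Fr

# Data copied from Example 8; all helper names are illustrative only.
f = [Fr(1), Fr(-3), Fr(-5), Fr(-6)]
# p_theta(x) = 1/4 + slope[x] * theta, with the slopes read off the example.
slope = [Fr(17, 4), Fr(1, 4), Fr(-7, 4), Fr(-11, 4)]


def p_theta(theta):
    """Components of p_theta for a given scalar theta."""
    return [Fr(1, 4) + s * theta for s in slope]


def moment(theta):
    """sum_x p_theta(x) f(x); assumed to vanish exactly on the linear family."""
    return sum(p * fx for p, fx in zip(p_theta(theta), f))


# moment is affine in theta, so two evaluations determine its root.
intercept = moment(Fr(0))
gradient = moment(Fr(1)) - intercept
theta_root = -intercept / gradient

# Interval of theta on which every component stays strictly positive.
lower = max(-Fr(1, 4) / s for s in slope if s > 0)
upper = min(-Fr(1, 4) / s for s in slope if s < 0)

print("root of the moment constraint:", theta_root)
print("positivity interval:", (lower, upper))
print("root inside closed interval:", lower <= theta_root <= upper)
```

Because the slopes sum to zero, every $p_{\theta}$ sums to one, so only positivity restricts $\theta$; the root of the moment constraint falls to the right of that interval, matching the conclusion of the example.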

The following theorem tells us that a reverse $B_{\alpha}$-projection on a $\mathbb{B}^{(\alpha)}$-family can be turned into a forward $B_{\alpha}$-projection on the associated linear family. We shall refer to this as the projection theorem for the $B_{\alpha}$-divergence. This theorem is analogous to the ones for the $I$-divergence [27, Th. 3.3], the $\mathscr{I}_{\alpha}$-divergence [43, Th. 18] and the $D_{\alpha}$-divergence [41, Th. 6].

Theorem 23

Let $\alpha\in(0,1)$ . Let $X_{1}^{n}:=(X_{1},\dots,X_{n})\in\mathbb{S}^{n}$ . Let $p_{n}$ be the empirical probability measure of $X_{1}^{n}$ and let

[TABLE]

where $\bar{f_{i}}=\frac{1}{n}\sum_{j=1}^{n}f_{i}(X_{j}),i\in\{1,\ldots,k\}$ . Let $p^{*}$ be the forward $B_{\alpha}$ -projection of $q$ on $\widehat{\mathbb{L}}_{n}$ . Then the following hold.

(i)

If $p^{*}\in\mathbb{B}^{(\alpha)}$, then $p^{*}$ is the reverse $B_{\alpha}$-projection of $p_{n}$ on $\mathbb{B}^{(\alpha)}$.

(ii)

If $p^{*}\notin\mathbb{B}^{(\alpha)}$, then $p_{n}$ does not have a reverse $B_{\alpha}$-projection on $\mathbb{B}^{(\alpha)}$. However, $p^{*}$ is the reverse $B_{\alpha}$-projection of $p_{n}$ on $\text{cl}(\mathbb{B}^{(\alpha)})$.
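The constraint values $\bar{f_{i}}$ are chosen so that the empirical measure itself satisfies them. The following Python sketch illustrates this on a toy sample. It assumes that $\widehat{\mathbb{L}}_{n}$ is the linear family $\{p:\sum_{x}p(x)f_{i}(x)=\bar{f_{i}},\ i=1,\ldots,k\}$; the support, the sample, the functions in `fs` and all helper names are hypothetical choices made for illustration.

```python
from collections import Counter
from fractions import Fraction as Fr


def empirical_measure(sample, support):
    """Relative frequency of each support point in the sample."""
    counts = Counter(sample)
    n = len(sample)
    return {x: Fr(counts[x], n) for x in support}


def sample_means(sample, fs):
    """f_bar_i = (1/n) * sum_j f_i(X_j) for each f_i in fs."""
    n = len(sample)
    return [sum(Fr(f(x)) for x in sample) / n for f in fs]


def moments(p, fs):
    """sum_x p(x) f_i(x) for each f_i in fs."""
    return [sum(p[x] * Fr(f(x)) for x in p) for f in fs]


# Toy data (assumed, not from the paper).
support = [0, 1, 2, 3]
fs = [lambda x: x, lambda x: x * x]
sample = [0, 1, 1, 3, 2, 1, 0, 3]

p_n = empirical_measure(sample, support)
f_bar = sample_means(sample, fs)

# The empirical measure meets every constraint of the assumed family.
print(moments(p_n, fs) == f_bar)
```

Exact fractions are used so that the equality between the moments of $p_{n}$ and the sample means is tested without floating-point tolerance.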

Proof 14

Let us first observe that $\widehat{\mathbb{L}}_{n}$ is constructed so that $p_{n}\in\widehat{\mathbb{L}}_{n}$. Since the families $\widehat{\mathbb{L}}_{n}$ and $\mathbb{B}^{(\alpha)}$ are defined by the same functions $f_{i}$, $i\in\{1,\dots,k\}$, by Corollary 22, we have $\widehat{\mathbb{L}}_{n}\cap\text{cl}(\mathbb{B}^{(\alpha)})=\{p^{*}\}$ and

[TABLE]

Hence the minimizer of $B_{\alpha}(p_{n},q)$ over $q\in\text{cl}(\mathbb{B}^{(\alpha)})$ is the same as the minimizer of $B_{\alpha}(p^{*},q)$ over $q\in\text{cl}(\mathbb{B}^{(\alpha)})$. (This statement remains true with $\text{cl}(\mathbb{B}^{(\alpha)})$ replaced by $\mathbb{B}^{(\alpha)}$.) But $B_{\alpha}(p^{*},q)$ over $q\in\text{cl}(\mathbb{B}^{(\alpha)})$ is uniquely minimized by $q=p^{*}$. Hence, if $p^{*}\notin\mathbb{B}^{(\alpha)}$, since the minimum value of $B_{\alpha}(p_{n},q)$ over $q\in\text{cl}(\mathbb{B}^{(\alpha)})$ is the same as the infimum of $B_{\alpha}(p_{n},q)$ over $q\in\mathbb{B}^{(\alpha)}$, the latter is not attained on $\mathbb{B}^{(\alpha)}$.
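The first step of this argument can be written out explicitly. The sketch assumes that the relation displayed above is the Pythagorean-type identity $B_{\alpha}(p_{n},q)=B_{\alpha}(p_{n},p^{*})+B_{\alpha}(p^{*},q)$ for $q\in\text{cl}(\mathbb{B}^{(\alpha)})$, which is the form Corollary 22 is taken to provide here. The first term on the right does not depend on $q$, so:

```latex
\begin{align*}
\operatorname*{arg\,min}_{q\in\text{cl}(\mathbb{B}^{(\alpha)})} B_{\alpha}(p_{n},q)
  &= \operatorname*{arg\,min}_{q\in\text{cl}(\mathbb{B}^{(\alpha)})}
     \big[B_{\alpha}(p_{n},p^{*})+B_{\alpha}(p^{*},q)\big] \\
  &= \operatorname*{arg\,min}_{q\in\text{cl}(\mathbb{B}^{(\alpha)})} B_{\alpha}(p^{*},q)
   = \{p^{*}\}.
\end{align*}
```

The last equality uses the fact, stated in the proof, that $B_{\alpha}(p^{*},q)$ is uniquely minimized at $q=p^{*}$, together with $p^{*}\in\text{cl}(\mathbb{B}^{(\alpha)})$.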

Remark 12

Theorems 21, 23, and Corollary 22 continue to hold for $\alpha>1$ as well if attention is restricted to probability measures with strictly positive components and the existence of $p^{*}$ is guaranteed.

Bibliography

[1] S. F. Ateya and E. A. Madhagi, “On multivariate truncated generalized Cauchy distribution,” Stat. Papers, vol. 54, pp. 879–897, 2013.
[2] V. D. Barnett, “Evaluation of the maximum-likelihood estimator where the likelihood equation has multiple roots,” Biometrika, vol. 53, pp. 151–165, 1966.
[3] A. G. Bashkirov, “On maximum entropy principle, superstatistics, power-law distribution and Rényi parameter,” Phys. A, vol. 340, pp. 153–162, 2004.
[4] A. Basu, S. Basu, and G. Chaudhury, “Robust minimum divergence procedures for count data models,” Sankhya: The Indian Journal of Statistics, vol. 59, pp. 11–27, 1997.
[5] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, “Robust and efficient estimation by minimizing a density power divergence,” Biometrika, vol. 85, pp. 549–559, 1998.
[6] A. Basu, H. Shioya, and C. Park, Statistical Inference: The Minimum Distance Approach. Chapman & Hall/CRC Monographs on Statistics and Applied Probability 120, 2011.
[7] R. Beran, “Minimum Hellinger distance estimates for parametric models,” Ann. Statist., vol. 5, pp. 445–463, 1977.
[8] D. P. Bertsekas, Nonlinear Programming. 2nd ed. Belmont, MA: Athena Scientific, 2003.
