Robust analogs to the Coefficient of Variation

Chandima N. P. G. Arachchige; Luke A. Prendergast; Robert G. Staudte

arXiv:1907.01110·math.ST·September 28, 2020


TL;DR

This paper explores robust, quantile-based alternatives to the coefficient of variation for measuring relative dispersion, especially in the presence of outliers or skewed distributions, through theoretical analysis and simulations.

Contribution

It introduces and evaluates median-based and interquartile range-based measures as robust alternatives to the CV, addressing its sensitivity to outliers and skewness.

Findings

01

Quantile-based measures are more robust to outliers.

02

Median-based measures perform better with skewed data.

03

Simulation studies show improved coverage for proposed estimators.

Abstract

The coefficient of variation (CV) is commonly used to measure relative dispersion. However, since it is based on the sample mean and standard deviation, outliers can adversely affect the CV. Additionally, for skewed distributions the mean and standard deviation do not have natural interpretations and, consequently, neither does the CV. Here we investigate the extent to which quantile-based measures of relative dispersion can provide appropriate summary information as an alternative to the CV. In particular, we investigate two measures, the first being the interquartile range (in lieu of the standard deviation), divided by the median (in lieu of the mean), and the second being the median absolute deviation (MAD), divided by the median, as robust estimators of relative dispersion. In addition to comparing the influence functions of the competing estimators and their asymptotic biases and…

Tables

Table 1: A comparison of the CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ for several distributions. LN refers to the log-normal distribution, WEI$(\lambda,\alpha)$ and PAR$(\lambda,\alpha)$ to the Weibull and Pareto Type II distributions with scale parameter $\lambda$ and shape parameter $\alpha$.

Distribution	CV	0.75* IQR/ $m$	1.4826*MAD/ $m$
Normal( $μ$ , $σ^{2}$ )	$\frac{σ}{μ}$	$\frac{3}{4} \frac{σ}{μ} [Φ^{- 1} (0.75) - Φ^{- 1} (0.25)]$	$\frac{σ}{μ}$
EXP( $λ$ )	1	1.189	1.030
Uniform $(a, b)$	$\frac{1}{\sqrt{3}} \cdot \frac{(b - a)}{(b + a)}$	$\frac{3}{4} \cdot \frac{(b - a)}{(b + a)}$	$\frac{1}{2 Φ^{- 1} (3 / 4)} \cdot \frac{(b - a)}{(b + a)}$
WEI( $λ$ , 1)	1	1.189	1.029
WEI( $λ$ , 2)	0.523	0.578	0.565
WEI( $λ$ , 5)	0.229	0.232	0.229
$χ_{2}^{2}$	1	1.189	1.030
$χ_{5}^{2}$	0.632	0.681	0.646
$χ_{ν \to \infty}^{2}$	$\to 0$	$\to 0$	$\to 0$
LN $(μ, 1)$	1.311	1.090	0.888
LN $(μ, 2)$	7.321	2.695	1.333
PAR $(λ, 2.5)$	2.236	1.453	1.120
PAR $(λ, 5)$	1.291	1.313	1.077

Table 2: Desirable properties of measures of dispersion and their estimators. Here ‘+’, ‘0’ and ‘$-$’ indicate the property always, sometimes or never holds.

Property	CV	${RCV}_{Q}$	${RCV}_{M}$
P1: Scale invariant	+	+	+
P2: Simple to understand	+	+	0
P3: Widely accepted and used	+	$0$	0
P4: Defined for all $F$	$0$ (the CV is only defined if $F$ has a finite variance, but this is usually satisfied for diameter distribution models)	+	+
P5: Bounded influence function	$-$	+	+
Property	$\hat{CV}$	${\hat{RCV}}_{Q}$	${\hat{RCV}}_{M}$
P6: Consistency	0 (consistency and asymptotic normality require the existence of certain moments for $F$)	+	+
P7: Asymptotic normality	0	+	+
P8: Standard error formula available	+	+	+
P9: Unaffected by 1% moderate outliers	0	+	+
P10: Unaffected by 1% extreme outliers	$-$	+	+
P11: Reliable coverage of confidence intervals	$-$	+	+

Table 3: Relative ASD (rASD) comparisons for the estimators of CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ for the N$(5,\sigma^{2})$, LN$(0,\sigma)$, EXP($\lambda$) and PAR$(\alpha)$ distributions.

Distribution	Parameter	rASD for the CV estimator	rASD for the ${RCV}_{Q}$ estimator	rASD for the ${RCV}_{M}$ estimator
N $(5, σ^{2})$	$σ = 0.50$	0.714	1.173	1.173
	$σ = 1$	0.735	1.193	1.193
	$σ = 1.5$	0.768	1.225	1.225
	$σ = 2$	0.812	1.270	1.270
	$σ = 2.5$	0.866	1.324	1.324
	$σ = 3$	0.927	1.388	1.388
LN $(0, σ)$	$σ = 0.10$	0.721	1.172	1.164
	$σ = 0.25$	0.801	1.199	1.149
	$σ = 0.5$	1.151	1.294	1.098
	$σ = 0.75$	2.075	1.438	1.017
	$σ = 1$	4.674	1.621	0.914
	$σ = 1.5$	49.298	2.062	0.669
EXP $(λ)$	$λ$	1	1.594	0.950
PAR( $α$ )	$α = 0.50$	Undefined	3.223	0.419
	$α = 1$	Undefined	2.236	0.664
	$α = 1.5$	Undefined	1.976	0.735
	$α = 2$	Undefined	1.862	0.785
	$α = 2.5$	Undefined	1.799	0.816
	$α = 3$	Undefined	1.760	0.837
	$α = 4$	54.482	1.714	0.864
	$α = 4.5$	5.619	1.699	0.873
	$α = 5$	3.724	1.687	0.880
	$α = 5.5$	2.937	1.678	0.887
	$α = 6$	2.500	1.670	0.892
	$α = 6.5$	2.221	1.664	0.897

Table 4: Simulated coverage probabilities (and widths) for 95% confidence interval estimators for $\hbox{RCV}_{Q}$, Delta CV and the intervals for CV described in Section 2.1. (* median widths reported due to excessively large average widths after back-exponentiation.)

Sample size (n)	Distribution	Panich	Med Mill	Med MMcK	Gulhar method	Inverse method	Delta CV	${RCV}_{Q}$
50	N(5, 1)	0.927(0.08)	0.937(0.08)	0.941(0.08)	0.943(0.08)	0.838(0.06)	0.929(0.08)	0.979(0.16)
	LN(0, 1)	0.688(0.97)	0.817(1.03)	0.803(1.07)	0.508(0.48)	0.808(4.85)	0.997(7.81*)	0.983(1.30)
	EXP(1)	0.965(0.78)	0.978(0.73)	0.981(0.88)	0.887(0.40)	0.992(3.54)	0.997(0.68)	0.985(1.30)
	Chi(5)	0.954(0.34)	0.971(0.35)	0.966(0.36)	0.918(0.26)	0.999(0.76)	0.959(0.33)	0.977(0.58)
	PAR(1, 4)	0.746(1.12)	0.866(1.19)	0.836(1.22)	0.552(0.52)	0.720(2.97)	1.000(3.57E+9*)	0.985(1.70)
100	N(5, 1)	0.938(0.06)	0.949(0.06)	0.948(0.06)	0.943(0.06)	0.900(0.05)	0.938(0.06)	0.978(0.11)
	LN(0, 1)	0.755(0.85)	0.842(0.77)	0.867(0.96)	0.453(0.35)	0.926(2.69)	0.980(5.64)	0.975(0.82)
	EXP(1)	0.979(0.55)	0.988(0.52)	0.991(0.62)	0.863(0.28)	1.000(2.06)	0.983(0.43)	0.971(0.84)
	Chi(5)	0.966(0.24)	0.961(0.34)	0.975(0.26)	0.909(0.18)	1.000(0.59)	0.953(0.22)	0.971(0.39)
	PAR(1, 4)	0.812(0.99)	0.887(0.88)	0.914(1.11)	0.471(0.37)	0.890(1.79)	1.000(1.19E+6*)	0.978(1.07)
200	N(5, 1)	0.947(0.04)	0.946(0.04)	0.945(0.04)	0.940(0.04)	0.955(0.04)	0.942(0.04)	0.979(0.08)
	LN(0, 1)	0.783(0.67)	0.828(0.56)	0.892(0.75)	0.404(0.25)	0.979(2.46)	0.970(2.30)	0.967(0.55)
	EXP(1)	0.987(0.39)	0.988(0.37)	0.997(0.43)	0.850(0.20)	1.000(1.44)	0.974(0.29)	0.966(0.57)
	Chi(5)	0.976(0.17)	0.967(0.17)	0.978(0.18)	0.911(0.13)	1.000(0.47)	0.955(0.15)	0.968(0.27)
	PAR(1, 4)	0.822(0.78)	0.871(0.65)	0.929(0.87)	0.422(0.27)	0.970(4.20)	0.999(1.16E+4*)	0.969(0.71)
500	N(5, 1)	0.944(0.03)	0.949(0.03)	0.950(0.03)	0.944(0.02)	0.987(0.03)	0.950(0.03)	0.967(0.05)
	LN(0, 1)	0.792(0.44)	0.782(0.36)	0.923(0.49)	0.360(0.16)	0.998(2.21)	0.965(1.21)	0.961(0.33)
	EXP(1)	0.991(0.25)	0.960(0.23)	0.994(0.27)	0.841(0.12)	1.000(1.00)	0.959(0.18)	0.960(0.35)
	Chi(5)	0.976(0.11)	0.956(0.11)	0.966(0.11)	0.914(0.08)	1.000(0.36)	0.951(0.09)	0.960(0.17)
	PAR(1, 4)	0.833(0.52)	0.828(0.42)	0.952(0.58)	0.368(0.17)	0.995(1.85)	0.999(291.02*)	0.963(0.43)
1000	N(5, 1)	0.952(0.02)	0.949(0.02)	0.951(0.02)	0.943(0.02)	0.997(0.03)	0.954(0.02)	0.960(0.03)
	LN(0, 1)	0.751(0.31)	0.739(0.26)	0.874(0.35)	0.336(0.11)	0.999(1.51)	0.959(0.81)	0.959(0.23)
	EXP(1)	0.992(0.18)	0.884(0.16)	0.964(0.19)	0.834(0.09)	1.000(0.79)	0.955(0.13)	0.958(0.24)
	Chi(5)	0.979(0.08)	0.928(0.08)	0.950(0.08)	0.906(0.06)	1.000(0.29)	0.949(0.07)	0.958(0.12)
	PAR(1, 4)	0.797(0.37)	0.794(0.30)	0.923(0.41)	0.339(0.12)	0.998(1.65)	0.998(52.74*)	0.956(0.30)

Table 5: Simulated coverage probabilities (and widths) for 95% bootstrap (non-parametric and parametric) confidence interval estimators for $\hbox{RCV}_{M}$

Sample size (n)	Distribution	Non-parametric	Parametric	Asymptotic method
50	N(5, 1)	0.9740(0.141)	0.9616(0.131)	0.9525(0.134)
	LN(0, 1)	0.9772(0.479)	0.9839(0.441)	0.9665(0.524)
	EXP(1)	0.9758(0.565)	0.9893(0.508)	0.9719(0.601)
	Chi(5)	0.9763(0.421)	0.9840(0.394)	0.9557(0.413)
	PAR(1, 4)	0.9777(0.549)	0.9874(0.493)	0.9751(0.619)
100	N(5, 1)	0.9759(0.099)	0.9795(0.093)	0.9493(0.094)
	LN(0, 1)	0.9749(0.337)	0.9859(0.327)	0.9673(0.370)
	EXP(1)	0.9762(0.402)	0.9946(0.374)	0.9648(0.411)
	Chi(5)	0.9738(0.296)	0.9776(0.284)	0.9588(0.291)
	PAR(1, 4)	0.9748(0.389)	0.9933(0.362)	0.9697(0.414)
200	N(5, 1)	0.9725(0.069)	0.9826(0.066)	0.9520(0.066)
	LN(0, 1)	0.9724(0.235)	0.9688(0.236)	0.9726(0.265)
	EXP(1)	0.9720(0.282)	0.9965(0.270)	0.9591(0.287)
	Chi(5)	0.9704(0.207)	0.9848(0.201)	0.9576(0.205)
	PAR(1, 4)	0.9729(0.272)	0.9903(0.261)	0.9681(0.283)
500	N(5, 1)	0.9644(0.043)	0.9851(0.042)	0.9505(0.042)
	LN(0, 1)	0.9668(0.147)	0.9257(0.150)	0.9757(0.169)
	EXP(1)	0.9624(0.177)	0.9962(0.173)	0.9564(0.180)
	Chi(5)	0.9678(0.129)	0.9877(0.127)	0.9574(0.129)
	PAR(1, 4)	0.9681(0.171)	0.9570(0.167)	0.9635(0.176)
1000	N(5, 1)	0.9582(0.030)	0.9861(0.029)	0.9495(0.030)
	LN(0, 1)	0.9616(0.103)	0.8247(0.106)	0.9793(0.120)
	EXP(1)	0.9612(0.124)	0.9757(0.123)	0.9569(0.128)
	Chi(5)	0.9640(0.091)	0.9834(0.090)	0.9571(0.092)
	PAR(1, 4)	0.9606(0.119)	0.8029(0.118)	0.9621(0.124)

Table 6: Summary statistics of the number of doctor visits for males and females

Summary statistic	Male	Female	Female (without outlier)
Sample Size	987	2079	2078
Minimum	0	0	0
1st Quartile	4	4	4
Median	8	8	8
Mean	12.08	12.8	12.45
3rd Quartile	14	15	15
Maximum	300	750	365

Table 7: 95% confidence interval lower bounds (LB) and upper bounds (UB) for the number of doctor visits.

Sample	CV	${RCV}_{Q}$	${RCV}_{M}$
Male	$(1.283, 2.016)$	$(0.837, 1.050)$	$(0.681, 0.807)$
Female	$(1.298, 2.801)$	$(0.943, 1.128)$	$(0.700, 0.786)$
Female, outlier excluded	$(1.237, 1.746)$	$(0.943, 1.128)$	$(0.699, 0.786)$

Table 8: 95% confidence interval lower bounds (LB) and upper bounds (UB) for ratios of CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ between house prices in neighboring suburbs.

Confidence interval method	$x =$ Bundoora, $y =$ Kingsbury		$x =$ Black Rock, $y =$ Beaumaris		$x =$ Oakleigh, $y =$ Oakleigh East	
	LB	UB	LB	UB	LB	UB
${CV}_{x} / {CV}_{y}$	1.0156	1.6079	0.6525	1.3225	0.7219	1.3519
${RCV}_{Q_{x}} / {RCV}_{Q_{y}}$	0.4336	0.9736	0.4844	0.9243	0.4607	1.0914
${RCV}_{M_{x}} / {RCV}_{M_{y}}$	0.5392	1.0808	0.5751	0.9366	0.5286	1.0218

Equations

$$\left\{\left[\frac{1}{\hbox{cv}}+z_{1-\alpha/2}\left(\frac{1}{n^{1/2}}\right)\right]^{-1},\ \left[\frac{1}{\hbox{cv}}-z_{1-\alpha/2}\left(\frac{1}{n^{1/2}}\right)\right]^{-1}\right\}.$$

$$\left\{\widetilde{\hbox{cv}}-z_{1-\alpha/2}\sqrt{(n-1)^{-1}\widetilde{\hbox{cv}}^{2}\left(0.5+\widetilde{\hbox{cv}}^{2}\right)},\ \widetilde{\hbox{cv}}+z_{1-\alpha/2}\sqrt{(n-1)^{-1}\widetilde{\hbox{cv}}^{2}\left(0.5+\widetilde{\hbox{cv}}^{2}\right)}\right\}.$$

$$\left\{\widetilde{\hbox{cv}}\sqrt{\left(\frac{\chi_{n-1,1-\alpha/2}^{2}+2}{n}-1\right)\widetilde{\hbox{cv}}^{2}+\frac{\chi_{n-1,1-\alpha/2}^{2}}{n-1}},\ \widetilde{\hbox{cv}}\sqrt{\left(\frac{\chi_{n-1,\alpha/2}^{2}+2}{n}-1\right)\widetilde{\hbox{cv}}^{2}+\frac{\chi_{n-1,\alpha/2}^{2}}{n-1}}\right\},$$

$$\left\{\tilde{k}\sqrt{\left(\frac{\chi_{n-1,1-\alpha/2}^{2}+2}{n}-1\right)\tilde{k}^{2}+\frac{\chi_{n-1,1-\alpha/2}^{2}}{n-1}},\ \tilde{k}\sqrt{\left(\frac{\chi_{n-1,\alpha/2}^{2}+2}{n}-1\right)\tilde{k}^{2}+\frac{\chi_{n-1,\alpha/2}^{2}}{n-1}}\right\}.$$

$$\frac{\sqrt{n-1}\,\hbox{cv}}{\sqrt{\chi_{n-1,1-\alpha/2}^{2}}},\ \frac{\sqrt{n-1}\,\hbox{cv}}{\sqrt{\chi_{n-1,\alpha/2}^{2}}},$$

$$\hbox{RCV}_{Q}=0.75\times\frac{\hbox{IQR}}{m},$$

$$\hbox{MAD}=\hbox{med}\,|x_{i}-m|,$$

$$\hbox{RCV}_{M}=1.4826\times\frac{\hbox{MAD}}{m}.$$

$$\hbox{IF}(x;\,\mathcal{T},F)=\lim_{\epsilon\downarrow 0}\frac{\mathcal{T}(F_{\epsilon})-\mathcal{T}(F)}{\epsilon}\equiv\frac{\partial}{\partial\epsilon}\mathcal{T}(F_{\epsilon})\Big|_{\epsilon=0}.$$

$$n\,\hbox{Var}[\mathcal{T}(F_{n})]\to\hbox{ASV}(\mathcal{T},F)=\hbox{E}_{F}[\hbox{IF}^{2}(X;\,\mathcal{T},F)].$$

$$\hbox{IF}(x;\,\mathcal{CV},F)=\hbox{CV}\left[\frac{\hbox{IF}(x;\,\mathcal{V},F)}{2\sigma^{2}}-\frac{\hbox{IF}(x;\,\mathcal{M},F)}{\mu}\right].$$

$$\hbox{IF}(x;\,\rho_{p,q},F)=\rho_{p,q}\left\{\frac{\hbox{IF}[x;\,\mathcal{G}(\cdot,p),F]}{x_{p}}-\frac{\hbox{IF}[x;\,\mathcal{G}(\cdot,q),F]}{x_{q}}\right\}.$$

$$\hbox{IF}(x;\,\mathcal{RCV}_{Q},F)=0.75\left[\hbox{IF}(x;\,\rho_{3/4,1/2},F)-\hbox{IF}(x;\,\rho_{1/4,1/2},F)\right].$$

$$\hbox{IF}(x;\,\mathcal{MAD},\Phi)=\frac{1}{4\Phi^{-1}(0.75)\,\phi[\Phi^{-1}(0.75)]}\,\hbox{sign}\left[|x|-\Phi^{-1}(0.75)\right].$$

$$\hbox{IF}(x;\,\mathcal{RCV}_{M},F)=\frac{\hbox{IF}(x;\,\mathcal{MAD},\Phi_{\mu})}{m}-\hbox{RCV}_{M}\frac{\hbox{IF}(x;\,\mathcal{G}(\cdot,1/2),F)}{m}.$$

$$\hbox{E}\left[\hbox{IF}(X;\,\mathcal{CV},F)^{2}\right]=\cdots-\frac{\hbox{E}\left[\hbox{IF}(X;\,\mathcal{V},F)\,\hbox{IF}(X;\,\mathcal{M},F)\right]}{\sigma^{2}\mu}\Bigg\}$$

$$\hbox{ASV}(\mathcal{CV},F)=\cdots$$

$$\hbox{ASV}(\mathcal{RCV}_{Q},F)=\cdots+\frac{g^{2}(1/2)}{m^{2}}-\frac{g(1/2)\left[g(3/4)-g(1/4)\right]}{m\times\hbox{IQR}}\Bigg\}.$$

$$\sqrt{n}\left[m(F_{n})-F^{-1}(1/2),\ \hbox{MAD}(F_{n})-\hbox{MAD}(F)\right]^{\top}\sim\hbox{approx. }N(0,\Sigma),$$

$$\rho_{1}=\frac{1}{4f^{2}(F^{-1}(1/2))},\quad\rho_{2}=\frac{1}{4C_{1}^{2}}\left[1+\frac{C_{2}}{[f(F^{-1}(1/2))]^{2}}\right]$$

$$\hbox{and}\quad\rho_{12}=\frac{1}{4C_{1}f(F^{-1}(1/2))}\left[1-4F(F^{-1}(1/2)-\hbox{MAD})+\frac{C_{3}}{f(F^{-1}(1/2))}\right]$$

$$\hbox{ASV}(\mathcal{RCV}_{M},F)=\hbox{RCV}_{M}^{2}\left(\frac{\rho_{1}}{m^{2}}+\frac{\rho_{2}}{\hbox{MAD}^{2}}-\frac{2\rho_{12}}{m\times\hbox{MAD}}\right).$$

$$\mathcal{T}(F_{n})\pm z_{1-\alpha/2}\,\hbox{ASD}(\mathcal{T},F_{n})/\sqrt{n},$$

$$\hbox{ASV}(\mathcal{W},F)\doteq\frac{1}{[\mathcal{T}(F)]^{2}}\hbox{ASV}(\mathcal{T},F).$$

$$[L,U]_{\hbox{CV}}\equiv\exp\left[\ln(\hbox{cv})\pm z_{1-\alpha/2}\frac{\hbox{ASD}(\mathcal{CV},F_{n})}{\hbox{cv}\sqrt{n}}\right]$$

$$[L,U]_{\hbox{RCV}_{Q}}=\exp\left[\ln(\hbox{rcv}_{Q})\pm z_{1-\alpha/2}\frac{\hbox{ASD}(\mathcal{RCV}_{Q},F_{n})}{\hbox{rcv}_{Q}\sqrt{n}}\right].$$

$$[L,U]_{\hbox{RCV}_{M}}=\exp\left[\ln(\hbox{rcv}_{M})\pm z_{1-\alpha/2}\frac{\hbox{ASD}(\mathcal{RCV}_{M},F_{n})}{\hbox{rcv}_{M}\sqrt{n}}\right].$$

$$Q(p)=\lambda_{1}+\frac{1}{\lambda_{2}}\left(\frac{p^{\lambda_{3}}-1}{\lambda_{3}}-\frac{(1-p)^{\lambda_{4}}-1}{\lambda_{4}}\right),$$

$$\exp\left[\ln(r)\pm z_{1-\alpha/2}\left\{\frac{\hbox{ASD}(\mathcal{RCV}_{M,1},F_{n})}{\hbox{rcv}_{M,1}\,n_{1}}+\frac{\hbox{ASD}(\mathcal{RCV}_{M,2},F_{n})}{\hbox{rcv}_{M,2}\,n_{2}}\right\}\right],$$


Full text

Robust analogs to the Coefficient of Variation

Chandima N. P. G. Arachchige

Department of Mathematics and Statistics, La Trobe University


Luke A. Prendergast

Department of Mathematics and Statistics, La Trobe University


Robert G. Staudte

Department of Mathematics and Statistics, La Trobe University


Abstract

The coefficient of variation (CV) is commonly used to measure relative dispersion. However, since it is based on the sample mean and standard deviation, outliers can adversely affect the CV. Additionally, for skewed distributions the mean and standard deviation do not have natural interpretations and, consequently, neither does the CV. Here we investigate the extent to which quantile-based measures of relative dispersion can provide appropriate summary information as an alternative to the CV. In particular, we investigate two measures, the first being the interquartile range (in lieu of the standard deviation), divided by the median (in lieu of the mean), and the second being the median absolute deviation (MAD), divided by the median, as robust estimators of relative dispersion. In addition to comparing the influence functions of the competing estimators and their asymptotic biases and variances, we compare interval estimators using simulation studies to assess coverage.

Keywords: influence function, median absolute deviation, quantile density

1 Introduction

The coefficient of variation (CV), defined as the ratio of the standard deviation to the mean, is the most commonly used measure of relative dispersion. It has applications in many areas, including engineering, physics, chemistry, medicine, economics and finance, to name just a few. For example, in analytical chemistry the CV is widely used to express the precision and repeatability of an assay (Reed *et al.*, 2002). In finance the coefficient of variation is often considered useful in measuring relative risk (Miller & Karson, 1977), where a test of the equality of the CVs for two stocks can be performed to compare risk. In economics, the CV is a summary statistic of inequality (e.g. Atkinson, 1970; Chen & Fleisher, 1996). Other examples use the CV to assess the homogeneity of bone test samples (Hamer *et al.*, 1995), to assess the strength of ceramics (Gong & Li, 1999) and as a summary statistic to describe the development of age- and sex-specific cut-off points for body-mass indexing in overweight children (Cole *et al.*, 2000).

The lack of robustness to outliers of moment-based measures such as the mean and standard deviation has long been known. Almost a century ago Lovitt & Holtzclaw (1929) proposed a measure called the “coefficient of variability” based on the upper and lower quartiles ($Q_{3}$ and $Q_{1}$). Promoted as an alternative to the CV, it was defined to be $(Q_{3}-Q_{1})/(Q_{3}+Q_{1})$. Bonett (2006) has since called this measure the “coefficient of quartile variation” and introduced an interval estimator which exhibited good coverage even for small samples. This measure was recently re-investigated by Bulent & Hamza (2018), who constructed bootstrap confidence intervals that typically provide conservative coverage. Another alternative measure is the ratio of the mean absolute deviation from the median to the median. This measure has applications in tax assessments (Gastwirth, 1982) and confidence intervals have been considered by Bonett & Seier (2005). The mean absolute deviation is still non-robust to outliers, and robustness can be improved (see e.g. Shapiro, 2005; Reimann *et al.*, 2008; Varmuza & Filzmoser, 2009) by instead using the interquartile range (IQR) or the median absolute deviation (MAD).

For decades, interval estimation for the CV has attracted the attention of many researchers. For example, Gulhar *et al.* (2012) compared no fewer than 15 parametric and non-parametric confidence interval estimators of the population CV. To the best of our knowledge, interval estimators have not been introduced for the versions of the coefficient of variation based on the IQR and MAD. Therefore, given the evident need for interval estimators, which has attracted the interest of many others, one aim of this paper is to provide reliable interval estimators. We are motivated to do so by the excellent coverage achieved for measures based on ratios of quantiles, even for small samples (Prendergast & Staudte, 2016b, 2017a, 2017b; Arachchige *et al.*, 2019).

2 Notations and some selected methods

Let $X_{1},X_{2},\ldots,X_{n}$ be an independent and identically distributed sample of size $n$ from a distribution with distribution function $F$. Then the sample mean estimator is $\overline{X}=n^{-1}\sum_{i=1}^{n}X_{i}$ and the sample variance estimator is $S^{2}=\sum_{i=1}^{n}(X_{i}-\overline{X})^{2}/(n-1)$. The sample coefficient of variation estimator is then $\widehat{\hbox{CV}}=S/\overline{X}$. Next let $\mathcal{F}\,$ be the class of all right-continuous cdfs on the positive axis; that is, each $F\in\mathcal{F}\,$ satisfies $F(0)=0.$ For a sample denoted $x_{1},\ldots,x_{n}$, the statistics $\overline{x}$, $s$, and $\widehat{\hbox{cv}}=s/\overline{x}$ are the observed values of the $\overline{X}$, $S$ and $\widehat{\hbox{CV}}$ estimators above, and are therefore estimates of the unknown population parameters $\mu=\hbox{E}_{F}[X]$, $\sigma=\sqrt{\hbox{E}_{F}[(X-\mu)^{2}]}$ and $\hbox{CV}=\sigma/\mu$, assuming the first two moments of $F$ exist.

For each such $F\in\mathcal{F}\,$ define the associated left-continuous quantile function of $F$ by $Q(u)\equiv\inf\{x:\ F(x)\geq u\}$ , for $0<u<1.$ When the population $F$ is understood to be fixed but unknown, we sometimes simply write $x_{u}=Q(u)$ and write the corresponding estimators of these population quantiles as $\widehat{x}_{u}$ . We restrict attention to the quartiles $x_{0.25}$ , $x_{0.5}$ and $x_{0.75}$ , the sample estimates of which we denote $q_{1}$ , $m$ and $q_{3}$ for convenience.

2.1 Selected interval estimators of the CV

We begin by describing the inverse method (Sharma & Krishna, 1994) for obtaining an interval estimator for the CV, since it is perhaps the most naturally arising interval involving only basic principles. As additional methods for comparison later, we have chosen four of the 15 considered in Gulhar *et al.* (2012) that exhibited comparatively good performance in terms of coverage.

Parametric interval estimators for the CV, such as those that we present below, have typically been developed assuming an underlying normal distribution. For large sample sizes they can nevertheless perform well when there are deviations from normality, owing to the Central Limit Theorem (Gulhar *et al.*, 2012).

The inverse method

Using the above notation, for suitably large $n$, $\overline{x}/s$ is approximately $N(\mu/\sigma,1/n)$ distributed. An approximate $(1-\alpha)\times 100$% confidence interval for $\mu/\sigma$ is therefore $\overline{x}/s\pm z_{1-\alpha/2}/\sqrt{n}$. Noting that $\mu/\sigma$ is simply the inverse of the population CV, an approximate $(1-\alpha)\times 100$% confidence interval for the CV can therefore be obtained by inverting this interval for $\mu/\sigma$, giving, with $\hbox{cv}=s/\overline{x}$ (Sharma & Krishna, 1994),

$$\left\{\left[\frac{1}{\hbox{cv}}+z_{1-\alpha/2}\left(\frac{1}{n^{1/2}}\right)\right]^{-1},\ \left[\frac{1}{\hbox{cv}}-z_{1-\alpha/2}\left(\frac{1}{n^{1/2}}\right)\right]^{-1}\right\}.$$

Robustness of this interval estimator was recently re-investigated by Groeneveld (2011).
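As a rough illustration of the inverse method, the following Python sketch computes the interval $\{[\overline{x}/s+z_{1-\alpha/2}/\sqrt{n}]^{-1},[\overline{x}/s-z_{1-\alpha/2}/\sqrt{n}]^{-1}\}$ from a sample. The function name `inverse_cv_interval` and the guard against a non-positive lower limit for $\mu/\sigma$ are assumptions added here for illustration and are not taken from the paper.

```python
from statistics import NormalDist, mean, stdev


def inverse_cv_interval(data, alpha=0.05):
    """Approximate (1 - alpha) interval for the CV via an interval for mean/sd."""
    n = len(data)
    inv_cv = mean(data) / stdev(data)        # x-bar / s, the estimate of mu / sigma
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile z_{1 - alpha/2}
    half_width = z / n ** 0.5
    lower_inv = inv_cv - half_width
    if lower_inv <= 0:
        # Assumption of this sketch: refuse to invert an interval that reaches zero.
        raise ValueError("interval for mean/sd includes zero; cannot invert")
    # Inverting reverses the order of the end points.
    return 1 / (inv_cv + half_width), 1 / lower_inv


sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.4, 5.1, 4.9, 5.7, 4.6]  # hypothetical data
print(inverse_cv_interval(sample))
```

Because the end points are reciprocals, the resulting interval is not symmetric about the sample CV, and it becomes very wide when $\overline{x}/s$ is close to $z_{1-\alpha/2}/\sqrt{n}$.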

The median-modified Miller interval (Med Mill)

The CV estimator has an approximate asymptotic normal distribution with mean CV and variance $(n-1)^{-1}\hbox{CV}^{2}(0.5+\hbox{CV}^{2})$, leading to an asymptotic interval proposed by Miller (1991). Noting that the mean is a poor summary statistic of central location for skewed distributions, Gulhar *et al.* (2012) proposed a median modification where the sample median replaces the sample mean in $s$. With $\tilde{s}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-m)^{2}}$ and $\widetilde{\hbox{cv}}=\tilde{s}/\overline{x}$, the interval estimator is

$$\left\{\widetilde{\hbox{cv}}-z_{1-\alpha/2}\sqrt{(n-1)^{-1}\widetilde{\hbox{cv}}^{2}\left(0.5+\widetilde{\hbox{cv}}^{2}\right)},\ \widetilde{\hbox{cv}}+z_{1-\alpha/2}\sqrt{(n-1)^{-1}\widetilde{\hbox{cv}}^{2}\left(0.5+\widetilde{\hbox{cv}}^{2}\right)}\right\}.\tag{2.2}$$

While simulations conducted by Gulhar *et al.* (2012) using data sampled from chi-square and gamma distributions showed typically good results for the Miller (1991) interval, coverage was often better, or at least similar, when using the median modification. With our interest mainly in skewed distributions, we focus on the median-modified interval in (2.2).
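The median-modified Miller interval can be sketched in a few lines of Python. This is a minimal illustration that follows the definitions of $\tilde{s}$ and $\widetilde{\hbox{cv}}$ given above; the function name `med_miller_interval` and the example data are hypothetical.

```python
from statistics import NormalDist, mean, median


def med_miller_interval(data, alpha=0.05):
    """Median-modified Miller interval: cv~ -/+ z * sqrt(cv~^2 (0.5 + cv~^2) / (n - 1))."""
    n = len(data)
    m = median(data)
    # s~ uses deviations from the sample median instead of the sample mean.
    s_tilde = (sum((x - m) ** 2 for x in data) / (n - 1)) ** 0.5
    cv_tilde = s_tilde / mean(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = (cv_tilde ** 2 * (0.5 + cv_tilde ** 2) / (n - 1)) ** 0.5
    return cv_tilde - z * se, cv_tilde + z * se


sample = [0.8, 1.1, 1.3, 1.9, 2.4, 2.6, 3.5, 4.2, 6.8, 9.7]  # hypothetical right-skewed data
print(med_miller_interval(sample))
```

Unlike the inverse method, this interval is symmetric about $\widetilde{\hbox{cv}}$, so for small samples with a large CV the lower end point can fall below zero.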

Median modification of the modified McKay (Med MMcK)

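Gulhar *et al.* (2012) also introduced a median modification to the modified McKay interval (McKay, 1932; Vangel, 1996). The median-modified interval is

$$\left\{\widetilde{\hbox{cv}}\sqrt{\left(\frac{\chi_{n-1,1-\alpha/2}^{2}+2}{n}-1\right)\widetilde{\hbox{cv}}^{2}+\frac{\chi_{n-1,1-\alpha/2}^{2}}{n-1}},\ \widetilde{\hbox{cv}}\sqrt{\left(\frac{\chi_{n-1,\alpha/2}^{2}+2}{n}-1\right)\widetilde{\hbox{cv}}^{2}+\frac{\chi_{n-1,\alpha/2}^{2}}{n-1}}\right\},$$

where $\chi_{n-1,\alpha}^{2}$ is the $100\alpha$-th percentile of a chi-square distribution with $(n-1)$ degrees of freedom. We focus on this median-modified interval based on the results in Gulhar *et al.* (2012).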

The Panich method

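Panichkitkosolkul (2009) further modified the modified McKay interval (Vangel, 1996) by replacing the sample CV with the maximum likelihood estimator for a normal distribution, $\tilde{k}=\sqrt{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}/(\sqrt{n}\overline{x})$. The interval is

$$\left\{\tilde{k}\sqrt{\left(\frac{\chi_{n-1,1-\alpha/2}^{2}+2}{n}-1\right)\tilde{k}^{2}+\frac{\chi_{n-1,1-\alpha/2}^{2}}{n-1}},\ \tilde{k}\sqrt{\left(\frac{\chi_{n-1,\alpha/2}^{2}+2}{n}-1\right)\tilde{k}^{2}+\frac{\chi_{n-1,\alpha/2}^{2}}{n-1}}\right\}.$$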

The Gulhar method

Using the fact that $(n-1)S^{2}/\sigma^{2}\sim\chi^{2}_{n-1}$ when data are sampled from the normal distribution, Gulhar *et al.* (2012) proposed the interval

$$\frac{\sqrt{n-1}\,\hbox{cv}}{\sqrt{\chi_{n-1,1-\alpha/2}^{2}}},\ \frac{\sqrt{n-1}\,\hbox{cv}}{\sqrt{\chi_{n-1,\alpha/2}^{2}}},$$

which compared favorably to the median-modified intervals for larger CV values. We therefore use this interval as one of the competitors.

2.2 Two robust versions of the CV

We now consider two robust alternatives to the CV that are based on quantiles. The denominator for both measures is the median, a measure of centrality that is preferable to the mean for skewed distributions.

2.2.1 A version based on the IQR

An option for the numerator is to use the interquartile range (IQR). Shapiro (2005) gives this alternative as

$$\hbox{RCV}_{Q}=0.75\times\frac{\hbox{IQR}}{m},$$

where the multiplicative factor 0.75 makes $\hbox{RCV}_{Q}$ comparable to the CV for a normal distribution. To the best of our knowledge there has been no research into interval estimators of the $\hbox{RCV}_{Q}$ and this will be one of our foci shortly.

2.2.2 A version based on the median absolute deviation

The median absolute deviation (MAD; Hampel, 1974) is defined to be

$$\hbox{MAD}=\hbox{med}\,|x_{i}-m|,$$

where ‘med’ denotes the median and $i=1,\ldots,n$. Using the MAD for relative dispersion has recently been proposed (e.g. Reimann *et al.*, 2008; Varmuza & Filzmoser, 2009), giving

$$\hbox{RCV}_{M}=1.4826\times\frac{\hbox{MAD}}{m}.$$

The multiplier $1.4826=1/\Phi^{-1}(3/4)$, where $\Phi^{-1}$ denotes the quantile function for the $N(0,1)$ distribution, is used to achieve equivalence between $1.4826\times\hbox{MAD}$ and the standard deviation at the normal model. The quantity $1.4826\times\hbox{MAD}$ is commonly called the standardized MAD.
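To make the two definitions concrete, here is a small Python sketch of the sample versions $0.75\times\hbox{IQR}/m$ and $1.4826\times\hbox{MAD}/m$ next to the usual $s/\overline{x}$. The function names and the toy data are hypothetical. The sample quartiles use the default rule of `statistics.quantiles`, which is an assumption of this sketch because several sample-quartile conventions exist.

```python
from statistics import mean, median, quantiles, stdev


def cv(data):
    """Classical sample CV, s / x-bar."""
    return stdev(data) / mean(data)


def rcv_q(data):
    """IQR-based version: 0.75 * (q3 - q1) / median."""
    q1, _, q3 = quantiles(data, n=4)  # default 'exclusive' quartile rule (assumption)
    return 0.75 * (q3 - q1) / median(data)


def rcv_m(data):
    """MAD-based version: 1.4826 * med|x_i - m| / m."""
    m = median(data)
    mad = median([abs(x - m) for x in data])
    return 1.4826 * mad / m


clean = [2, 3, 4, 5, 6, 7, 8]
dirty = [2, 3, 4, 5, 6, 7, 800]  # same data with one gross outlier

for name, stat in (("CV", cv), ("RCV_Q", rcv_q), ("RCV_M", rcv_m)):
    print(name, round(stat(clean), 3), round(stat(dirty), 3))
```

In this toy example the single outlier changes the classical CV substantially, while the two quantile-based measures are unchanged because the quartiles, the median and the MAD of the sample do not move.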

3 Some comparisons between the measures

The question of interest is, can we do just as well (or better) in assessing the relative dispersion by replacing the population concepts $\mu$ and $\sigma$ by the median $m=x_{0.5}$ and interquartile range $\hbox{IQR}=q_{3}-q_{1}$ or the MAD?

In Table 1 we compare the CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ for several distributions. In most cases, the results show an approximate equivalence between the three measures when the underlying population is normal and closer agreement between the two for many other distributions. Hereafter our main interest is comparing the concepts CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ and the natural estimators of them.
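Some of the Table 1 entries can be checked directly from closed-form quantile functions. The sketch below is an illustrative check only: it evaluates $0.75\times\hbox{IQR}/m$ for the exponential and log-normal models and the CV of the log-normal, using the standard quantile functions of those distributions. The helper names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def rcv_q_from_quantile(Q):
    """Population RCV_Q = 0.75 * [Q(3/4) - Q(1/4)] / Q(1/2) for a quantile function Q."""
    return 0.75 * (Q(0.75) - Q(0.25)) / Q(0.5)


def exp_quantile(p):
    """Quantile function of EXP with rate 1; RCV_Q does not depend on the rate."""
    return -log(1.0 - p)


def lognormal_quantile(sigma, mu=0.0):
    """Quantile function of the log-normal with log-scale parameters mu and sigma."""
    z = NormalDist().inv_cdf
    return lambda p: exp(mu + sigma * z(p))


rcv_q_exp = rcv_q_from_quantile(exp_quantile)             # compare with 1.189
rcv_q_ln1 = rcv_q_from_quantile(lognormal_quantile(1.0))  # compare with 1.090
cv_ln1 = sqrt(exp(1.0 ** 2) - 1.0)                        # compare with 1.311
print(round(rcv_q_exp, 3), round(rcv_q_ln1, 3), round(cv_ln1, 3))
```

For the log-normal, the location parameter $\mu$ cancels in both ratios, which is why Table 1 can list LN$(\mu,1)$ and LN$(\mu,2)$ without fixing $\mu$.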

3.1 Properties

An essential property of a measure of relative dispersion is scale invariance. The CV is well-established, so competing measures should give roughly the same values when the underlying distribution is uni-modal and skewed to the right. As we have seen by examples, the plug-in estimator $s/\bar{x}$ of CV suffers from over-sensitivity to outliers. Table 2 provides a rough summary of results in this work.

In the next section, we briefly describe the methodology required to find standard errors and confidence intervals for CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ . We also investigate the robustness properties of the point estimators using theoretical methods and simulation studies and we illustrate our methods on a real data set. Finally, a summary and discussion of further possible work is in Section 6.

3.2 Influence functions

Consider a distribution function $F$ and suppose that a parameter of interest from $F$ is $\theta$. Let $\mathcal{T}$ be a statistical functional for an estimator of $\theta$ such that $\mathcal{T}(F)=\theta$ and $\mathcal{T}(F_{n})=\widehat{\theta}$ is an estimate of $\theta$, where $F_{n}$ denotes the empirical distribution function for a sample of $n$ observations from $F$. Now, for $0\leq\epsilon\leq 1$, define the ‘contamination’ distribution $F_{\epsilon}$ to have positive probability $\epsilon$ on $x$ (the contamination point) and $1-\epsilon$ on the distribution $F$, so that $F_{\epsilon}=(1-\epsilon)F+\epsilon\Delta_{x}$, where $\Delta_{x}$ denotes the distribution function that puts all of its mass at the point $x$. The influence of the contamination on the estimator with functional $\mathcal{T}$, relative to the proportion of contamination, is $[\mathcal{T}(F_{\epsilon})-\mathcal{T}(F)]/\epsilon$. The influence function (Hampel, 1974) is then defined for each $x$ as

$$\hbox{IF}(x;\,\mathcal{T},F)=\lim_{\epsilon\downarrow 0}\frac{\mathcal{T}(F_{\epsilon})-\mathcal{T}(F)}{\epsilon}\equiv\frac{\partial}{\partial\epsilon}\mathcal{T}(F_{\epsilon})\Big|_{\epsilon=0}.$$

A convenient way to appreciate the usefulness of the influence function in studying estimators is to consider the power series expansion $\mathcal{T}(F_{\epsilon})=\mathcal{T}(F)+\epsilon\,\hbox{IF}(x;\mathcal{T},F)+O(\epsilon^{2})$. Ignoring the error term $O(\epsilon^{2})$, which is negligible for small $\epsilon$, a larger $\left|\hbox{IF}(x;\mathcal{T},F)\right|$ means a larger influence of contamination on the estimator. Consequently, the influence function provides a very useful tool in the study of robustness of estimators.
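A finite-sample analog of the ratio $[\mathcal{T}(F_{\epsilon})-\mathcal{T}(F)]/\epsilon$ is obtained by taking $F$ to be the empirical distribution of a sample of size $n$ and $\epsilon=1/(n+1)$, so that $F_{\epsilon}$ is the empirical distribution of the sample with one extra point $x$. This quantity is usually called a sensitivity curve; it is swapped in here as an illustration and is not a method used in the paper. The function name `sensitivity` and the data are hypothetical.

```python
from statistics import mean, median


def sensitivity(stat, sample, x):
    """(n + 1) * [stat(sample + [x]) - stat(sample)], i.e. contamination share 1/(n + 1)."""
    n = len(sample)
    return (n + 1) * (stat(sample + [x]) - stat(sample))


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data

for x in (10, 100, 1000):
    # The mean reacts linearly in x, the median stops reacting once x is large.
    print(x, sensitivity(mean, sample, x), sensitivity(median, sample, x))
```

For the mean the sensitivity equals $x-\overline{x}$ exactly, which grows without bound in $x$, whereas for the median it stays constant once $x$ lies beyond the central order statistics.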

One can show (e.g., Hampel *et al.*, 1986; Staudte & Sheather, 1990) that for $X\sim F$, the mean and variance at $F$ of the random influence function are $\hbox{E}_{F}[\hbox{IF}(X;\,\mathcal{T},F)]=0$ and $\hbox{Var}_{F}[\hbox{IF}(X;\,\mathcal{T},F)]=\hbox{E}_{F}[\hbox{IF}^{2}(X;\,\mathcal{T},F)]$. A reason for finding this last variance is that it arises in the asymptotic variance of the estimator $\mathcal{T}(F_{n})$; that is,

$$n\,\hbox{Var}[\mathcal{T}(F_{n})]\to\hbox{ASV}(\mathcal{T},F)=\hbox{E}_{F}[\hbox{IF}^{2}(X;\,\mathcal{T},F)].$$

3.2.1 Influence function of the CV

Let $\mathcal{M}$ and $\mathcal{V}$ denote the functional for the usual mean and variance estimators such that, at $F$ , $\mathcal{M}(F)=\int xdF=\mu$ and $\mathcal{V}(F)=\int\left[x-\mathcal{M}(F)\right]^{2}df=\sigma^{2}$ . The respective influence functions are $\hbox{IF}(x;\,\mathcal{M},F)=x-\mu$ and $\hbox{IF}(x;\,\mathcal{V},F)=(x-\mu)^{2}-\sigma^{2}$ . For convenience in notation, let $\mathcal{CV}$ also denote the functional for the CV. Groeneveld, (2011) derives the influence function as

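$$\hbox{IF}(x;\,\mathcal{CV},F)=\hbox{CV}\left[\frac{(x-\mu)^{2}-\sigma^{2}}{2\sigma^{2}}-\frac{x-\mu}{\mu}\right].\qquad(3.2)$$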
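One way to see where an influence function of this kind comes from is to apply the chain rule to $\mathcal{CV}=\mathcal{V}^{1/2}/\mathcal{M}$ using the two influence functions quoted above. The lines below are a sketch of that calculation under this assumption, not a reproduction of Groeneveld's derivation.

```latex
\begin{align*}
\hbox{IF}(x;\,\mathcal{CV},F)
 &= \frac{\hbox{IF}(x;\,\mathcal{V},F)}{2\sigma\mu}
    -\frac{\sigma\,\hbox{IF}(x;\,\mathcal{M},F)}{\mu^{2}}\\
 &= \frac{\sigma}{\mu}\left[\frac{(x-\mu)^{2}-\sigma^{2}}{2\sigma^{2}}
    -\frac{x-\mu}{\mu}\right].
\end{align*}
```

The quadratic term $(x-\mu)^{2}$ is what makes this influence function unbounded in $x$.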

3.2.2 Influence function of the IQR-based RCV

The influence function of the $p$ th quantile $x_{p}=\mathcal{G}(F;p)=F^{-1}(p)$ is well-known (Staudte & Sheather, 1990, p. 59) to be $\hbox{IF}[x;\,\mathcal{G}(\,\cdot,p),F]=\{p-I[x_{p}\geq x]\}\,g(p)$, where $\mathcal{G}^{\prime}(F;p)=g(p)=1/f(x_{p})$ is the quantile density of $\mathcal{G}$ at $p$. The influence function of the ratio of two quantiles $\rho_{p,q}(F)=x_{p}/x_{q}=\mathcal{G}(\,\cdot,p)/\mathcal{G}(\,\cdot,q)$ is then found to be (Prendergast & Staudte, 2017a):

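$$\hbox{IF}(x;\,\rho_{p,q},F)=\rho_{p,q}\left\{\frac{\hbox{IF}[x;\,\mathcal{G}(\,\cdot,p),F]}{x_{p}}-\frac{\hbox{IF}[x;\,\mathcal{G}(\,\cdot,q),F]}{x_{q}}\right\}.\qquad(3.3)$$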

It then follows that the influence function of $\mathcal{RCV}_{Q}(F)=0.75\,\hbox{IQR}/m$ in terms of (3.3) is

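$$\hbox{IF}(x;\,\mathcal{RCV}_{Q},F)=0.75\left[\hbox{IF}(x;\,\rho_{3/4,1/2},F)-\hbox{IF}(x;\,\rho_{1/4,1/2},F)\right].\qquad(3.4)$$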
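To make the structure concrete, the R sketch below evaluates an influence function for $\hbox{RCV}_{Q}$ by writing $0.75\,\hbox{IQR}/m$ as $0.75(\rho_{3/4,1/2}-\rho_{1/4,1/2})$ and assuming the usual ratio rule $\hbox{IF}(\rho_{p,q})=\rho_{p,q}\{\hbox{IF}_{p}/x_{p}-\hbox{IF}_{q}/x_{q}\}$. The function names and the $N(5,1)$ model are hypothetical choices made for illustration only.

```r
# Influence function of the p-th quantile: {p - I[x <= x_p]} / f(x_p).
if_quantile <- function(x, p, qfun, dfun) {
  xp <- qfun(p)
  (p - as.numeric(x <= xp)) / dfun(xp)
}

# Influence function of the quantile ratio x_p / x_q (ratio rule, assumed form).
if_ratio <- function(x, p, q, qfun, dfun) {
  xp <- qfun(p)
  xq <- qfun(q)
  (xp / xq) * (if_quantile(x, p, qfun, dfun) / xp -
               if_quantile(x, q, qfun, dfun) / xq)
}

# RCV_Q = 0.75 * (x_0.75 - x_0.25) / x_0.5 is a difference of two such ratios.
if_rcvq <- function(x, qfun, dfun) {
  0.75 * (if_ratio(x, 0.75, 0.5, qfun, dfun) -
          if_ratio(x, 0.25, 0.5, qfun, dfun))
}

# Illustration at N(5, 1) (hypothetical model).
qfun_n <- function(p) qnorm(p, mean = 5, sd = 1)
dfun_n <- function(x) dnorm(x, mean = 5, sd = 1)
xs <- c(-1e6, 4.5, 5.2, 1e6)   # one point in each of the four quartile regions
vals <- if_rcvq(xs, qfun_n, dfun_n)
vals
```

The function is piecewise constant with jumps at the three quartiles, so the value at $10^{6}$ is the same as at any other point above the upper quartile. Because each quartile region has probability $1/4$, the four values average to zero, in line with $\hbox{E}_{F}[\hbox{IF}]=0$.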

3.2.3 Influence function of the MAD-based RCV

Let $\mathcal{MAD}$ denote the functional for the standardized MAD. The influence function for the MAD estimator was described by Hampel (1974) and its form for the standardized MAD for the standard normal distribution is (see, e.g., page 107 of Hampel *et al.*, 1986)

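$$\hbox{IF}(x;\,\mathcal{MAD},\Phi)=\frac{\hbox{sign}\left(|x|-\Phi^{-1}(3/4)\right)}{4\,\Phi^{-1}(3/4)\,\phi\left(\Phi^{-1}(3/4)\right)}.\qquad(3.5)$$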

It is not suitable for us to study the influence function for $\mathcal{RCV}_{M}$ at the standard normal model since the median is equal to zero. However, the influence function for the standardized MAD for an arbitrary mean, $\mu$ , for the normal distribution is simply (3.5) shifted to be centred at $\mu$ and therefore equal to $\hbox{IF}(x;\,\mathcal{MAD},\Phi_{\mu})=\hbox{IF}(x-\mu;\,\mathcal{MAD},\Phi)$ where we let $\Phi_{\mu}$ denote the distribution function for the $N(\mu,1)$ distribution.

Let $\mathcal{RCV}_{M}$ be the statistical functional for the MAD-based RCV such that $\mathcal{RCV}_{M}(F)=\mathcal{MAD}(F)/\mathcal{G}(F,1/2)=\hbox{RCV}_{M}$. Hence, using the Product Rule and the Chain Rule, the influence function for the $\hbox{RCV}_{M}$ estimator is

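$$\hbox{IF}(x;\,\mathcal{RCV}_{M},F)=\frac{\hbox{IF}(x;\,\mathcal{MAD},F)}{m}-\frac{\mathcal{MAD}(F)\,\hbox{IF}[x;\,\mathcal{G}(\,\cdot,1/2),F]}{m^{2}},\qquad m=\mathcal{G}(F,1/2).$$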
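The following R sketch evaluates an influence function of this kind at $N(\mu,1)$, where the standardized MAD equals 1 and the median equals $\mu$. It assumes the standard-normal form $\hbox{sign}(|x|-\Phi^{-1}(3/4))/\{4\Phi^{-1}(3/4)\phi(\Phi^{-1}(3/4))\}$ for the standardized MAD and the quotient rule for $\mathcal{MAD}/m$. The function names and the value $\mu=5$ are hypothetical.

```r
q75 <- qnorm(0.75)

# Standardized MAD at the standard normal (assumed form, see lead-in).
if_mad_std <- function(x) sign(abs(x) - q75) / (4 * q75 * dnorm(q75))

# Median at N(mu, 1): {1/2 - I[x <= mu]} / f(mu), with f(mu) = dnorm(0).
if_median <- function(x, mu) (0.5 - as.numeric(x <= mu)) / dnorm(0)

# Quotient rule for RCV_M = MAD / m at N(mu, 1), where MAD = 1 and m = mu.
if_rcvm <- function(x, mu) {
  if_mad_std(x - mu) / mu - 1 * if_median(x, mu) / mu^2
}

mu <- 5
xs <- c(3, 4.7, 5.3, 7)   # one point in each equal-probability region
if_rcvm(xs, mu)
```

The values change only when $x$ crosses $\mu-\Phi^{-1}(3/4)$, $\mu$ or $\mu+\Phi^{-1}(3/4)$, so the function is bounded, and the four regions each carry probability $1/4$ so the four values average to zero.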

The general form of the influence function for the MAD can be found in, for example, page 137 of Huber (1981), page 16 of Andersen (2008) and page 37 of Wilcox (2011), and this will be used to plot the influence functions for the non-Gaussian examples that follow.

3.2.4 Example influence function comparisons

To compute the true value of the MAD for the distributions being considered for influence function comparisons, and also when required later, we used the R function we have provided in Appendix B. Readers can use this code to compute the true MAD for any distribution.

In Plot A of Figure 1 we plot the influence functions for the three measures. The influence functions for the two robust measures are almost identical. In fact, it is known that the influence functions for the IQR and MAD are the same for the normal distribution (see page 110 of Hampel *et al.*, 1986) so that the measures share the same robustness properties for this model. The differences in Figure 1 are due to the multiplier 0.75 for the IQR-based measure, which was chosen to give approximate, rather than exact, equivalence for the normal. However, this does not generalize to all distributions. As expected, the influence function for the CV is unbounded, meaning that outliers are expected to have uncapped influence on the estimator as they move further from the population mean. On the other hand, the influence functions for the robust measures are bounded. Extreme outliers are expected to have no more influence on the estimators than, say, those closer to the 25th and 75th percentiles. However, the discontinuities at the median and the 25th and 75th percentiles suggest that the estimators are more sensitive locally in these areas.

3.3 Asymptotic variances and standard deviations

In this section, we further compare the estimators by deriving their asymptotic variances. As discussed in Section 3.2, for an estimator with functional $\mathcal{T}$ , the asymptotic standard deviation can be found by $\hbox{ASD}(\mathcal{T},F)\equiv\sqrt{\hbox{ASV}(\mathcal{T},F)}=\sqrt{\{\hbox{E}_{F}[\hbox{IF}^{2}(X;\,\mathcal{T},F)]\}}$ . We now derive the ASVs for the estimators before comparing their relative asymptotic standard deviations.

3.3.1 Asymptotic Variance of the CV estimator

Recall $\mu=\mathcal{M}(F)$ is the mean for distribution $F$ and let $\mu_{k}=\hbox{E}_{F}[\{X-\mathcal{M}(F)\}^{k}]$ denote the $k$ th central moment of $X\sim F$ where $\mu_{2}=\sigma^{2}=\mathcal{V}(F)$ denotes the variance. The influence function for the mean is $\hbox{IF}(x;\mathcal{M},F)=x-\mu$ and $E\left[\hbox{IF}(X;\mathcal{M},F)^{2}\right]=\sigma^{2}=\hbox{ASV}(\mathcal{M},F)$, the asymptotic variance of the mean estimator. Similarly, $\hbox{IF}(x;\mathcal{V},F)=(x-\mu)^{2}-\sigma^{2}$ and $E\left[\hbox{IF}(X;\mathcal{V},F)^{2}\right]=\mu_{4}-\sigma^{4}=\hbox{ASV}(\mathcal{V},F)$. Before deriving the ASV for the CV estimator, we note that $E\left[\hbox{IF}(X;\mathcal{M},F)\hbox{IF}(X;\mathcal{V},F)\right]$, which is the asymptotic covariance between the mean and variance estimators, is equal to $\mu_{3}$, since $E[(X-\mu)\{(X-\mu)^{2}-\sigma^{2}\}]=\mu_{3}-\sigma^{2}E[X-\mu]$. Now, from (3.2),

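$$\hbox{ASV}(\mathcal{CV},F)=\hbox{CV}^{2}\left[\frac{\mu_{4}-\sigma^{4}}{4\sigma^{4}}-\frac{\mu_{3}}{\mu\sigma^{2}}+\frac{\sigma^{2}}{\mu^{2}}\right],\qquad(3.7)$$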

assuming that the fourth moment exists.

Note that for $X\sim N(\mu,\sigma^{2})$, $\mu_{3}=0$ and $\mu_{4}=3\sigma^{4}$ so that $\hbox{ASV}(\mathcal{CV},F)=\hbox{CV}^{2}\left(1/2+\hbox{CV}^{2}\right)$, which is the asymptotic variance used by Miller (1991) in the construction of the asymptotic interval for the CV detailed in Section 2.1.
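The normal-model value $\hbox{CV}^{2}(1/2+\hbox{CV}^{2})$ can be checked numerically by integrating the squared influence function against the normal density. The R sketch below does this for a hypothetical $N(5,1)$ model, assuming the chain-rule form $\hbox{CV}[\{(x-\mu)^{2}-\sigma^{2}\}/(2\sigma^{2})-(x-\mu)/\mu]$ for the influence function of the CV; the names `if_cv`, `mean_if` and `asv_cv` are illustrative.

```r
mu <- 5; sigma <- 1          # hypothetical normal model N(5, 1)
cv <- sigma / mu

# Influence function of the CV via the chain rule on sqrt(V) / M (assumed form).
if_cv <- function(x) {
  cv * (((x - mu)^2 - sigma^2) / (2 * sigma^2) - (x - mu) / mu)
}

# E_F[IF] and E_F[IF^2] by numerical integration against the N(mu, sigma^2) density.
mean_if <- integrate(function(x) if_cv(x) * dnorm(x, mu, sigma), -Inf, Inf)$value
asv_cv  <- integrate(function(x) if_cv(x)^2 * dnorm(x, mu, sigma), -Inf, Inf)$value

closed_form <- cv^2 * (0.5 + cv^2)
c(mean_if = mean_if, asv_numeric = asv_cv, asv_closed_form = closed_form)
```

With these values the closed form is $0.04\times 0.54=0.0216$, and the numerical integral should agree with it up to integration error.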

3.3.2 Asymptotic Variance of the $\hbox{RCV}_{Q}$ estimator

The asymptotic variance of the estimator of $x_{p}$, the $p$ -th quantile, is well known to be (e.g., Ch. 2 of David, 1981; Ch. 3 of DasGupta, 2006) $\hbox{ASV}\left(\mathcal{G},F;p\right)=p(1-p)g^{2}(p)$ where, as denoted earlier, $g(p)=1/f(x_{p})$ and $f$ is the density function. This can also be verified using $\hbox{E}\left[\hbox{IF}(X;\mathcal{G}(\cdot,p),F)^{2}\right]$. Similarly, and as also found in the preceding references, the asymptotic covariance between the $p$ -th and $q$ -th quantile estimators is $\hbox{E}\left[\hbox{IF}(X;\mathcal{G}(\cdot,p),F)\hbox{IF}(X;\mathcal{G}(\cdot,q),F)\right]=p(1-q)g(p)g(q)$, provided $0<p<q<1$.
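Both expressions follow from the fact that $I[X\leq x_{p}]$ is a Bernoulli variable with success probability $p$, and that $I[X\leq x_{p}]\,I[X\leq x_{q}]=I[X\leq x_{p}]$ when $p<q$. A short sketch of the calculation, written with the quantile influence function $\{p-I[X\leq x_{p}]\}\,g(p)$:

```latex
\begin{align*}
\hbox{E}\left[\{p-I[X\leq x_{p}]\}^{2}\right]
 &= p\,(1-p)^{2}+(1-p)\,p^{2} = p(1-p),\\
\hbox{E}\left[\{p-I[X\leq x_{p}]\}\{q-I[X\leq x_{q}]\}\right]
 &= pq-pq-pq+p = p(1-q),\qquad p<q.
\end{align*}
```

Multiplying the first line by $g^{2}(p)$ and the second by $g(p)g(q)$ gives the variance and covariance stated above.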

The asymptotic variance for $\hbox{RCV}_{Q}=0.75\,\hbox{IQR}/m$ is obtained by a straightforward but lengthy derivation of $\hbox{E}\left[\hbox{IF}(X;\mathcal{RCV}_{Q},F)^{2}\right]$ with $\hbox{IF}(X;\mathcal{RCV}_{Q},F)$ defined in (3.4) (or by using the Delta method). Simplifying gives the following result.

Theorem 3.1.

The asymptotic variance for the estimator of $\hbox{RCV}_{Q}$ is

[TABLE]

The proof of Theorem 3.1 is in Section A.

3.3.3 Asymptotic Variance of the $\hbox{RCV}_{M}$ estimator

Falk (1997) proves the asymptotic joint normality of the $m(F_{n})$ and $\mathcal{MAD}(F_{n})$ estimators. Let $f=F^{\prime}$ be the density function associated with $F$. If $F$ is continuous near and differentiable at $F^{-1}(1/2)$, $F^{-1}(1/2)-\hbox{MAD}$ and $F^{-1}(1/2)+\hbox{MAD}$ with $f(F^{-1}(1/2))>0$ and $C_{1}=f(F^{-1}(1/2)-\hbox{MAD})+f(F^{-1}(1/2)+\hbox{MAD})>0$, then

[TABLE]

where ‘ $\stackrel{{\scriptstyle\text{\tiny approx.}}}{{\sim}}$ ’ denotes ‘approximately distributed as, for suitably large $n$ ’, $\mathbf{0}$ is a column vector of zeroes and $\bm{\Sigma}$ is a two-dimensional covariance matrix with $\text{vec}(\bm{\Sigma})=[\rho_{1},\rho_{12},\rho_{12},\rho_{2}]$. Hence, $\rho_{1}$ and $\rho_{2}$ are the asymptotic variances of the median and MAD estimators respectively and $\rho_{12}$ is the asymptotic covariance between the two. They are (e.g., Falk, 1997),

[TABLE]

where $C_{3}=f(F^{-1}(1/2)-\hbox{MAD})-f(F^{-1}(1/2)+\hbox{MAD})$ and $C_{2}=C_{3}^{2}+4C_{3}f(F^{-1}(1/2))(1-F(F^{-1}(1/2)+\hbox{MAD})-F(F^{-1}(1/2)-\hbox{MAD}))$ .

Using the above results and the Delta method (see, e.g., DasGupta, 2006), we derived the asymptotic variance of the $\hbox{RCV}_{M}$ estimator as given below:

[TABLE]

3.3.4 Relative asymptotic standard deviation comparisons

As an example, the asymptotic standard deviation (ASD) for the $\hbox{RCV}_{M}$ estimator is given as $\hbox{ASD}(\mathcal{RCV}_{M},F)=\sqrt{\hbox{ASV}(\mathcal{RCV}_{M},F)}$ and the ASDs for the other estimators are determined similarly. Later, we will construct approximate confidence intervals for the measures and therefore it makes sense that we use the ASD for comparisons here. Since the CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ represent different values, we use the relative (to the population parameter) ASD (rASD) to compare the estimators. For example, for the $\hbox{RCV}_{M}$ estimator this is defined to be $\hbox{rASD}(\mathcal{RCV}_{M},F)=\hbox{ASD}(\mathcal{RCV}_{M},F)/\mathcal{RCV}_{M}(F)$.

To compare the rASD for the estimators of CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$, we have selected the normal and lognormal distributions, both with varying $\sigma$, the exponential distribution and the Pareto type II distribution with varying shape. From Table 3, the rASDs for $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ are a little higher than the rASD of the CV for the normal distribution. However, the $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ estimators compare favorably to the CV for skewed distributions such as the lognormal and Pareto. The $p$ th central moment of the Pareto type II distribution exists only if $\alpha>p$, so the rASD for the CV estimator is undefined for $\alpha\leq 4$ since it requires the fourth central moment. When comparing $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$, the $\hbox{RCV}_{M}$ estimator is the better performer with smaller (or equal, in the case of the normal) rASD.

4 Inference

We want to compare point and interval estimators of $\hbox{CV}=\sigma/\mu$, $\hbox{RCV}_{Q}=0.75\,\hbox{IQR}/x_{0.5}$ and $\hbox{RCV}_{M}=1.4826\,\hbox{MAD}/x_{0.5}$. First, we introduce asymptotic Wald-type intervals using the asymptotic standard errors from earlier. With recent results highlighting very good coverage for estimators based on ratios of quantiles even for small samples (Prendergast & Staudte, 2016b, 2017a, 2017b; Arachchige *et al.*, 2019), we are confident of similarly good coverage for $\hbox{RCV}_{Q}$. We also propose an asymptotic interval for $\hbox{RCV}_{M}$ as well as bootstrap intervals.

We estimate the $p$ th quantile $x_{p}=G(p)=F^{-1}(p)$ by the Hyndman & Fan (1996) quantile estimator $\widehat{x}_{p}=\widehat{G}(p)$, which is a linear combination of two adjacent order statistics. It is readily available as the Type 8 quantile estimator in the R software (R Development Core Team, 2008).
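A minimal R sketch of the three point estimates, using `quantile()` with `type = 8` for the sample quantiles. The helper names `rcv_q` and `rcv_m` and the data vector are made up for illustration; they are not the authors' code.

```r
# RCV_Q estimate: 0.75 * (x_0.75 - x_0.25) / x_0.5 with Type 8 sample quantiles.
rcv_q <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 8, names = FALSE)
  0.75 * (q[3] - q[1]) / q[2]
}

# RCV_M estimate: 1.4826 * MAD / median, with the MAD taken about the sample median.
rcv_m <- function(x) {
  m <- quantile(x, probs = 0.5, type = 8, names = FALSE)
  1.4826 * median(abs(x - m)) / m
}

x <- c(2.1, 3.4, 3.9, 4.4, 5.0, 5.8, 6.3, 7.7, 9.2, 15.0)  # made-up data
c(cv = sd(x) / mean(x), rcv_q = rcv_q(x), rcv_m = rcv_m(x))
```

For the median, the Type 8 estimate coincides with the usual sample median, so `rcv_m(x)` matches `mad(x) / median(x)` from base R, whose default scaling constant is also 1.4826.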

4.1 Asymptotic confidence intervals

Let $z_{\alpha}=\Phi^{-1}(\alpha)$ denote the $\alpha$ quantile of the standard normal distribution. All our 100( $1-\alpha$ )% confidence intervals for measures of relative spread $\mathcal{T}(F)$ will be of the form:

$$\mathcal{T}(F_{n})\pm z_{1-\alpha/2}\,\widehat{\hbox{ASD}}(\mathcal{T},F_{n})/\sqrt{n},$$

where $\mathcal{T}(F_{n})$ is the estimator of $\mathcal{T}(F)$ and $\widehat{\hbox{ASD}}(\mathcal{T},F_{n})/\sqrt{n}\,$ is an estimate of its standard deviation (standard error) based on the sample. The actual coverage probability of this estimator depends on how quickly the distribution of $\mathcal{T}(F_{n})$ approaches normality, as well as the rate of convergence of $\mathcal{T}(F_{n})$ to $\mathcal{T}(F)$ and $\widehat{\hbox{ASD}}(\mathcal{T},F_{n})$ to $\hbox{ASD}(\mathcal{T},F).$

In constructing the interval estimators for the ratios, due to improved statistical performance such as quicker convergence to normality, it is common to first construct the interval for the log-transformed ratio followed by exponentiation to return to the original ratio scale. Let $W(F)=\ln[\mathcal{T}(F)]$; then, using the Delta method (e.g., Ch. 3 of DasGupta, 2006),

$$\hbox{ASV}(W,F)=\frac{\hbox{ASV}(\mathcal{T},F)}{\mathcal{T}^{2}(F)}.$$

Then $\widehat{\hbox{ASD}}(W,F_{n})=\{\widehat{\hbox{ASV}}(W,F_{n})\}^{1/2}$, where $\widehat{\hbox{ASV}}(W,F_{n})$ is an estimate of the asymptotic variance, enables one to construct the confidence interval for $W(F)$, which is based on the asymptotic normality of $W(F_{n})$, before exponentiating to return to the original scale.
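The log-then-exponentiate construction can be sketched in a few lines of R. The function name `log_wald_ci` and the numerical inputs are hypothetical; the only ingredient taken from the text is the Delta-method step $\hbox{ASD}(W)=\hbox{ASD}(\mathcal{T})/\mathcal{T}$.

```r
# Wald interval on the log scale, back-transformed to the original ratio scale.
# est: estimate T(F_n) > 0; asd_hat: estimated ASD of T(F_n); n: sample size.
log_wald_ci <- function(est, asd_hat, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  se_log <- asd_hat / (est * sqrt(n))   # Delta method: ASD(W) = ASD(T) / T
  exp(log(est) + c(-1, 1) * z * se_log)
}

ci <- log_wald_ci(est = 0.40, asd_hat = 0.60, n = 100)  # made-up inputs
ci
```

The resulting interval is always positive and is symmetric about the estimate on the log scale, so the geometric mean of its end points equals the estimate.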

4.1.1 Confidence interval for CV

A $(1-\alpha)\times 100$ % confidence interval for the CV, which is based on the asymptotic normality of $\widehat{\hbox{CV}}$ when the first four moments of $F$ exist is

[TABLE]

and later we refer to this confidence interval method as “Delta CV” in our simulation study. The ASV for the CV estimator is given in (3.7) and to obtain our asymptotic standard error we replace the population CV, $\sigma$ and $\mu$ with $\widehat{\hbox{cv}}$, the sample standard deviation $s$ and the sample mean $\overline{x}$ respectively. To estimate $\mu_{j}$ (the $j$ th central moment) we use $n^{-1}\sum^{n}_{i=1}(x_{i}-\overline{x})^{j}$.
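The plug-in recipe just described can be sketched in R as follows. Two things are assumptions here rather than statements from the paper: the ASV is taken to be $\hbox{CV}^{2}[(\mu_{4}-\sigma^{4})/(4\sigma^{4})-\mu_{3}/(\mu\sigma^{2})+\sigma^{2}/\mu^{2}]$, which is what $\hbox{E}[\hbox{IF}^{2}]$ gives for the chain-rule influence function of the CV, and the interval is formed on the original (not log) scale. The name `delta_cv_ci` and the simulated data are illustrative.

```r
delta_cv_ci <- function(x, alpha = 0.05) {
  n <- length(x)
  xbar <- mean(x)
  s <- sd(x)
  cv <- s / xbar
  m3 <- mean((x - xbar)^3)   # n^{-1} * sum (x_i - xbar)^3
  m4 <- mean((x - xbar)^4)   # n^{-1} * sum (x_i - xbar)^4
  # Plug-in estimate of the assumed ASV form (see lead-in).
  asv <- cv^2 * ((m4 - s^4) / (4 * s^4) - m3 / (xbar * s^2) + s^2 / xbar^2)
  half <- qnorm(1 - alpha / 2) * sqrt(asv / n)
  c(estimate = cv, lower = cv - half, upper = cv + half)
}

set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)   # simulated data with true CV = 0.2
res <- delta_cv_ci(x)
res
```

For very small or near-degenerate samples the plug-in ASV can be unstable, so a production version would need to guard against a non-positive variance estimate.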

4.1.2 Confidence interval for $\hbox{RCV}_{Q}$

A large-sample confidence interval for $\hbox{RCV}_{Q}=0.75\,\hbox{IQR}/m$ is in terms of the estimate $\widehat{\hbox{rcv}}_{Q}=0.75(\widehat{x}_{0.75}-\widehat{x}_{0.25})/\widehat{x}_{0.5}$

[TABLE]

The $\hbox{ASV}(\mathcal{RCV}_{Q},F)$ is given in Theorem 3.1 and to obtain $\widehat{\hbox{ASD}}(\mathcal{RCV}_{Q},F_{n})=\sqrt{\widehat{\hbox{ASV}}(\mathcal{RCV}_{Q},F_{n})}$, one needs to replace each $x_{p}$ by $\widehat{x}_{p}$ and each $g(p)$ by $\widehat{g}(p)$. For $\widehat{g}(p)$, we use a kernel density estimator with the Epanechnikov (1969) kernel and optimal bandwidth using the quantile optimality ratio of Prendergast & Staudte (2016a).

4.1.3 Confidence interval for $\hbox{RCV}_{M}$

A large-sample confidence interval for $\hbox{RCV}_{M}=1.4826\,\hbox{MAD}/m$ is in terms of $\widehat{\hbox{rcv}}_{M}=1.4826\,\widehat{\hbox{MAD}}/\widehat{x}_{0.5}$ ,

[TABLE]

Estimation of the MAD is trivial, requiring only routine coding if functionality is not already available (i.e., it is simply the median of the ordered absolute differences of the $x_{i}$ s from the sample median). We also need to estimate $\rho_{1}$, $\rho_{2}$ and $\rho_{12}$ in (3.8), and a simple approach using readily available software is to use the FKML parameterization (Freimer *et al.*, 1988) of the Generalized Lambda Distribution (GLD). Defined in terms of its quantile function

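$$Q(u)=\lambda_{1}+\frac{1}{\lambda_{2}}\left[\frac{u^{\lambda_{3}}-1}{\lambda_{3}}-\frac{(1-u)^{\lambda_{4}}-1}{\lambda_{4}}\right],\qquad 0\leq u\leq 1,$$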

where $\lambda_{i}$ $(i=1,\ldots,4)$ are location, inverse scale and two shape parameters, the GLD can approximate a very wide range of probability distributions (e.g., Karian & Dudewicz, 2000; Dedduwakumara *et al.*, 2019). To do so we use the method of moments estimators and density and quantile functions for the GLD in the R gld package (King *et al.*, 2016). It is then simple to estimate $\rho_{1}$, $\rho_{2}$ and $\rho_{12}$ using the quantile and density functions with the estimated GLD parameters and the estimated MAD.
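To show what the quantile and density functions of the GLD provide, the R sketch below writes out the FKML quantile function $Q(u)=\lambda_{1}+\lambda_{2}^{-1}[(u^{\lambda_{3}}-1)/\lambda_{3}-\{(1-u)^{\lambda_{4}}-1\}/\lambda_{4}]$ by hand and obtains the density from $f(Q(u))=1/Q^{\prime}(u)$. It deliberately avoids calling the gld package; the function names and parameter values are hypothetical, and the parameters are supplied directly instead of being estimated by the method of moments.

```r
# FKML GLD quantile function; lambda = c(location, inverse scale, shape, shape).
# Both shape parameters are assumed to be non-zero.
gld_q <- function(u, lambda) {
  lambda[1] + ((u^lambda[3] - 1) / lambda[3] -
               ((1 - u)^lambda[4] - 1) / lambda[4]) / lambda[2]
}

# Density evaluated at the quantile Q(u): f(Q(u)) = 1 / Q'(u).
gld_dens_at_q <- function(u, lambda) {
  lambda[2] / (u^(lambda[3] - 1) + (1 - u)^(lambda[4] - 1))
}

# Density at an arbitrary point x inside the support, inverting Q numerically.
gld_dens <- function(x, lambda) {
  u <- uniroot(function(u) gld_q(u, lambda) - x, c(1e-10, 1 - 1e-10))$root
  gld_dens_at_q(u, lambda)
}

lambda <- c(0, 2, 1, 1)   # hypothetical values; this choice is Uniform(-0.5, 0.5)
med <- gld_q(0.5, lambda)               # median
f_med <- gld_dens_at_q(0.5, lambda)     # density at the median
c(median = med, f_median = f_med, f_at_0.2 = gld_dens(0.2, lambda))
```

With estimated parameters in place of `lambda`, the same three functions give the median, the density at the median and the density at the median plus or minus the estimated MAD, which are the quantities that enter $\rho_{1}$, $\rho_{2}$ and $\rho_{12}$.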

In addition to the asymptotic interval above, we also consider two bootstrap confidence intervals.

Non-parametric bootstrap

A non-parametric bootstrap re-samples $n$ observations with replacement from the sample and estimates the MAD. This is repeated $B$ times; let $\widehat{\hbox{MAD}}^{i}$ $(i=1,\ldots,B)$ denote the $i$ th estimated MAD. The lower and upper bounds for the 95% bootstrap interval are then the 0.025 and 0.975 quantiles of the $\widehat{\hbox{MAD}}^{i}$ s.
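The resampling steps above can be sketched in R as a percentile bootstrap. The function names `boot_percentile_ci` and `mad_hat`, the number of replicates and the simulated data are illustrative assumptions; passing a different function as `stat` applies the same steps to another statistic.

```r
# Percentile bootstrap interval for a statistic computed by `stat`.
boot_percentile_ci <- function(x, stat, B = 2000, alpha = 0.05) {
  n <- length(x)
  # Re-sample n observations with replacement and recompute the statistic, B times.
  reps <- replicate(B, stat(sample(x, size = n, replace = TRUE)))
  quantile(reps, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

mad_hat <- function(x) median(abs(x - median(x)))   # unstandardized sample MAD

set.seed(2020)
x <- rlnorm(100, meanlog = 0, sdlog = 1)   # simulated data
ci <- boot_percentile_ci(x, mad_hat)
ci
```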

Parametric bootstrap

The parametric bootstrap interval is obtained in the same way as the non-parametric bootstrap with the exception that the sampling is done from a nominated, or estimated, density function. In this case, we use the estimated density from the FKML GLD as described above for the asymptotic interval. This is called the Generalized Bootstrap by Dudewicz (1992), who also uses the GLD, albeit with a different parameterization, as one example.

4.2 Confidence intervals for comparing two relative spreads

When data from two independent groups are available, it is straightforward to obtain interval estimators for the comparison of relative spread between the groups. Given that empirical evidence suggests excellent coverage can be achieved in the single sample case by using a log transformation, we propose to use the log ratio of two independent relative spread estimators followed by exponentiation back to the ratio scale. For example, writing $\widehat{r}=\widehat{\hbox{rcv}}_{M,1}/\widehat{\hbox{rcv}}_{M,2}$ for simplicity, an interval estimator for $\hbox{RCV}_{M,1}/\hbox{RCV}_{M,2}$, where $\hbox{RCV}_{M,1}$ and $\hbox{RCV}_{M,2}$ are the relative MAD-based spreads for independent populations, is

[TABLE]

where $n_{1}$ and $n_{2}$ are the sample sizes for simple random samples from the populations and where the estimates and asymptotic standard errors can be found as above for the single sample setting.
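A hedged R sketch of a two-sample interval of this kind is given below. It assumes the standard form for independent samples, in which the variance of $\ln\widehat{r}$ is the sum of the two single-sample log-scale variances $\widehat{\hbox{ASV}}_{j}/(n_{j}\,\widehat{\hbox{rcv}}_{j}^{2})$; the function name `ratio_ci` and all numerical inputs are made up.

```r
# Interval for a ratio of two independent relative spread measures.
# est1, est2: estimates; asd1, asd2: estimated ASDs; n1, n2: sample sizes.
ratio_ci <- function(est1, asd1, n1, est2, asd2, n2, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  # Variance of log(est1) - log(est2) for independent samples (Delta method).
  se_log <- sqrt(asd1^2 / (n1 * est1^2) + asd2^2 / (n2 * est2^2))
  r <- est1 / est2
  c(ratio = r, lower = r * exp(-z * se_log), upper = r * exp(z * se_log))
}

res <- ratio_ci(est1 = 0.40, asd1 = 0.60, n1 = 100,
                est2 = 0.55, asd2 = 0.80, n2 = 120)   # made-up inputs
res
```

An interval that excludes 1 would point to a difference in relative spread between the two populations.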

5 Simulations and Examples

5.1 Simulations

Firstly, a simulation study was conducted to compare the performance of the interval estimator of $\hbox{RCV}_{Q}$ and the asymptotic CV interval given in Section 4.1 with the methods given in Section 2.1, using coverage probability and width as performance measures. We have selected normal (N), log normal (LN), exponential (EXP), chi-square ( $\chi^{2}$ ) and Pareto (PAR) distributions with different parameter choices and with sample sizes $n=\{50,100,200,500,1000\}$. A total of 10,000 simulation trials were used.

In Table 4 we provide the simulation results for the CV and $\hbox{RCV}_{Q}$ intervals. For simplicity, the $\hbox{RCV}_{M}$ results follow in Table 5, where the bootstrap and asymptotic intervals are compared. From Table 4, the Panich, Med Mill and Gulhar interval estimators for the CV perform well for the normal distribution, with coverage approaching the nominal level as the sample size increases. However, coverage was typically below nominal for skewed distributions, pointing to unreliable performance of these estimators. The Delta CV interval of Section 4.1.1 provides improved coverage that is close to nominal as the sample size increases, with the exception of the PAR(5,1) distribution, for which the CV is undefined. The interval estimator for $\hbox{RCV}_{Q}$ was conservative, with coverage slightly above nominal for these simulations. The asymptotic interval for $\hbox{RCV}_{M}$ (Table 5) provides excellent coverage, even for $n=50$, for all distributions considered. With notably narrower intervals and very good coverage, the use of $\hbox{RCV}_{M}$ and the associated asymptotic interval estimator using estimated GLD functions is practically appealing. However, there does not appear to be a benefit in using a bootstrap approach, where coverage was typically more conservative.

5.1.1 A Shiny web application for the performance comparisons of the intervals

For further comparisons, we have developed a Shiny (Chang *et al.*, 2017) web application that readers can use to run the simulations with different parameter choices. This can be found at https://lukeprendergast.shinyapps.io/Robust_CV/. The user can change the distribution, parameters, sample size, probability and the number of trials. Once the desired options are selected, the ‘Run Simulation’ button can be pressed and the relevant estimates, coverage probability (cp) and the average width of the confidence interval (w) will be calculated according to the input choices. In addition, the bottom right hand corner of the web page shows the time taken to run each simulation.

5.2 Examples

We have selected two data sets, the doctor visits data and the Melbourne house price data, to apply our findings to real world data.

5.2.1 Doctor visits data

We selected the doctor visits data set used in Heritier *et al.* (2009) to apply our findings to a real world problem. The doctor visits data is a subsample of 3066 individuals of the AHEAD cohort (born before 1924) for wave 6 (year 2002) from the Health and Retirement Study (HRS), which surveys more than 22,000 Americans over the age of 50 every two years. We divided the data into two groups using gender as the grouping variable. The response variable of interest is the number of doctor visits. Table 6 provides summary statistics of the response variable for the two gender groups.

From Table 6, the summary statistics suggest that the doctor visits distributions are positively skewed, which is common for count variables. There is also a large outlier in the female group with a number of doctor visits equal to 750. We removed the outlier from the data set and recalculated the descriptive statistics for the female group, as shown in the third column of Table 6. The mean for the female group reduces after the removal of the outlier and the summary statistics still suggest positive skew.

Our objective was to compare the relative spread of the number of doctor visits between males and females. We used CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ to compare the relative spread of the number of doctor visits between males and females with and without an outlier.

Table 7 provides the bounds of the 95 percent confidence intervals for the three measures. The confidence interval for the CV is greatly influenced by whether or not the outlier in the female data is included. This is not the case for the intervals for the quantile-based measures. Additionally, the interval for the CV is wide compared to the intervals for $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$.

5.2.2 Melbourne house price data

The median is the most popular summary measure used to describe housing markets. Motivated by this, we applied our measures to Melbourne house clearance data from January 2016, which is available at https://www.kaggle.com/anthonypino/melbourne-housing-market. This data set contains suburb-wise prices for three types of houses (house, unit, townhouse). There are data for 369 suburbs; we removed the suburbs with fewer than 10 houses sold, leaving 301 suburbs.

We selected three pairs of suburbs, which were considered by Arachchige *et al.* (2019), to calculate the interval estimators for ratios of the CV, $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ to assess differences in relative spread of house prices.

Figure 2 shows that there are outliers for all suburbs except for Kingsbury. Additionally, there are differences in spread for the house price distributions between each pair of neighboring suburbs.

Ratios of the measures are reported in Table 8 to see whether there is a difference in relative spread between suburbs. Comparing Bundoora and Kingsbury, the measures provide different insights. While the box plot suggests greater spread in Kingsbury, the ratio of CVs suggests otherwise, having been highly influenced by outliers in Bundoora. The ratios of $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ suggest greater relative spread in Kingsbury, which is in better agreement with what is shown in the box plots. For Beaumaris and Black Rock, a significant difference is not found for the CVs and the interval is wide. However, the other intervals suggest a significant difference. All three measures suggest there is not a significant difference in relative spread of house price between Oakleigh and Oakleigh East, although the intervals for $\hbox{RCV}_{Q}$ and $\hbox{RCV}_{M}$ do tend towards suggesting a difference. Overall, the intervals are narrower for the quantile-based measures, having not been so greatly influenced by outliers.

6 Summary and discussion

We have proposed interval estimators for alternative robust measures of relative spread to the coefficient of variation. $\hbox{RCV}_{Q}$, a scalar multiple of the interquartile range divided by the median, is simple and the associated confidence intervals have very good coverage over a diverse range of distribution types. Similarly, the interval for $\hbox{RCV}_{M}$, where the MAD is used instead of the interquartile range, also has excellent coverage, and the estimator typically has smaller variability than the estimator for $\hbox{RCV}_{Q}$, making it a preferred candidate to be used instead of the CV. While we also considered bootstrap interval estimators for $\hbox{RCV}_{M}$, the asymptotic Wald-type interval based on the approximate variances of, and covariance between, the MAD and median estimators achieved excellent coverage even for sample sizes as small as 50. These robust intervals compare very favorably to those for the CV, where coverage is typically poor when the data is not sampled from a normal distribution. Our examples highlighted that they can provide very different insights into relative spread when compared to the CV, and the use of quantile-based measures is more easily justified when data is skewed due to the difficulty of interpreting the mean and variance.

Appendix A Proof of Theorem 3.1

Recall $\hbox{IF}(x;\,\rho_{p,q},F)$ and $\hbox{IF}(x;\,\mathcal{RCV}_{Q},F)$ in (3.3) and (3.4) respectively. For simplicity let $\hbox{IF}(x;\,\rho_{p,q},F)=\hbox{IF}_{\rho_{p,q}}$ , $\hbox{IF}(x;\,\mathcal{RCV}_{Q},F)=\hbox{IF}_{\mathcal{RCV}_{Q}}$ , $\hbox{IF}[\mathcal{G}(\,\cdot,p)]=\hbox{IF}_{\mathcal{G},p}$ and $\hbox{ASV}\left(\mathcal{G},F;p\right)=\hbox{ASV}_{\mathcal{G},p}$ . Then

[TABLE]

It can be shown,

[TABLE]

Similarly,

[TABLE]

and

[TABLE]

Substituting the above (A.2), (A.3),(A) in (A.1) and using $\hbox{ASV}\left(\mathcal{G},F;p\right)=p(1-p)g^{2}(p)$ gives

[TABLE]

Appendix B Computing the true MAD

Computing the true value of the MAD is not a trivial task. We provide an R function below that can be used to compute the true value of the MAD for a user-specified distribution.

```r
mad <- function(dist, param){
  # Computes the true value of the MAD for a specific
  # distribution with desired parameter choices.
  #
  # Args:
  #   dist: The distribution whose MAD
  #         is to be calculated.
  #   param: The parameter choices of the selected
  #          distribution whose MAD is to be calculated.
  #
  # Returns:
  #   The true value of the MAD for a specific
  #   distribution with desired parameter choices.
  qf <- paste0("q", dist)
  m <- do.call(qf, c(p = 0.5, param))  # find median
  # density of |X - m| evaluated at x
  abs.x.m <- function(x, dist, param, m){
    df <- paste0("d", dist)
    do.call(df, c(x = x + m, param)) + do.call(df, c(x = - x + m, param))
  }
  abs.x.m.vec <- Vectorize(abs.x.m, "x")

  # P(|X - m| <= x) - 0.5, whose root is the MAD
  f <- function(x, dist, param, m){
    integrate(abs.x.m.vec, lower = 0, upper = x,
              dist = dist, param = param, m = m)$value - 0.5
  }
  upper <- abs(do.call(qf, c(p = 0.75, param)) + m)
  uniroot(f, interval = c(0, upper), dist = dist, param = param, m = m)$root
}

mad("lnorm", list(meanlog=0, sdlog=1))
mad("exp", list(rate=1))
```
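As a cross-check on a function of this kind, the same quantity can be obtained from the distribution function instead of by integrating the density: the MAD is the value $d$ solving $F(m+d)-F(m-d)=1/2$. The R sketch below is an alternative written for illustration, not the authors' code; the name `true_mad` is hypothetical and was chosen so that it does not mask `stats::mad`.

```r
true_mad <- function(dist, param) {
  qf <- paste0("q", dist)
  pf <- paste0("p", dist)
  m <- do.call(qf, c(list(p = 0.5), param))   # population median
  # P(|X - m| <= d) - 1/2, written with the distribution function.
  h <- function(d) {
    do.call(pf, c(list(q = m + d), param)) -
      do.call(pf, c(list(q = m - d), param)) - 0.5
  }
  # The interquartile range bounds the MAD from above, because
  # [m - IQR, m + IQR] always contains [x_0.25, x_0.75].
  upper <- do.call(qf, c(list(p = 0.75), param)) -
    do.call(qf, c(list(p = 0.25), param))
  uniroot(h, interval = c(0, upper), tol = 1e-10)$root
}

true_mad("norm", list(mean = 0, sd = 1))   # equals qnorm(0.75)
true_mad("exp", list(rate = 1))            # equals asinh(0.5)
```

For the standard normal the answer is $\Phi^{-1}(3/4)$, and for the unit exponential the equation reduces to $\sinh(d)=1/2$, which gives two closed-form values to test against.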

Bibliography

Andersen, R. 2008. Modern methods for robust regression. Sage.
Arachchige, C. NPG, Cairns, M., & Prendergast, L. A. 2019. Interval estimators for ratios of independent quantiles and interquantile ranges. Commun. Stat. B-Simul. (accepted, June).
Atkinson, A. B. 1970. On the measurement of inequality. J. Econ. Theor., 2 (3), 244–263.
Bonett, D. G. 2006. Confidence interval for a coefficient of quartile variation. Comput. Stat. Data An., 50 (11), 2953–2957.
Bonett, D. G., & Seier, E. 2005. Confidence interval for a coefficient of dispersion in nonnormal distributions. Biometrical J., 47 (1), 144–148.
Bulent, A., & Hamza, G. 2018. Bootstrap confidence intervals for the coefficient of quartile variation. Commun. Stat. B-Simul., In Press, 1–9.
Chang, W., Cheng, J., Allaire, J. J., Xie, Y., & McPherson, J. 2017. shiny: Web application framework for R. R package version 1.0.5.
Chen, J., & Fleisher, B. M. 1996. Regional income inequality and economic growth in China. J. Comp. Econ., 22 (2), 141–164.
