Robust analogs to the Coefficient of Variation
Chandima N. P. G. Arachchige, Luke A. Prendergast, Robert G. Staudte

TL;DR
This paper explores robust, quantile-based alternatives to the coefficient of variation for measuring relative dispersion, especially in the presence of outliers or skewed distributions, through theoretical analysis and simulations.
Contribution
It introduces and evaluates median-based and interquartile range-based measures as robust alternatives to the CV, addressing its sensitivity to outliers and skewness.
Findings
Quantile-based measures are more robust to outliers.
Median-based measures perform better with skewed data.
Simulation studies show improved coverage for proposed estimators.
Table 1: The CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$ for several distributions.

| Distribution | CV | $\mathrm{RCV}_Q$ = 0.75 × IQR / median | $\mathrm{RCV}_M$ = 1.4826 × MAD / median |
|---|---|---|---|
| Normal(,) | | | |
| EXP() | 1 | 1.189 | 1.030 |
| Uniform | | | |
| WEI(, 1) | 1 | 1.189 | 1.029 |
| WEI(, 2) | 0.523 | 0.578 | 0.565 |
| WEI(, 5) | 0.229 | 0.232 | 0.229 |
| | 1 | 1.189 | 1.030 |
| | 0.632 | 0.681 | 0.646 |
| LN | 1.311 | 1.090 | 0.888 |
| LN | 7.321 | 2.695 | 1.333 |
| PAR | 2.236 | 1.453 | 1.120 |
| PAR | 1.291 | 1.313 | 1.077 |

Table 2: Rough summary of the properties of the CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$.

| Property | CV | $\mathrm{RCV}_Q$ | $\mathrm{RCV}_M$ |
|---|---|---|---|
| P1: Scale invariant | + | + | + |
| P2: Simple to understand | + | + | 0 |
| P3: Widely accepted and used | + | 0 | |
| P4: Defined for all $F$ | [1] | + | + |
| P5: Bounded influence function | | + | + |
| Property | | | |
| P6: Consistency | 0 [2] | + | + |
| P7: Asymptotic normality | 0 | + | + |
| P8: Standard error formula available | + | + | + |
| P9: Unaffected by 1% moderate outliers | 0 | + | + |
| P10: Unaffected by 1% extreme outliers | | + | + |
| P11: Reliable coverage of confidence intervals | | + | + |

[1] The CV is only defined if $F$ has a finite variance, but this is usually satisfied for diameter distribution models.
[2] Consistency and asymptotic normality require the existence of certain moments for $F$.

Table 3: Relative asymptotic standard deviations (rASD) for the estimators of CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$.

| Distribution | rASD for the CV estimator | rASD for the $\mathrm{RCV}_Q$ estimator | rASD for the $\mathrm{RCV}_M$ estimator |
|---|---|---|---|
| N | 0.714 | 1.173 | 1.173 |
| | 0.735 | 1.193 | 1.193 |
| | 0.768 | 1.225 | 1.225 |
| | 0.812 | 1.270 | 1.270 |
| | 0.866 | 1.324 | 1.324 |
| | 0.927 | 1.388 | 1.388 |
| LN | 0.721 | 1.172 | 1.164 |
| | 0.801 | 1.199 | 1.149 |
| | 1.151 | 1.294 | 1.098 |
| | 2.075 | 1.438 | 1.017 |
| | 4.674 | 1.621 | 0.914 |
| | 49.298 | 2.062 | 0.669 |
| EXP | 1 | 1.594 | 0.950 |
| PAR() | Undefined | 3.223 | 0.419 |
| | Undefined | 2.236 | 0.664 |
| | Undefined | 1.976 | 0.735 |
| | Undefined | 1.862 | 0.785 |
| | Undefined | 1.799 | 0.816 |
| | Undefined | 1.760 | 0.837 |
| | 54.482 | 1.714 | 0.864 |
| | 5.619 | 1.699 | 0.873 |
| | 3.724 | 1.687 | 0.880 |
| | 2.937 | 1.678 | 0.887 |
| | 2.500 | 1.670 | 0.892 |
| | 2.221 | 1.664 | 0.897 |

| Sample size (n) | Distribution | Panich | Med Mill | Med MMcK | Gulhar method | Inverse method | Delta CV | $\mathrm{RCV}_Q$ |
|---|---|---|---|---|---|---|---|---|
| 50 | N(5, 1) | 0.927(0.08) | 0.937(0.08) | 0.941(0.08) | 0.943(0.08) | 0.838(0.06) | 0.929(0.08) | 0.979(0.16) |
| | LN(0, 1) | 0.688(0.97) | 0.817(1.03) | 0.803(1.07) | 0.508(0.48) | 0.808(4.85) | 0.997(7.81*) | 0.983(1.30) |
| | EXP(1) | 0.965(0.78) | 0.978(0.73) | 0.981(0.88) | 0.887(0.40) | 0.992(3.54) | 0.997(0.68) | 0.985(1.30) |
| | Chi(5) | 0.954(0.34) | 0.971(0.35) | 0.966(0.36) | 0.918(0.26) | 0.999(0.76) | 0.959(0.33) | 0.977(0.58) |
| | PAR(1, 4) | 0.746(1.12) | 0.866(1.19) | 0.836(1.22) | 0.552(0.52) | 0.720(2.97) | 1.000(3.57E+9*) | 0.985(1.70) |
| 100 | N(5, 1) | 0.938(0.06) | 0.949(0.06) | 0.948(0.06) | 0.943(0.06) | 0.900(0.05) | 0.938(0.06) | 0.978(0.11) |
| | LN(0, 1) | 0.755(0.85) | 0.842(0.77) | 0.867(0.96) | 0.453(0.35) | 0.926(2.69) | 0.980(5.64) | 0.975(0.82) |
| | EXP(1) | 0.979(0.55) | 0.988(0.52) | 0.991(0.62) | 0.863(0.28) | 1.000(2.06) | 0.983(0.43) | 0.971(0.84) |
| | Chi(5) | 0.966(0.24) | 0.961(0.34) | 0.975(0.26) | 0.909(0.18) | 1.000(0.59) | 0.953(0.22) | 0.971(0.39) |
| | PAR(1, 4) | 0.812(0.99) | 0.887(0.88) | 0.914(1.11) | 0.471(0.37) | 0.890(1.79) | 1.000(1.19E+6*) | 0.978(1.07) |
| 200 | N(5, 1) | 0.947(0.04) | 0.946(0.04) | 0.945(0.04) | 0.940(0.04) | 0.955(0.04) | 0.942(0.04) | 0.979(0.08) |
| | LN(0, 1) | 0.783(0.67) | 0.828(0.56) | 0.892(0.75) | 0.404(0.25) | 0.979(2.46) | 0.970(2.30) | 0.967(0.55) |
| | EXP(1) | 0.987(0.39) | 0.988(0.37) | 0.997(0.43) | 0.850(0.20) | 1.000(1.44) | 0.974(0.29) | 0.966(0.57) |
| | Chi(5) | 0.976(0.17) | 0.967(0.17) | 0.978(0.18) | 0.911(0.13) | 1.000(0.47) | 0.955(0.15) | 0.968(0.27) |
| | PAR(1, 4) | 0.822(0.78) | 0.871(0.65) | 0.929(0.87) | 0.422(0.27) | 0.970(4.20) | 0.999(1.16E+4*) | 0.969(0.71) |
| 500 | N(5, 1) | 0.944(0.03) | 0.949(0.03) | 0.950(0.03) | 0.944(0.02) | 0.987(0.03) | 0.950(0.03) | 0.967(0.05) |
| | LN(0, 1) | 0.792(0.44) | 0.782(0.36) | 0.923(0.49) | 0.360(0.16) | 0.998(2.21) | 0.965(1.21) | 0.961(0.33) |
| | EXP(1) | 0.991(0.25) | 0.960(0.23) | 0.994(0.27) | 0.841(0.12) | 1.000(1.00) | 0.959(0.18) | 0.960(0.35) |
| | Chi(5) | 0.976(0.11) | 0.956(0.11) | 0.966(0.11) | 0.914(0.08) | 1.000(0.36) | 0.951(0.09) | 0.960(0.17) |
| | PAR(1, 4) | 0.833(0.52) | 0.828(0.42) | 0.952(0.58) | 0.368(0.17) | 0.995(1.85) | 0.999(291.02*) | 0.963(0.43) |
| 1000 | N(5, 1) | 0.952(0.02) | 0.949(0.02) | 0.951(0.02) | 0.943(0.02) | 0.997(0.03) | 0.954(0.02) | 0.960(0.03) |
| | LN(0, 1) | 0.751(0.31) | 0.739(0.26) | 0.874(0.35) | 0.336(0.11) | 0.999(1.51) | 0.959(0.81) | 0.959(0.23) |
| | EXP(1) | 0.992(0.18) | 0.884(0.16) | 0.964(0.19) | 0.834(0.09) | 1.000(0.79) | 0.955(0.13) | 0.958(0.24) |
| | Chi(5) | 0.979(0.08) | 0.928(0.08) | 0.950(0.08) | 0.906(0.06) | 1.000(0.29) | 0.949(0.07) | 0.958(0.12) |
| | PAR(1, 4) | 0.797(0.37) | 0.794(0.30) | 0.923(0.41) | 0.339(0.12) | 0.998(1.65) | 0.998(52.74*) | 0.956(0.30) |

| Sample size (n) | Distribution | Non-parametric method | Parametric method | Asymptotic method |
|---|---|---|---|---|
| 50 | N(5, 1) | 0.9740(0.141) | 0.9616(0.131) | 0.9525(0.134) |
| | LN(0, 1) | 0.9772(0.479) | 0.9839(0.441) | 0.9665(0.524) |
| | EXP(1) | 0.9758(0.565) | 0.9893(0.508) | 0.9719(0.601) |
| | Chi(5) | 0.9763(0.421) | 0.9840(0.394) | 0.9557(0.413) |
| | PAR(1, 4) | 0.9777(0.549) | 0.9874(0.493) | 0.9751(0.619) |
| 100 | N(5, 1) | 0.9759(0.099) | 0.9795(0.093) | 0.9493(0.094) |
| | LN(0, 1) | 0.9749(0.337) | 0.9859(0.327) | 0.9673(0.370) |
| | EXP(1) | 0.9762(0.402) | 0.9946(0.374) | 0.9648(0.411) |
| | Chi(5) | 0.9738(0.296) | 0.9776(0.284) | 0.9588(0.291) |
| | PAR(1, 4) | 0.9748(0.389) | 0.9933(0.362) | 0.9697(0.414) |
| 200 | N(5, 1) | 0.9725(0.069) | 0.9826(0.066) | 0.9520(0.066) |
| | LN(0, 1) | 0.9724(0.235) | 0.9688(0.236) | 0.9726(0.265) |
| | EXP(1) | 0.9720(0.282) | 0.9965(0.270) | 0.9591(0.287) |
| | Chi(5) | 0.9704(0.207) | 0.9848(0.201) | 0.9576(0.205) |
| | PAR(1, 4) | 0.9729(0.272) | 0.9903(0.261) | 0.9681(0.283) |
| 500 | N(5, 1) | 0.9644(0.043) | 0.9851(0.042) | 0.9505(0.042) |
| | LN(0, 1) | 0.9668(0.147) | 0.9257(0.150) | 0.9757(0.169) |
| | EXP(1) | 0.9624(0.177) | 0.9962(0.173) | 0.9564(0.180) |
| | Chi(5) | 0.9678(0.129) | 0.9877(0.127) | 0.9574(0.129) |
| | PAR(1, 4) | 0.9681(0.171) | 0.9570(0.167) | 0.9635(0.176) |
| 1000 | N(5, 1) | 0.9582(0.030) | 0.9861(0.029) | 0.9495(0.030) |
| | LN(0, 1) | 0.9616(0.103) | 0.8247(0.106) | 0.9793(0.120) |
| | EXP(1) | 0.9612(0.124) | 0.9757(0.123) | 0.9569(0.128) |
| | Chi(5) | 0.9640(0.091) | 0.9834(0.090) | 0.9571(0.092) |
| | PAR(1, 4) | 0.9606(0.119) | 0.8029(0.118) | 0.9621(0.124) |

| Summary statistic | Male | Female | Female (without outlier) |
|---|---|---|---|
| Sample Size | 987 | 2079 | 2078 |
| Minimum | 0 | 0 | 0 |
| 1st Quartile | 4 | 4 | 4 |
| Median | 8 | 8 | 8 |
| Mean | 12.08 | 12.8 | 12.45 |
| 3rd Quartile | 14 | 15 | 15 |
| Maximum | 300 | 750 | 365 |
| Sample | CV | RCVQ | RCVM |
|---|---|---|---|
| Male | |||
| Female | |||
| Female, outlier excluded |

| Confidence interval method | Bundoora Kingsbury (LB) | Bundoora Kingsbury (UB) | Black Rock Beaumaris (LB) | Black Rock Beaumaris (UB) | Oakleigh Oakleigh East (LB) | Oakleigh Oakleigh East (UB) |
|---|---|---|---|---|---|---|
| | 1.0156 | 1.6079 | 0.6525 | 1.3225 | 0.7219 | 1.3519 |
| | 0.4336 | 0.9736 | 0.4844 | 0.9243 | 0.4607 | 1.0914 |
| | 0.5392 | 1.0808 | 0.5751 | 0.9366 | 0.5286 | 1.0218 |
Robust analogs to the Coefficient of Variation
Chandima N. P. G. Arachchige
Department of Mathematics and Statistics, La Trobe University
Luke A. Prendergast
Department of Mathematics and Statistics, La Trobe University
Robert G. Staudte
Department of Mathematics and Statistics, La Trobe University
Abstract
The coefficient of variation (CV) is commonly used to measure relative dispersion. However, since it is based on the sample mean and standard deviation, outliers can adversely affect the CV. Additionally, for skewed distributions the mean and standard deviation do not have natural interpretations and, consequently, neither does the CV. Here we investigate the extent to which quantile-based measures of relative dispersion can provide appropriate summary information as an alternative to the CV. In particular, we investigate two measures, the first being the interquartile range (in lieu of the standard deviation), divided by the median (in lieu of the mean), and the second being the median absolute deviation (MAD), divided by the median, as robust estimators of relative dispersion. In addition to comparing the influence functions of the competing estimators and their asymptotic biases and variances, we compare interval estimators using simulation studies to assess coverage.
Keywords: influence function, median absolute deviation, quantile density
1 Introduction
The coefficient of variation (CV), defined as the ratio of the standard deviation to the mean, is the most commonly used measure of relative dispersion. It has applications in many areas, including engineering, physics, chemistry, medicine, economics and finance, to name just a few. For example, in analytical chemistry the CV is widely used to express the precision and repeatability of an assay (Reed *et al.*, 2002). In finance the CV is often considered useful in measuring relative risk (Miller & Karson, 1977), where a test of the equality of the CVs for two stocks can be performed to compare risk. In economics, the CV is a summary statistic of inequality (e.g. Atkinson, 1970; Chen & Fleisher, 1996). Other examples use the CV to assess the homogeneity of bone test samples (Hamer *et al.*, 1995), to assess the strength of ceramics (Gong & Li, 1999) and as a summary statistic to describe the development of age- and sex-specific cut-off points for body-mass indexing in overweight children (Cole *et al.*, 2000).
The lack of robustness to outliers of moment-based measures such as the mean and standard deviation has long been known. Almost a century ago Lovitt & Holtzclaw (1929) proposed a measure called the “coefficient of variability” based on the upper and lower quartiles ($Q_3$ and $Q_1$). Promoted as an alternative to the CV, it was defined to be $(Q_3 - Q_1)/(Q_3 + Q_1)$. Bonett (2006) has since called this measure the “coefficient of quartile variation” and introduced an interval estimator which exhibited good coverage even for small samples. This measure was recently re-investigated by Bulent & Hamza (2018), who constructed bootstrap confidence intervals that typically provide conservative coverage. Another alternative measure is the ratio of the mean absolute deviation from the median to the median. This measure has applications in tax assessments (Gastwirth, 1982) and confidence intervals have been considered by Bonett & Seier (2005). The mean absolute deviation is still non-robust to outliers, and robustness can be improved (see e.g. Shapiro, 2005; Reimann *et al.*, 2008; Varmuza & Filzmoser, 2009) by instead using the interquartile range (IQR) or the median absolute deviation (MAD).
For decades, interval estimation for the CV has attracted the attention of many researchers. For example, Gulhar *et al.* (2012) compared no fewer than 15 parametric and non-parametric confidence interval estimators of the population CV. To the best of our knowledge, interval estimators have not been introduced for the versions of the coefficient of variation based on the IQR and MAD. Therefore, given the evident need for interval estimators that has attracted the interest of so many others, one aim of this paper is to provide reliable interval estimators. We are motivated to do so by noting the excellent coverage achieved for measures based on ratios of quantiles, even for small samples (Prendergast & Staudte, 2016b, 2017a, 2017b; Arachchige *et al.*, 2019).
2 Notations and some selected methods
Let $X_1,\dots,X_n$ be an independent and identically distributed sample of size $n$ from a distribution with distribution function $F$. Then the sample mean estimator is $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i$ and the sample variance estimator is $S^2 = (n-1)^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$. The sample coefficient of variation estimator is then $\widehat{\mathrm{CV}} = S/\bar{X}$. Next let $\mathcal{F}$ be the class of all right-continuous cdfs on the positive axis; that is, each $F \in \mathcal{F}$ satisfies $F(0) = 0$. For a sample denoted $x_1,\dots,x_n$, the statistics $\bar{x}$, $s$ and $cv = s/\bar{x}$ are the observed values of the $\bar{X}$, $S$ and $\widehat{\mathrm{CV}}$ estimators above, and are therefore estimates of the unknown population parameters $\mu$, $\sigma$ and $\mathrm{CV} = \sigma/\mu$, assuming the first two moments of $F$ exist.
For each such $F$ define the associated left-continuous quantile function of $F$ by $Q(u) = \inf\{x : F(x) \ge u\}$, for $0 < u < 1$. When the population $F$ is understood to be fixed but unknown, we sometimes simply write $x_u = Q(u)$ and write the corresponding estimators of these population quantiles as $\hat{x}_u$. We restrict attention to the quartiles $x_{0.25}$, $x_{0.5}$ and $x_{0.75}$, the sample estimates of which we denote $q_1$, $m$ and $q_3$ for convenience.
2.1 Selected interval estimators of the CV
We begin by describing the inverse method (Sharma & Krishna, 1994) for obtaining an interval estimator for the CV since it is perhaps the most naturally arising interval, involving only basic principles. As additional methods for comparison later, we have chosen four of the 15 considered in Gulhar *et al.* (2012) that exhibited comparatively good performance in terms of coverage.
While parametric interval estimators for the CV, such as those that we present below, have typically been developed assuming an underlying normal distribution, for large sample sizes they can also perform well under deviations from normality, owing to the Central Limit Theorem (Gulhar *et al.*, 2012).
The inverse method
Using the above notation, for suitably large , is approximately distributed. An approximate % confidence interval for is therefore . Noting that is simply the inverse of the population CV, an approximate 95% confidence interval for the CV can therefore be obtained by inverting this interval for , giving (Sharma & Krishna,, 1994)
[TABLE]
Robustness of this interval estimator was recently re-investigated by Groeneveld (2011).
The median-modified Miller interval (Med Mill)
The CV estimator has an approximate asymptotic normal distribution with mean CV and variance leading to an asymptotic interval proposed by Miller, (1991). In noting that the mean is a poor summary statistic of central location for skewed distributions, Gulhar *et al. *, (2012) proposed a median modification where the sample median replaces the sample mean in . Let and , the interval estimator is
[TABLE]
While simulations conducted by Gulhar *et al.* (2012) using data sampled from chi-square and gamma distributions showed typically good results for the Miller (1991) interval, coverage was often better, or at least similar, when using the median modification. With our interest mainly in skewed distributions, we focus on the median-modified interval in (2.2).
Median modification of the modified McKay (Med MMcK)
Gulhar *et al.* (2012) also introduced a median modification to the modified McKay interval (McKay, 1932; Vangel, 1996). The median-modified interval is
[TABLE]
where is the -th percentile of a chi-square distribution with degrees of freedom. We focus on this median modified interval based on the results in Gulhar *et al. *, (2012).
The Panich method
Panichkitkosolkul, (2009) has further modified the Modified McKay (Vangel,, 1996) interval by replacing the sample CV with the maximum likelihood estimator for a normal distribution, . The interval is
[TABLE]
The Gulhar method
Using the fact that when data is sampled from the normal distribution, Gulhar *et al. *, (2012) proposed the interval,
[TABLE]
which compared favorably to the median-modified intervals for larger CV values. We therefore use this interval as one of the competitors.
2.2 Two robust versions of the CV
We now consider two robust alternatives to the CV that are based on quantiles. The denominator for both measures is the median, which is preferred to the mean as a measure of centrality for skewed distributions.
2.2.1 A version based on the IQR
An option for the numerator is to use the interquartile range (IQR). Shapiro (2005) gives this alternative as
$$\mathrm{RCV}_Q = 0.75 \times \frac{\mathrm{IQR}}{m},$$
where $m$ denotes the median and the multiplicative factor 0.75 makes $\mathrm{RCV}_Q$ comparable to the CV for a normal distribution. To the best of our knowledge there has been no research into interval estimators of the $\mathrm{RCV}_Q$ and this will be one of our foci shortly.
2.2.2 A version based on the median absolute deviation
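The median absolute deviation (MAD; Hampel, 1974) is defined to be
$$\mathrm{MAD} = \operatorname{med}\,|x_i - m|,$$
where ‘med’ denotes the median, $m$ is the median of the $x_i$, and $i = 1,\dots,n$. Using the MAD for relative dispersion has been recently proposed (e.g. Reimann *et al.*, 2008; Varmuza & Filzmoser, 2009) giving
$$\mathrm{RCV}_M = 1.4826 \times \frac{\mathrm{MAD}}{m}.$$
The multiplier $1.4826 = 1/\Phi^{-1}(0.75)$, where $\Phi^{-1}$ denotes the quantile function for the $N(0,1)$ distribution, is used to achieve equivalence between $1.4826 \times \mathrm{MAD}$ and the standard deviation at the normal model. $1.4826 \times \mathrm{MAD}$ is commonly called the standardized MAD.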
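As a rough illustration of the two definitions above, the following Python sketch computes the sample CV alongside the IQR-based and MAD-based analogs. The data are hypothetical, the function names are ours, and the quartiles come from Python's default rule in `statistics.quantiles` rather than any particular estimator used in the paper, so the values should be read as indicative only.

```python
import statistics


def cv(x):
    """Classical CV: sample standard deviation divided by the sample mean."""
    return statistics.stdev(x) / statistics.mean(x)


def rcv_q(x):
    """IQR-based analog: 0.75 * IQR / median."""
    # Assumption: quartiles come from Python's default ("exclusive") rule.
    q1, _, q3 = statistics.quantiles(x, n=4)
    return 0.75 * (q3 - q1) / statistics.median(x)


def rcv_m(x):
    """MAD-based analog: 1.4826 * MAD / median."""
    m = statistics.median(x)
    mad = statistics.median([abs(xi - m) for xi in x])
    return 1.4826 * mad / m


# Hypothetical data: nine positive measurements, then the same with one gross outlier.
clean = [8.1, 9.4, 9.9, 10.2, 10.4, 10.9, 11.3, 12.0, 12.6]
contaminated = clean + [250.0]

for name, stat in [("CV", cv), ("RCV_Q", rcv_q), ("RCV_M", rcv_m)]:
    print(f"{name}: clean={stat(clean):.3f}, contaminated={stat(contaminated):.3f}")
```

On data like these the single outlier inflates the CV many times over, while the two quantile-based measures move only slightly.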
3 Some comparisons between the measures
The question of interest is, can we do just as well (or better) in assessing the relative dispersion by replacing the population concepts $\mu$ and $\sigma$ by the median and the interquartile range or the MAD?
In Table 1 we compare the CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$ for several distributions. In most cases, the results show an approximate equivalence between the three measures when the underlying population is normal and closer agreement between the two for many other distributions. Hereafter our main interest is comparing the concepts CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$ and the natural estimators of them.
3.1 Properties
An essential property of a measure of relative dispersion is scale invariance. The CV is well-established, so competing measures should give roughly the same values when the underlying distribution is uni-modal and skewed to the right. As we have seen by examples, the plug-in estimator of CV suffers from over-sensitivity to outliers. Table 2 provides a rough summary of results in this work.
In the next section, we briefly describe the methodology required to find standard errors and confidence intervals for CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$. We also investigate the robustness properties of the point estimators using theoretical methods and simulation studies and we illustrate our methods on a real data set. Finally, a summary and discussion of further possible work is in Section 6.
3.2 Influence functions
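Consider a distribution function $F$ and suppose that a parameter of interest from $F$ is $\theta$. Let $T$ be a statistical functional for an estimator of $\theta$ such that $T(F) = \theta$ and $T(F_n) = \hat{\theta}$, where $F_n$ denotes the empirical distribution function for a sample of $n$ observations from $F$ and $\hat{\theta}$ denotes an estimate of $\theta$. Now, for $0 \le \epsilon \le 1$, define the ‘contamination’ distribution $F_\epsilon^{(z)}$ to have positive probability $\epsilon$ on $z$ (the contamination point) and $1 - \epsilon$ on the distribution $F$, such that $F_\epsilon^{(z)}(x) = (1-\epsilon)F(x) + \epsilon\Delta_z(x)$, where $\Delta_z$ denotes the distribution function that puts all of its mass at the point $z$. The influence of the contamination on the estimator with functional $T$, relative to the proportion of contamination, is $[T(F_\epsilon^{(z)}) - T(F)]/\epsilon$. The influence function (Hampel, 1974) is then defined for each $z$ as
$$\mathrm{IF}(z; T, F) = \lim_{\epsilon \downarrow 0} \frac{T(F_\epsilon^{(z)}) - T(F)}{\epsilon}. \tag{3.1}$$
A convenient way to appreciate the usefulness of the influence function in studying estimators is to consider the power series expansion $T(F_\epsilon^{(z)}) = T(F) + \epsilon\,\mathrm{IF}(z; T, F) + O(\epsilon^2)$. Ignoring the error term $O(\epsilon^2)$, which is negligible for small $\epsilon$, increasing $|\mathrm{IF}(z; T, F)|$ results in increasing influence of contamination on the estimator. Consequently, the influence function provides a very useful tool in the study of robustness of estimators.
One can show that (e.g., Hampel *et al.*, 1986; Staudte & Sheather, 1990) for $Z \sim F$, the mean and variance at $F$ of the random influence function are $E_F[\mathrm{IF}(Z; T, F)] = 0$ and $\mathrm{Var}_F[\mathrm{IF}(Z; T, F)] = E_F[\mathrm{IF}^2(Z; T, F)]$. A reason for finding this last variance is that it arises in the asymptotic variance of the functional of $F_n$; that is,
$$n\,\mathrm{Var}[T(F_n)] \to \mathrm{ASV}(T, F) = E_F[\mathrm{IF}^2(Z; T, F)] \quad \text{as } n \to \infty.$$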
3.2.1 Influence function of the CV
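Let $\mathcal{M}$ and $\mathcal{V}$ denote the functionals for the usual mean and variance estimators such that, at $F$, $\mathcal{M}(F) = \mu$ and $\mathcal{V}(F) = \sigma^2$. The respective influence functions are $\mathrm{IF}(z; \mathcal{M}, F) = z - \mu$ and $\mathrm{IF}(z; \mathcal{V}, F) = (z - \mu)^2 - \sigma^2$. For convenience in notation, let $\mathrm{CV}$ also denote the functional for the CV. Groeneveld (2011) derives the influence function as
$$\mathrm{IF}(z; \mathrm{CV}, F) = \mathrm{CV}\left[\frac{(z-\mu)^2 - \sigma^2}{2\sigma^2} - \frac{z - \mu}{\mu}\right].$$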
3.2.2 Influence function of the IQR-based RCV
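The influence function of the $p$th quantile $x_p = G_p(F)$, where $G_p$ denotes the quantile functional, is well known (Staudte & Sheather, 1990, p. 59) to be $\mathrm{IF}(z; G_p, F) = \{p - I[z \le x_p]\}\,q(p)$, where $I[\cdot]$ is the indicator function and $q(p) = 1/f(x_p)$ is the quantile density of $F$ at $p$. The influence function of the ratio of two quantiles, $\rho_{p,r} = x_p/x_r$, is then found to be (Prendergast & Staudte, 2017a):
$$\mathrm{IF}(z; \rho_{p,r}, F) = \rho_{p,r}\left[\frac{\mathrm{IF}(z; G_p, F)}{x_p} - \frac{\mathrm{IF}(z; G_r, F)}{x_r}\right]. \tag{3.3}$$
It then follows that the influence function of $\mathrm{RCV}_Q$ in terms of (3.3) is
$$\mathrm{IF}(z; \mathrm{RCV}_Q, F) = 0.75\left[\mathrm{IF}(z; \rho_{0.75,0.5}, F) - \mathrm{IF}(z; \rho_{0.25,0.5}, F)\right]. \tag{3.4}$$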
3.2.3 Influence function of the MAD-based RCV
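Let $\mathrm{MAD}_s$ denote the functional for the standardized MAD. The influence function for the MAD estimator was described by Hampel (1974) and its form for the standardized MAD for the standard normal distribution is (see, e.g., page 107 of Hampel *et al.*, 1986)
$$\mathrm{IF}(z; \mathrm{MAD}_s, \Phi) = \frac{\operatorname{sign}\left(|z| - \Phi^{-1}(0.75)\right)}{4\,\Phi^{-1}(0.75)\,\phi\left(\Phi^{-1}(0.75)\right)}, \tag{3.5}$$
where $\Phi$ and $\phi$ denote the distribution and density functions of the $N(0,1)$ distribution.
It is not suitable for us to study the influence function for $\mathrm{RCV}_M$ at the standard normal model since the median is equal to zero. However, the influence function for the standardized MAD for an arbitrary mean, $\mu$, for the normal distribution is simply (3.5) shifted to be centred at $\mu$ and therefore equal to $\mathrm{IF}(z; \mathrm{MAD}_s, \Phi_\mu) = \mathrm{IF}(z - \mu; \mathrm{MAD}_s, \Phi)$, where we let $\Phi_\mu$ denote the distribution function for the $N(\mu, 1)$ distribution.
Let $R_M$ be the statistical functional for the MAD-based RCV such that $R_M(F) = \mathrm{MAD}_s(F)/m$, where $m$ is the median of $F$. Hence, using the Product Rule and the Chain Rule, the influence function for the RCVM estimator is
$$\mathrm{IF}(z; R_M, F) = \frac{\mathrm{IF}(z; \mathrm{MAD}_s, F)}{m} - \frac{\mathrm{MAD}_s(F)}{m^2}\,\mathrm{IF}(z; m, F),$$
where $\mathrm{IF}(z; m, F)$ denotes the influence function of the median.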
The general form of the influence for the MAD can be found in, for example, page 137 of Huber, (1981), page 16 of Andersen, (2008) and page 37 of Wilcox, (2011) and this will be used to plot the influence functions for the non-Gaussian examples that follow.
3.2.4 Example influence function comparisons
To compute the true value for the MAD for the distributions being considered for influence function comparisons, and also when required later, we used the R function we have provided in Section B. Readers can use this code to compute the true MAD for any distributions.
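The idea of computing a population MAD numerically can be sketched in a few lines of Python. This is not the authors' R function; it is a minimal bisection sketch, assuming a continuous distribution with known cdf and median, that solves $F(m + d) - F(m - d) = 0.5$ for $d$. The function name, bracket and tolerance are our own choices.

```python
import math


def population_mad(cdf, median, lo=0.0, hi=1e6, tol=1e-12):
    """Solve F(median + d) - F(median - d) = 0.5 for d by bisection."""
    def excess(d):
        return cdf(median + d) - cdf(median - d) - 0.5

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0:
            lo = mid  # too little mass captured: widen the window
        else:
            hi = mid  # at least half the mass captured: narrow it
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def exp_cdf(x):
    """cdf of the unit-rate exponential distribution."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


m = math.log(2.0)                      # median of EXP(1)
mad = population_mad(exp_cdf, m)       # closed form here is asinh(0.5)
rcv_m = 1.4826 * mad / m
rcv_q = 0.75 * (math.log(4.0) - math.log(4.0 / 3.0)) / m
print(round(mad, 4), round(rcv_q, 3), round(rcv_m, 3))
```

For the unit exponential the resulting values agree, to rounding, with the EXP row of Table 1.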
In Plot A of Figure 1 we plot the influence functions for the three measures. The influence functions for the two robust measures are almost identical. In fact, it is known that the influence functions for the IQR and MAD are the same for the normal distribution (see page 110 of Hampel *et al.*, 1986) so that the measures share the same robustness properties for this model. The differences in Figure 1 are due to the multiplier 0.75 for the IQR-based measure, chosen to give approximate rather than exact equivalence for the normal. However, this does not generalize to all distributions. As expected, the influence function for the CV is unbounded, meaning that outliers are expected to have uncapped influence on the estimator as they move further from the population mean. On the other hand, the influence functions for the robust measures are bounded. Extreme outliers are expected to have no more influence on the estimators than, say, those closer to the 25th and 75th percentiles. However, the discontinuities at the median and the 25th and 75th percentiles suggest that the estimators are more sensitive locally in these areas.
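The bounded versus unbounded behaviour described above can be mimicked on a finite sample with a sensitivity curve, $(n+1)\,[T(x_1,\dots,x_n,z) - T(x_1,\dots,x_n)]$, a standard empirical analog of the influence function. The sketch below uses hypothetical, evenly spaced data and our own function names; it illustrates the qualitative contrast only and does not reproduce Figure 1.

```python
import statistics


def cv(x):
    """Sample CV: standard deviation over mean."""
    return statistics.stdev(x) / statistics.mean(x)


def rcv_m(x):
    """MAD-based analog: 1.4826 * MAD / median."""
    m = statistics.median(x)
    mad = statistics.median([abs(xi - m) for xi in x])
    return 1.4826 * mad / m


def sensitivity(stat, x, z):
    """(n + 1) * (T(x with z added) - T(x)) for a list x and contamination point z."""
    return (len(x) + 1) * (stat(x + [z]) - stat(x))


# Hypothetical base sample: the 21 values 5, 6, ..., 25.
base = [float(v) for v in range(5, 26)]
points = [50.0, 1e3, 1e6]

sc_cv = [sensitivity(cv, base, z) for z in points]
sc_rcvm = [sensitivity(rcv_m, base, z) for z in points]
print("CV   :", sc_cv)    # keeps changing as z moves outward
print("RCV_M:", sc_rcvm)  # identical for every z beyond the bulk of the data
```

Once the added point lies beyond the rest of the data, its exact location no longer matters to the MAD-based measure, whereas the CV continues to respond to it.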
3.3 Asymptotic variances and standard deviations
In this section, we further compare the estimators by deriving their asymptotic variances. As discussed in Section 3.2, for an estimator with functional $T$, the asymptotic standard deviation can be found by $\sqrt{\mathrm{ASV}(T, F)}$, where $\mathrm{ASV}(T, F) = E_F[\mathrm{IF}^2(Z; T, F)]$. We now derive the ASVs for the estimators before comparing their relative asymptotic standard deviations.
3.3.1 Asymptotic Variance of the CV estimator
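Recall $\mu$ is the mean for distribution $F$ and let $\mu_k$ denote the $k$th central moment of $F$, where $\mu_2 = \sigma^2$ denotes the variance. Writing $\mathcal{M}$ and $\mathcal{V}$ for the mean and variance functionals, the influence function for the mean is $\mathrm{IF}(z; \mathcal{M}, F) = z - \mu$ and $E_F[\mathrm{IF}^2(Z; \mathcal{M}, F)] = \sigma^2$, the asymptotic variance of the mean estimator. Similarly, $\mathrm{IF}(z; \mathcal{V}, F) = (z - \mu)^2 - \sigma^2$ and $E_F[\mathrm{IF}^2(Z; \mathcal{V}, F)] = \mu_4 - \sigma^4$. Before deriving the ASV for the CV estimator, we note that $E_F[\mathrm{IF}(Z; \mathcal{M}, F)\,\mathrm{IF}(Z; \mathcal{V}, F)]$, which is the asymptotic covariance between the mean and variance estimators, is equal to $\mu_3$. Now, from (3.2),
$$\mathrm{ASV}(\mathrm{CV}, F) = \mathrm{CV}^2\left[\frac{\mu_4 - \sigma^4}{4\sigma^4} - \frac{\mu_3}{\mu\sigma^2} + \frac{\sigma^2}{\mu^2}\right], \tag{3.7}$$
assuming that the fourth moment exists.
Note that for $N(\mu, \sigma^2)$, $\mu_3 = 0$ and $\mu_4 = 3\sigma^4$, so that $\mathrm{ASV}(\mathrm{CV}, F) = \mathrm{CV}^2(1/2 + \mathrm{CV}^2)$, which is the asymptotic variance used by Miller (1991) in the construction of the asymptotic interval for the CV detailed in Section 2.1.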
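A quick numerical check of this asymptotic variance can be written as follows. The sketch assumes the delta-method form $\mathrm{ASV} = \mathrm{CV}^2[(\mu_4 - \sigma^4)/(4\sigma^4) - \mu_3/(\mu\sigma^2) + \sigma^2/\mu^2]$, which follows from the influence functions and moments quoted above; the function names and the two test distributions are our own choices.

```python
import math


def asv_cv(mu, sigma, mu3, mu4):
    """Asymptotic variance of the CV estimator from the mean, sd and central moments."""
    cv = sigma / mu
    return cv**2 * ((mu4 - sigma**4) / (4 * sigma**4)
                    - mu3 / (mu * sigma**2)
                    + cv**2)


def rasd_cv(mu, sigma, mu3, mu4):
    """Relative asymptotic standard deviation: sqrt(ASV) / CV."""
    return math.sqrt(asv_cv(mu, sigma, mu3, mu4)) / (sigma / mu)


# N(5, 1): mu3 = 0 and mu4 = 3 * sigma^4, so the ASV reduces to CV^2 (1/2 + CV^2).
normal_asv = asv_cv(5.0, 1.0, 0.0, 3.0)

# EXP(1): mean 1, sd 1, third central moment 2, fourth central moment 9.
exp_rasd = rasd_cv(1.0, 1.0, 2.0, 9.0)

print(normal_asv, exp_rasd)
```

The normal case reproduces the Miller form, and the exponential case gives a relative asymptotic standard deviation of 1, in line with the EXP entry for the CV estimator in Table 3.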
3.3.2 Asymptotic Variance of the $\mathrm{RCV}_Q$ estimator
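The asymptotic variance of the estimator of $x_p$, the $p$-th quantile, is well known to be $p(1-p)\,q^2(p)$ (e.g. Ch. 2 of David, 1981; DasGupta, 2006, Ch. 3) where, as denoted earlier, $q(p) = 1/f(x_p)$ and $f$ is the density function. This can be verified also using $E_F[\mathrm{IF}^2(Z; G_p, F)]$, where $G_p$ denotes the functional for the $p$-th quantile. Similarly, and as also found in the preceding references, the asymptotic covariance between the $p$-th and $r$-th quantile estimators is $p(1-r)\,q(p)\,q(r)$, provided $p < r$.
The asymptotic variance for $\mathrm{RCV}_Q$ is obtained by a straightforward but lengthy derivation of $E_F[\mathrm{IF}^2(Z; \mathrm{RCV}_Q, F)]$ with $\mathrm{IF}(z; \mathrm{RCV}_Q, F)$ defined in (3.4) (or by using the Delta method). After simplifying, it is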
Theorem 3.1.
The asymptotic variance for the estimator of $\mathrm{RCV}_Q$ is
[TABLE]
The proof of Theorem 3.1 is in Section A.
3.3.3 Asymptotic Variance of the $\mathrm{RCV}_M$ estimator
Falk, (1997) proves the asymptotic joint normality of the and estimators. Let be the density function associated with . If is continuous near and differentiable at , and with and , then
[TABLE]
where ‘’denotes ‘approximately distributed as for suitably large ’, is a column vector zeroes and is a two-dimensional covariance matrix with . Hence, , are the asymptotic variances of the median and MAD estimators respectively and is the asymptotic covariance between the two. They are (e.g. Falk,, 1997),
[TABLE]
[TABLE]
where and .
Using the above results and the Delta method (see e.g. DasGupta, 2006), we derived the asymptotic variance of the $\mathrm{RCV}_M$ estimator as given below,
[TABLE]
3.3.4 Relative asymptotic standard deviation comparisons
As an example, the asymptotic standard deviation (ASD) for the CV estimator is given as $\sqrt{\mathrm{ASV}(\mathrm{CV}, F)}$ and the ASDs for the other estimators are determined similarly. Later, we will construct approximate confidence intervals for the measures and therefore it makes sense that we use the ASD for comparisons here. Since the CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$ represent different values we use the relative (to the population parameter) ASD (rASD) to compare the estimators. For example, for the CV estimator this is defined to be $\sqrt{\mathrm{ASV}(\mathrm{CV}, F)}/\mathrm{CV}$.
To compare the rASD for the estimators of CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$, we have selected normal and lognormal distributions, both with varying $\sigma$, exponential and the Pareto type II distribution with varying shape. From Table 3, the rASDs for $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$ are a little higher than the rASD of CV for the normal distribution. However, the $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$ estimators compare favorably to the CV for skewed distributions such as the lognormal and Pareto. The $k$th central moment of the Pareto type II distribution exists only if the shape parameter exceeds $k$, so that the rASD for the CV estimator is undefined for shape values of at most 4 since it requires the fourth central moment. When comparing $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$, the $\mathrm{RCV}_M$ estimator is the better performer with smaller (or equal, in the case of the normal) rASD.
4 Inference
We want to compare point and interval estimators of CV, $\mathrm{RCV}_Q$ and $\mathrm{RCV}_M$. First, we introduce asymptotic Wald-type intervals using the asymptotic standard errors from earlier. With recent results highlighting very good coverage for estimators based on ratios of quantiles even for small samples (Prendergast & Staudte, 2016b, 2017a, 2017b; Arachchige *et al.*, 2019), we are confident of similarly good coverage for $\mathrm{RCV}_Q$. We also propose an asymptotic interval for $\mathrm{RCV}_M$ as well as bootstrap intervals.
We estimate the $p$th quantile $x_p$ by the Hyndman & Fan (1996) quantile estimator $\hat{x}_p$, which is a linear combination of two adjacent order statistics. It is readily available as the Type 8 quantile estimator in the R software (R Development Core Team, 2008).
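For readers working outside R, the Type 8 rule of Hyndman & Fan (1996) can be sketched in Python as below. It places the $p$th quantile at the fractional position $(n + 1/3)p + 1/3$ among the order statistics and interpolates linearly between the two neighbours; the function name and the clamping at the sample extremes are our own choices, intended to mirror what R's `quantile(x, p, type = 8)` does.

```python
import math


def quantile_type8(x, p):
    """Hyndman-Fan Type 8 sample quantile for 0 <= p <= 1."""
    xs = sorted(x)
    n = len(xs)
    h = (n + 1.0 / 3.0) * p + 1.0 / 3.0   # 1-based fractional position
    h = min(max(h, 1.0), float(n))        # clamp to the sample range
    j = math.floor(h)
    g = h - j                             # interpolation weight
    if j >= n:
        return xs[-1]
    return xs[j - 1] + g * (xs[j] - xs[j - 1])


data = [1, 2, 3, 4, 5]
q1 = quantile_type8(data, 0.25)
med = quantile_type8(data, 0.50)
q3 = quantile_type8(data, 0.75)
print(q1, med, q3)
```

With the five values 1 to 5 the position for $p = 0.25$ is $5/3$, so the lower quartile falls two thirds of the way from the first to the second order statistic.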
4.1 Asymptotic confidence intervals
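Let $z_\alpha$ denote the $\alpha$ quantile of the standard normal distribution. All our $100(1-\alpha)$% confidence intervals for measures of relative spread will be of the form:
$$\hat{\theta} \pm z_{1-\alpha/2}\,\mathrm{SE}(\hat{\theta}),$$
where $\hat{\theta}$ is the estimator of $\theta$ and $\mathrm{SE}(\hat{\theta})$ is an estimate of its standard deviation (standard error) based on the sample. The actual coverage probability of this estimator depends on how quickly the distribution of $\hat{\theta}$ approaches normality, as well as the rate of convergence of $\hat{\theta}$ to $\theta$ and of $\mathrm{SE}(\hat{\theta})$ to the standard deviation of $\hat{\theta}$.
In constructing the interval estimators for the ratios, due to improved statistical performance such as quicker convergence to normality, it is common to first construct the interval for the log-transformed ratio followed by exponentiation to return to the original ratio scale. Let $\hat{\theta}$ be an estimator of a ratio $\theta > 0$; then, using the Delta Method (e.g. Ch. 3 of DasGupta, 2006),
$$\mathrm{Var}[\ln\hat{\theta}] \approx \frac{\mathrm{Var}[\hat{\theta}]}{\theta^2}.$$
Then $\widehat{\mathrm{ASV}}(\hat{\theta})/(n\hat{\theta}^2)$, where $\widehat{\mathrm{ASV}}(\hat{\theta})$ is an estimate of the asymptotic variance, enables one to construct the confidence interval for $\ln\theta$, which is based on the asymptotic normality of $\ln\hat{\theta}$, before exponentiating to the original scale.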
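The log-transform-then-exponentiate construction can be sketched generically. The Python function below assumes the delta-method approximation $\mathrm{Var}[\ln\hat\theta] \approx \mathrm{ASV}/(n\theta^2)$ and takes the point estimate, an estimate of the asymptotic variance and the sample size as inputs; the name `log_scale_ci` and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def log_scale_ci(estimate, asv, n, level=0.95):
    """Wald interval for ln(theta), exponentiated back to the ratio scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    se_log = math.sqrt(asv / n) / estimate   # delta-method standard error of ln(estimate)
    return estimate * math.exp(-z * se_log), estimate * math.exp(z * se_log)


# Hypothetical inputs: an estimate of 0.2 with estimated ASV 0.0216 from n = 100.
lower, upper = log_scale_ci(0.2, 0.0216, 100)
print(lower, upper)
```

Because the interval is symmetric on the log scale, it is always positive and the product of its end points equals the squared point estimate.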
4.1.1 Confidence interval for CV
A $100(1-\alpha)$% confidence interval for the CV, which is based on the asymptotic normality of $\ln(\widehat{\mathrm{CV}})$ when the first four moments of $F$ exist, is
$$\exp\left\{\ln(cv) \pm z_{1-\alpha/2}\,\frac{\sqrt{\widehat{\mathrm{ASV}}(cv)}}{\sqrt{n}\,cv}\right\},$$
and later we define this confidence interval method as “Delta CV” in our simulation study. The ASV for the CV estimator is given in (3.7) and to obtain our asymptotic standard error we replace the population CV, $\sigma$ and $\mu$ with $cv$, the sample standard deviation and the sample mean respectively. To estimate $\mu_k$ (the $k$th central moment) we use $n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^k$.
4.1.2 Confidence interval for $\mathrm{RCV}_Q$
A large-sample confidence interval for is in terms of the estimate
[TABLE]
The ASV of the $\mathrm{RCV}_Q$ estimator is given in Theorem 3.1 and to obtain its estimate, one needs to replace each $x_p$ by $\hat{x}_p$ and each $q(p)$ by $\hat{q}(p)$. For $\hat{q}(p)$, we use a kernel density estimator with the Epanechnikov (1969) kernel and optimal bandwidth using the quantile optimality ratio of Prendergast & Staudte (2016a).
4.1.3 Confidence interval for $\mathrm{RCV}_M$
A large-sample confidence interval for is in terms of ,
[TABLE]
Estimation of the MAD is trivial, requiring only routine coding if functionality is not already available (i.e. it is simply the median of the ordered absolute differences of the $x_i$s from the sample median). We also need to estimate the density-based quantities appearing in (3.8), and a simple approach using readily available software is to use the FKML parameterization (Freimer *et al.*, 1988) of the Generalized Lambda Distribution (GLD). Defined in terms of its quantile function
$$Q(u) = \lambda_1 + \frac{1}{\lambda_2}\left[\frac{u^{\lambda_3} - 1}{\lambda_3} - \frac{(1-u)^{\lambda_4} - 1}{\lambda_4}\right],$$
where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are location, inverse scale and two shape parameters, the GLD can approximate a very wide range of probability distributions (e.g. Karian & Dudewicz, 2000; Dedduwakumara *et al.*, 2019). To do so we use the method of moments estimators and density and quantile functions for the GLD in the R gld package (King *et al.*, 2016). It is then simple to estimate these quantities using the quantile and density functions with the estimated GLD parameters and the estimated MAD.
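To make the FKML parameterization concrete, the quantile function can be coded directly. The following is a minimal Python sketch of the standard FKML form, with the usual logarithmic limits when a shape parameter is zero; it is not the R gld package, performs no parameter estimation, and the function name is ours.

```python
import math


def gld_fkml_quantile(u, lam1, lam2, lam3, lam4):
    """FKML GLD quantile function for 0 < u < 1 and lam2 > 0."""
    def lower_tail(lam):
        # (u^lam - 1) / lam, with limit ln(u) as lam -> 0
        return math.log(u) if lam == 0 else (u**lam - 1.0) / lam

    def upper_tail(lam):
        # ((1 - u)^lam - 1) / lam, with limit ln(1 - u) as lam -> 0
        return math.log(1.0 - u) if lam == 0 else ((1.0 - u)**lam - 1.0) / lam

    return lam1 + (lower_tail(lam3) - upper_tail(lam4)) / lam2


# With equal shape parameters the distribution is symmetric about lam1.
symmetric_median = gld_fkml_quantile(0.5, 2.0, 1.0, 0.3, 0.3)

# lam3 = lam4 = 0 gives the logistic quantile function ln(u / (1 - u)).
logistic_q75 = gld_fkml_quantile(0.75, 0.0, 1.0, 0.0, 0.0)

print(symmetric_median, logistic_q75)
```

The two checks printed here, a median equal to the location parameter in the symmetric case and the logistic special case, are convenient sanity tests for any implementation.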
Additional to the asymptotic interval above, we also consider two bootstrap confidence intervals.
Non-parametric bootstrap
A non-parametric bootstrap re-samples observations with replacement from the sample and estimates the MAD. This is repeated times, and we let denote the th estimated MAD. The lower and upper bounds for the 95% bootstrap interval are then the 0.025 and 0.975 quantiles of the estimated s.
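The percentile bootstrap just described can be sketched in a few lines of R. The function names and the default of `B = 1000` resamples are assumptions for illustration; `stats::mad` is called explicitly so that the sketch is unaffected by any user-defined function named `mad`.

```r
# Percentile bootstrap interval for a statistic of a single sample.
np_boot_ci <- function(x, stat, B = 1000, level = 0.95) {
  # Resample with replacement, recompute the statistic, repeat B times
  boot_stats <- replicate(B, stat(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Raw MAD (constant = 1 switches off the default 1.4826 scaling)
raw_mad <- function(x) stats::mad(x, constant = 1)
# MAD-based relative spread: 1.4826 * MAD / median
rcvm <- function(x) stats::mad(x) / median(x)

set.seed(123)
x <- rlnorm(200)
np_boot_ci(x, raw_mad)
np_boot_ci(x, rcvm)
```

Passing the statistic as a function lets the same routine serve for the MAD itself or for the ratio of the scaled MAD to the median.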
Parametric bootstrap
The parametric bootstrap interval is obtained in the same way as the non-parametric bootstrap with the exception that the sampling is done from a nominated, or estimated, density function. In this case, we use the estimated density from the FKML GLD as described above for the asymptotic interval. This is called the Generalized Bootstrap by Dudewicz (1992) who also uses the GLD, albeit with a different parameterization, as one example.
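One way to sketch this in R is by inverse-transform sampling from a quantile function. The FKML parameters below are hypothetical placeholders for estimated values, the function names are illustrative, and the fitting step itself is omitted.

```r
# FKML quantile function of the GLD (l2 > 0; l3, l4 != 0)
q_fkml <- function(u, l1, l2, l3, l4) {
  l1 + ((u^l3 - 1) / l3 - ((1 - u)^l4 - 1) / l4) / l2
}

# Parametric (generalized) bootstrap: draw each resample from a fitted
# quantile function instead of from the observed data.
par_boot_ci <- function(n, q_fun, stat, B = 1000, level = 0.95) {
  # Inverse-transform sampling: Q(U) with U ~ Uniform(0, 1)
  boot_stats <- replicate(B, stat(q_fun(runif(n))))
  alpha <- 1 - level
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# MAD-based relative spread: 1.4826 * MAD / median
rcvm <- function(x) stats::mad(x) / median(x)

set.seed(42)
# Hypothetical "estimated" parameters; the support here is [5, 15]
fitted_q <- function(u) q_fkml(u, l1 = 10, l2 = 1, l3 = 0.2, l4 = 0.2)
par_boot_ci(n = 100, q_fun = fitted_q, stat = rcvm)
```

Because the GLD is defined through its quantile function, sampling requires no numerical inversion, which is one practical attraction of this choice.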
4.2 Confidence intervals for comparing two relative spreads
When data from two independent groups are available, it is straightforward to obtain interval estimators for the comparison of relative spread for each group. Given that empirical evidence suggests excellent coverage can be achieved in the single sample case by using a log transformation, we propose to use the log ratio of two independent relative spread estimators with a back exponentiation to the ratio scale. For example, an interval estimator for where and are the relative MAD-based spread for independent populations, is, where for simplicity ,
[TABLE]
where and are the sample sizes for simple random samples from the populations and where the estimates and asymptotic standard errors can be found as above for the single sample setting.
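The construction can be sketched generically in R, taking the two point estimates and their standard errors as inputs. The function name `ratio_ci` and the numbers in the usage line are illustrative only; the standard errors are assumed to be on the original scale, that is, the square root of the estimated asymptotic variance divided by the sample size.

```r
# Log-ratio interval for comparing two independent relative spreads.
# est1, est2: point estimates; se1, se2: standard errors on the original scale.
ratio_ci <- function(est1, se1, est2, se2, level = 0.95) {
  log_ratio <- log(est1) - log(est2)
  # Delta method: Var(log(est)) is approximately (se / est)^2;
  # independence of the two samples lets the variances add
  se_log <- sqrt((se1 / est1)^2 + (se2 / est2)^2)
  z <- qnorm(1 - (1 - level) / 2)
  c(ratio = exp(log_ratio),
    lower = exp(log_ratio - z * se_log),
    upper = exp(log_ratio + z * se_log))
}

ratio_ci(est1 = 0.9, se1 = 0.05, est2 = 0.6, se2 = 0.04)  # illustrative numbers
```

An interval that excludes 1 would then suggest a difference in relative spread between the two populations.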
5 Simulations and Examples
5.1 Simulations
Firstly, a simulation study was conducted to compare the performance of the interval estimator of and asymptotic CV interval given in 4.1 with the methods given in Section 2.1 using coverage probability and width as performance measures. We have selected normal (N), log normal (LN), exponential (EXP), chi-square () and Pareto (PAR) distributions with different parameter choices and with sample sizes . 10,000 simulation trials were used.
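The two performance measures can be estimated with a short Monte Carlo routine such as the sketch below. The function names are hypothetical, the number of trials is reduced for speed, and the interval used for illustration is a delta-method CV interval whose variance formula is the standard textbook expression and is only assumed to agree with (3.7).

```r
# Monte Carlo coverage probability (cp) and mean width (w) of an interval
# method. ci_fun takes a sample and returns c(lower, upper).
coverage_sim <- function(rdist, ci_fun, truth, n, trials = 1000) {
  res <- replicate(trials, {
    ci <- unname(ci_fun(rdist(n)))
    c(covered = ci[1] <= truth && truth <= ci[2], width = ci[2] - ci[1])
  })
  c(cp = mean(res["covered", ]), w = mean(res["width", ]))
}

# Compact delta-method interval for the CV on the log scale (assumed formula)
delta_cv_ci <- function(x, level = 0.95) {
  n <- length(x); xbar <- mean(x); s <- sd(x); cv <- s / xbar
  skew <- mean((x - xbar)^3) / s^3
  kurt <- mean((x - xbar)^4) / s^4
  se_log <- sqrt(cv^2 * (cv^2 - cv * skew + (kurt - 1) / 4) / n) / cv
  z <- qnorm(1 - (1 - level) / 2)
  exp(log(cv) + c(-1, 1) * z * se_log)
}

set.seed(2024)
# Exponential data: the true CV is 1
coverage_sim(rdist = function(n) rexp(n, rate = 1),
             ci_fun = delta_cv_ci, truth = 1, n = 100, trials = 2000)
```

Any other interval method can be substituted for `ci_fun`, provided the true value of the corresponding measure is supplied as `truth`.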
In Table 4 we provide the simulation results for the CV and RCVQ intervals. For simplicity, the RCVM results follow in Table 5 where the bootstrap and asymptotic intervals are compared. From Table 4, the Panich, Med Mill and Gulhar interval estimators for the CV perform very well for the normal distribution, with coverage approaching the nominal level as the sample size increases. However, coverage was typically below nominal for skewed distributions, pointing to unreliable performance of these estimators. The Delta CV interval of (4.1.1) provides improved coverage that is close to nominal when the sample size increases, with the exception of the PAR(5,1) distribution for which the CV is undefined. The interval estimator for was conservative, being slightly above nominal for these simulations. The asymptotic interval for (Table 5) provides excellent coverage, even for and all distributions considered. With notably narrower intervals and very good coverage, the use of and the associated asymptotic interval estimators using estimated GLD functions is practically enticing. However, there does not appear to be a benefit in using a bootstrap approach, for which coverage was typically more conservative.
5.1.1 A Shiny web application for the performance comparisons of the intervals
For further comparisons, we have developed a Shiny (Chang *et al.*, 2017) web application that readers can use to run the simulations with different parameter choices. This can be found at https://lukeprendergast.shinyapps.io/Robust_CV/. The user can change the distribution, parameters, sample size, probability and the number of trials. Once the desired options are selected, the ‘Run Simulation’ button can be pressed and the relevant estimates, coverage probability (cp) and the average width of the confidence interval (w) will be calculated according to the input choices. In addition, the bottom right hand corner of the web page shows the time taken to run each simulation.
5.2 Examples
We have selected two data sets, the doctor visits data and the Melbourne house price data, to apply our findings to real world data.
5.2.1 Doctor visits data
We selected the doctor visits data set used in Heritier *et al.* (2009) to apply our findings to a real world problem. The doctor visits data is a subsample of 3066 individuals of the AHEAD cohort (born before 1924) for wave 6 (year 2002) from the Health and Retirement Study (HRS), which surveys more than 22,000 Americans over the age of 50 every 2 years. We split the data into two groups using gender as the grouping variable. The response variable of interest is the number of doctor visits. Table 6 provides summary statistics of the response variable for the two gender groups.
From Table 6, the summary statistics suggest that the doctor visits distributions are positively skewed, which is common for count variables. There is also a large outlier in the female group with a number of doctor visits equal to 750. We removed the outlier from the data set and recalculated the descriptive statistics for the female group, as shown in the column of Table 6 above. The mean for the female group reduces after the removal of the outlier and the summary statistics still suggest positive skew.
Our objective was to compare the relative spread of the number of doctor visits between males and females. We used CV, and to compare the relative spread of the number of doctor visits between males and females with and without an outlier.
Table 7 provides the bounds of the 95 percent confidence intervals for the three measures. The confidence interval for the CV is greatly influenced by whether or not the outlier in the female data is included. This is not the case for the intervals for the quantile-based measures. Additionally, the interval for the CV is wide compared to the intervals for and .
5.2.2 Melbourne house price data
The median is the most popular summary measure used to describe housing markets. Motivated by this, we applied our measures to Melbourne house clearance data from January 2016 which is available at https://www.kaggle.com/anthonypino/melbourne-housing-market. This data set contains suburb-wise prices for three types of houses (house, unit, townhouse). There is data for 369 suburbs and we removed suburbs with fewer than 10 houses sold, leaving 301 suburbs.
We selected three pairs of suburbs which were considered by Arachchige *et al.* (2019) to calculate the interval estimators for ratios CV, and to assess differences in relative spread of house prices.
Figure 2 shows that there are outliers for all suburbs except Kingsbury. Additionally, there are differences in spread for the house price distributions between each pair of neighboring suburbs.
Ratios of the measures are reported in Table 8 to see whether there is a difference in relative spread between suburbs. Comparing Bundoora and Kingsbury, the measures provide different insights. While the box plot suggests greater spread in Kingsbury, the ratio of CVs suggests otherwise, having been highly influenced by outliers in Bundoora. The ratios of RCVQ and RCVM suggest greater relative spread in Kingsbury, which is in better agreement with what is shown in the box plots. For Beaumaris and Black Rock, a significant difference is not found for the CVs and the interval is wide. However, the other intervals suggest a significant difference. All three measures suggest there is not a significant difference in relative spread of house price between Oakleigh and Oakleigh East, although the intervals do tend to suggest that there is for RCVQ and RCVM. Overall, the intervals are narrower for the quantile-based measures, having not been so greatly influenced by outliers.
6 Summary and discussion
We have proposed interval estimators for robust measures of relative spread as alternatives to the coefficient of variation. RCVQ, a scalar multiple of the interquartile range divided by the median, is simple and the associated confidence intervals have very good coverage over a diverse range of distribution types. Similarly, the intervals for RCVM, where the MAD is used instead of the interquartile range, also have excellent coverage, and its estimator typically has smaller variability than the estimator for RCVQ, making it a preferred candidate to be used instead of the CV. While we also considered bootstrap interval estimators for RCVM, the asymptotic Wald-type interval based on the approximate variances of, and covariance between, the MAD and median achieved excellent coverage even for sample sizes as small as 50. These robust intervals compare very favorably to those for the CV, where coverage is typically poor when the data are not sampled from a normal distribution. Our examples highlighted that they can provide very different insights into relative spread when compared to the CV, and the use of quantile-based measures is more easily justified when data are skewed, due to the difficulty of interpreting the mean and variance.
Appendix A Proof of Theorem 3.1
Recall and in (3.3) and (3.4) respectively. For simplicity let , , and . Then
[TABLE]
It can be shown,
[TABLE]
Similarly,
[TABLE]
and
[TABLE]
Substituting the above (A.2), (A.3),(A) in (A.1) and using gives
[TABLE]
Appendix B Computing the true MAD
Computing the true value of the MAD is not a trivial task. We provide an R function below that can be used to compute the true value of the MAD for a user-specified distribution.
mad <- function(dist, param){
  # Computes the true value of the MAD for a specific
  # distribution with desired parameter choices.
  # Args:
  #   dist: The distribution whose MAD
  #         is to be calculated.
  #   param: The parameter choices of the selected
  #          distribution whose MAD is to be calculated.
  # Returns:
  #   The true value of the MAD for a specific
  #   distribution with desired parameter choices.
  qf <- paste0("q", dist)
  m <- do.call(qf, c(p = 0.5, param)) # find median
  abs.x.m <- function(x, dist, param, m){
    df <- paste0("d", dist)
    do.call(df, c(x = x + m, param)) +
      do.call(df, c(x = - x + m, param))
  }
  abs.x.m.vec <- Vectorize(abs.x.m, "x")
  f <- function(x, dist, param, m){
    integrate(abs.x.m.vec, lower = 0, upper = x, dist = dist,
              param = param, m = m)$value - 0.5
  }
  upper <- abs(do.call(qf, c(p = 0.75, param)) + m)
  uniroot(f, interval = c(0, upper), dist = dist,
          param = param, m = m)$root
}
mad("lnorm", list(meanlog=0, sdlog=1))
mad("exp", list(rate=1))
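As a cross-check on the function above, the same quantity can be obtained from the distribution function directly, since the true MAD is the value d solving F(m + d) - F(m - d) = 0.5, where m is the median. The sketch below takes this alternative route; the name `true_mad_cdf` and the choice of the interquartile range as the upper end of the search interval are assumptions of this sketch and are not part of the original function.

```r
# Alternative sketch: solve F(m + d) - F(m - d) = 0.5 using the CDF,
# avoiding numerical integration of the density.
true_mad_cdf <- function(dist, param) {
  qf <- paste0("q", dist)
  pf <- paste0("p", dist)
  m <- do.call(qf, c(list(p = 0.5), param))  # population median
  g <- function(d) {
    do.call(pf, c(list(q = m + d), param)) -
      do.call(pf, c(list(q = m - d), param)) - 0.5
  }
  # [m - IQR, m + IQR] contains both quartiles, so g(IQR) > 0 and the
  # interval (0, IQR) brackets the root for a continuous distribution
  iqr <- do.call(qf, c(list(p = 0.75), param)) -
    do.call(qf, c(list(p = 0.25), param))
  uniroot(g, interval = c(0, iqr), tol = 1e-10)$root
}

# Closed forms for comparison: asinh(0.5) for Exp(1), qnorm(0.75) for N(0, 1)
all.equal(true_mad_cdf("exp", list(rate = 1)), asinh(0.5), tolerance = 1e-6)
all.equal(true_mad_cdf("norm", list(mean = 0, sd = 1)), qnorm(0.75),
          tolerance = 1e-6)
```

Because the function above is itself named `mad`, sourcing it masks `stats::mad` in the session; calling `stats::mad` explicitly avoids any ambiguity when sample estimates are also needed.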
References
- Andersen, R. 2008. Modern methods for robust regression. Sage.
- Arachchige, C. N. P. G., Cairns, M., & Prendergast, L. A. 2019. Interval estimators for ratios of independent quantiles and interquantile ranges. Commun. Stat. B-Simul. (accepted, June).
- Atkinson, A. B. 1970. On the measurement of inequality. J. Econ. Theor., 2 (3), 244–263.
- Bonett, D. G. 2006. Confidence interval for a coefficient of quartile variation. Comput. Stat. Data An., 50 (11), 2953–2957.
- Bonett, D. G., & Seier, E. 2005. Confidence interval for a coefficient of dispersion in nonnormal distributions. Biometrical J., 47 (1), 144–148.
- Bulent, A., & Hamza, G. 2018. Bootstrap confidence intervals for the coefficient of quartile variation. Commun. Stat. B-Simul., In Press, 1–9.
- Chang, W., Cheng, J., Allaire, J. J., Xie, Y., & McPherson, J. 2017. shiny: Web application framework for R. R package version 1.0.5.
- Chen, J., & Fleisher, B. M. 1996. Regional income inequality and economic growth in China. J. Comp. Econ., 22 (2), 141–164.
