On approximation of the distribution for Pearson statistic
Nikolai Dokuchaev

TL;DR
This paper proposes approximating the distribution of the Pearson goodness-of-fit statistic using a Gamma distribution with parameters estimated from the first two moments, simplifying quantile calculations especially for small samples.
Contribution
It introduces a novel method to approximate the Pearson statistic distribution with a Gamma distribution based on moment estimation, improving small-sample quantile calculations.
Findings
Gamma approximation aligns well with empirical distributions
Simplifies quantile computation for small samples
Validated through simulation experiments
Abstract
The paper considers the classical Goodness of Fit test. It suggests using the Gamma distribution to approximate the distribution of the Pearson statistic with unknown parameters estimated from raw data. The parameters of this Gamma distribution can be estimated from the first two moments of the statistic after averaging over a distribution of the unknown parameter over its range. This simplifies the calculation of the quantiles for the Pearson statistic, as is shown in some simulation experiments with medium and small sample sizes.
| Quantiles | 0.75 | 0.9 | 0.95 | 0.99 |
|---|---|---|---|---|
| For | 1.801390 | 3.052967 | 4.146487 | 6.279877 |

| Quantiles | 0.75 | 0.9 | 0.95 | 0.99 |
|---|---|---|---|---|
| For the Pearson statistic | 1.787257 | 3.111514 | 4.296282 | 6.762272 |
| For the fitted Gamma distribution | 1.831157 | 3.204561 | 4.262158 | 6.760412 |
Keywords: goodness of fit test, Pearson statistic, probability distributions
MSC classification: 62F03, 62G05, 62G10.
The author is with the School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, GPO Box U1987, Perth, Western Australia, 6845.
I Introduction
The classical statistical Goodness of Fit test addresses the problem of fitting a parametric family of distributions to observed data, where the family has an unknown $m$-dimensional parameter that has to be estimated from the data. The Pearson statistic is commonly used to estimate the error; see, e.g., the literature review in [1, 4, 5]. Let $k$ be the number of intervals in which the observations are counted in the Pearson statistic. The limit distribution of this statistic for infinitely increasing sample size is known, given some mild conditions; see, e.g., [1, 2, 4, 5]. The quantiles of this limit distribution are often used as the critical values for the test. If the parameter is fitted from the raw (ungrouped) data using a consistent estimator, then the limit distribution is different; see, e.g., [1], p. 24. The actual distribution of the statistic for finite samples is discrete and depends on the choice of the counting intervals and other parameters of the experiment.
This short paper suggests using the Gamma distribution for a simplified approximation of the distribution of the Pearson statistic with small and medium sample sizes. The parameters of this Gamma distribution can be estimated from the first two moments of the sample distribution of the simulated Pearson statistic with parameter values randomized over a domain for the unknown true parameter. Some computer experiments with medium and small sample sizes show that this helps to reduce the bias in the calculation of the quantiles for the Pearson statistic.
II Problem setting
Let $\{F(\cdot\,|\,\theta)\}_{\theta\in D}$ be a given family of distributions, where $\theta$ is an $m$-dimensional parameter, and where $D\subset\mathbb{R}^m$ is a domain.
Assume that we are testing a hypothesis about a population distribution for a given independent and identically distributed sample $X=(X_1,\dots,X_n)$ from the distribution $F(\cdot\,|\,\theta)$, where $\theta\in D$. Let $\hat\theta$ be the estimate of $\theta$ obtained using a consistent estimator $\hat\theta=\Theta(X)$, where $\Theta:\mathbb{R}^n\to D$ is a mapping. We assume that $\hat\theta$ is observable.
For a given integer $k$, consider a system of mutually disjoint intervals $I_1,\dots,I_k$ such that $\bigcup_{i=1}^{k}I_i=\mathbb{R}$ (two of these intervals are semi-infinite). Let $N_1,\dots,N_k$ be the observed sample counts in the intervals $I_1,\dots,I_k$, calculated from the sample $X$. In particular, we have that
$$\sum_{i=1}^{k}N_i=n.$$
The values $N_i$ are supposed to be observable.
Consider two hypotheses:
$H_0$: The sample comes from the distribution $F(\cdot\,|\,\theta)$.
$H_1$: The sample does not come from this distribution.
A hypothesis has to be accepted or rejected based on the observed $\hat\theta$ and the observed counts $N_1,\dots,N_k$, given that the family of distributions $\{F(\cdot\,|\,\theta)\}_{\theta\in D}$ and the domain $D$ of possible values of $\theta$ are known.
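Let $p_i(\theta)$ denote the probability of the $i$-th counting interval under the distribution with parameter $\theta$, $i=1,\dots,k$, where $k$ is the number of intervals.
Let us consider Pearson's statistic
$$X^2=\sum_{i=1}^{k}\frac{\bigl(N_i-n\,p_i(\hat\theta)\bigr)^2}{n\,p_i(\hat\theta)},$$
where $N_i$ are the observed counts, $n$ is the sample size, and $\hat\theta$ is the estimate of the parameter.
If the computed value of $X^2$ is large, then we reject the hypothesis $H_0$ that the sample comes from the fitted family. In this case, the observed and expected values are not close and the model is a poor fit to the data.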
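As a minimal sketch of how such a statistic can be computed, the Python snippet below evaluates Pearson's statistic (the sum over the cells of the squared difference between observed and expected counts, divided by the expected count) for an exponential model. The exponential family with density rate * exp(-rate * x), the function names `exp_cdf` and `pearson_statistic`, the cell boundaries, and the data are illustrative assumptions and are not taken from the paper.

```python
import math

def exp_cdf(x, rate):
    """CDF of the exponential distribution with density rate * exp(-rate * x)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def pearson_statistic(sample, inner_edges, rate_hat):
    """Pearson statistic for the cells (-inf, a_1], (a_1, a_2], ..., (a_{k-1}, +inf).

    inner_edges holds the increasing positive boundaries a_1 < ... < a_{k-1};
    rate_hat is the fitted rate used for the expected counts.
    """
    n = len(sample)
    k = len(inner_edges) + 1
    counts = [0] * k
    for x in sample:
        # The cell index equals the number of boundaries strictly below x.
        counts[sum(1 for a in inner_edges if x > a)] += 1
    cdf_vals = [0.0] + [exp_cdf(a, rate_hat) for a in inner_edges] + [1.0]
    stat = 0.0
    for i in range(k):
        expected = n * (cdf_vals[i + 1] - cdf_vals[i])
        stat += (counts[i] - expected) ** 2 / expected
    return stat

# Hypothetical data: the rate is fitted from the raw (ungrouped) sample.
sample = [0.1, 0.4, 0.7, 1.2, 2.5, 0.3, 0.9, 1.8]
rate_hat = len(sample) / sum(sample)  # maximum likelihood estimate of the rate
print(pearson_statistic(sample, [0.5, 1.5], rate_hat))
```

A value near zero means that the observed counts are close to the expected ones; the critical value against which the statistic is compared is the quantile discussed in the rest of the paper.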
If the parameter is fitted from the grouped data using a consistent estimator based on counting in the intervals, then the limit distribution is a known $\chi^2$-distribution, given some mild conditions; see, e.g., [1, 2, 4, 5]. If the parameter is fitted from the raw (ungrouped) data using a consistent estimator, then the limit distribution is that of
$$\sum_{j=1}^{k-m-1}\xi_j^2+\sum_{j=k-m}^{k-1}\lambda_j\xi_j^2,$$
where $k$ is the number of intervals, $m$ is the dimension of the parameter, $\xi_j$ are independent standard normal variables, and the coefficients $\lambda_j$ lie between 0 and 1 (Chernoff and Lehmann [3]; see also [1], p. 24). However, the values $\lambda_j$ depend on the intervals, on the population distribution, and on the estimator.
The distribution of Pearson's statistic $X^2$ is discrete and depends on the choice of $\theta$. It is known that, under some mild assumptions,
$$X^2\to\chi^2_{k-m-1}\quad\text{in distribution as } n\to\infty$$
(see, e.g., [2]). The limit distribution here is the $\chi^2$-distribution with $k-m-1$ degrees of freedom, where $k$ is the number of intervals and $m$ is the dimension of the parameter. In the literature, $X^2$ is called the chi-square statistic or Pearson's statistic. This limit distribution does not depend on $\theta$, and it is the standard approximation of the distribution of $X^2$ for large $n$. However, the actual distribution of $X^2$ is not easy to describe for a given finite $n$. Therefore, it is not easy to calculate the quantiles used for the hypothesis testing. On the other hand, some numerical examples given below show that the use of quantiles of the $\chi^2$-distribution as a substitute for quantiles of $X^2$ could lead to a significant bias for the critical values.
There are several known approaches to deal with this bias; see, e.g., [4]. We suggest one more approach that seems to provide a reasonably close approximation for the distribution of the test statistic with medium and small sample sizes.
III Approximation by Gamma distribution
In some numerical experiments, we have found that the Gamma distribution $\mathrm{Gamma}(\alpha,\beta)$ with shape $\alpha>0$, scale $\beta>0$, and the density $f(x)=\frac{x^{\alpha-1}e^{-x/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}}$, $x>0$, can be effectively used as a close approximation of the distribution of the Pearson statistic.
For the distribution $\mathrm{Gamma}(\alpha,\beta)$, the expectation is $\alpha\beta$, and the variance is $\alpha\beta^2$.
Technically, the distribution of the Pearson statistic, as well as the parameters $\alpha$ and $\beta$, depend on the choice of the true parameter $\theta$. However, we need an approximation that does not use $\theta$. Therefore, we suggest estimating these parameters by matching them with the first two moments of a sample of random values simulated under the compound distribution with a random $\theta$, given some preselected probability distribution for $\theta$. This removes the dependence of $(\alpha,\beta)$ on the true parameter $\theta$. For example, one can use a non-informative uniform distribution over a bounded domain containing the true parameter.
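Let $\bar\mu$ and $\bar\sigma^2$ be the sample mean and the sample variance, respectively, of the Pearson statistic over the Monte-Carlo trials.
The procedure for fitting the shape $\alpha$ and the scale $\beta$ of the Gamma distribution
- (i) Run Monte-Carlo simulations of the parameter $\theta$. For each simulated $\theta$, simulate an i.i.d. sample with the terms distributed under the distribution with this parameter.
- (ii) Calculate the Pearson statistic for each simulation of $\theta$.
- (iii) Calculate $\bar\mu$ and $\bar\sigma^2$.
- (iv) Find $\alpha$ and $\beta$ such that
$$\alpha\beta=\bar\mu,\qquad \alpha\beta^2=\bar\sigma^2,$$
i.e.
$$\alpha=\frac{\bar\mu^2}{\bar\sigma^2},\qquad \beta=\frac{\bar\sigma^2}{\bar\mu}.$$
- (v) Use quantiles of the Gamma distribution with these $\alpha$ and $\beta$ as approximations for quantiles of the Pearson statistic.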
It seems that this approach achieves a significant reduction of the bias for the quantiles for the sample sizes considered in the numerical examples.
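The steps of the procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation (the paper reports using R): the family is taken to be exponential with density rate * exp(-rate * x), the rate is drawn uniformly from a hypothetical range, the cell boundaries, sample size, and number of trials are arbitrary, and the Gamma distribution is parameterized by shape and scale. The Gamma quantile is obtained from a power series for the regularized incomplete gamma function plus bisection, only to keep the sketch free of external libraries.

```python
import math
import random
import statistics

def pearson_statistic(sample, inner_edges):
    """Pearson statistic for an exponential model whose rate is fitted from raw data."""
    n = len(sample)
    rate_hat = n / sum(sample)  # maximum likelihood estimate of the rate
    counts = [0] * (len(inner_edges) + 1)
    for x in sample:
        counts[sum(1 for a in inner_edges if x > a)] += 1
    # Fitted CDF at the cell boundaries; the two outer cells are semi-infinite.
    cdf = [0.0] + [1.0 - math.exp(-rate_hat * a) for a in inner_edges] + [1.0]
    return sum((c - n * (hi - lo)) ** 2 / (n * (hi - lo))
               for c, lo, hi in zip(counts, cdf, cdf[1:]))

def simulate_statistics(n, inner_edges, rate_range, trials, rng):
    """Steps (i)-(ii): draw the rate uniformly, draw a sample, compute the statistic."""
    values = []
    for _ in range(trials):
        rate = rng.uniform(*rate_range)
        sample = [rng.expovariate(rate) for _ in range(n)]
        values.append(pearson_statistic(sample, inner_edges))
    return values

def fit_gamma_by_moments(values):
    """Steps (iii)-(iv): solve shape * scale = mean and shape * scale**2 = variance."""
    mean = statistics.fmean(values)
    var = statistics.variance(values)
    return mean * mean / var, var / mean  # (shape, scale)

def gamma_cdf(x, shape, scale):
    """Regularized lower incomplete gamma function, evaluated by its power series."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    j = 0
    while term > 1e-16 * total:
        j += 1
        term *= z / (shape + j)
        total += term
    return total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))

def gamma_quantile(p, shape, scale):
    """Step (v): invert the Gamma CDF by bisection."""
    lo, hi = 0.0, shape * scale
    while gamma_cdf(hi, shape, scale) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
stats = simulate_statistics(n=20, inner_edges=[0.3, 0.7, 1.4],
                            rate_range=(0.5, 2.0), trials=2000, rng=rng)
shape, scale = fit_gamma_by_moments(stats)
for p in (0.75, 0.9, 0.95, 0.99):
    print(p, gamma_quantile(p, shape, scale))
```

In R, the quantile step could instead use the built-in `qgamma` with its `shape` and `scale` arguments; the hand-written inversion above only stands in for such a library call.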
IV Numerical examples
Let us illustrate the difference between the limit distribution and the actual distribution of the Pearson statistic using the following numerical example.
This would correspond to the setting with and . .
We run Monte-Carlo experiments with the sample size for . We run these experiments for four cases with different sets of parameters. These cases are listed below.
Case A:
For this case, we simulated for the sample from exponential distribution , i.e. with the density . This corresponds to the case of non-random . We have used , and we have used the estimate , where . This is a maximum likelihood estimate as well as the estimate implied by the method of moments. It is known that this estimate is consistent.
Case A(i): The sample size for the underlying process is , the number of intervals is , and , , . The numbers are such that . This choice corresponds to the most basic case of equal probabilities for the intervals. For this case, we found in the experiments with Monte-Carlo trials that and .
As can be seen, these values are quite far from the expectation and the variance of the limit distribution; these parameters are and respectively.
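To make the equal-probability choice of intervals concrete, the following sketch computes cell boundaries that give every cell the same probability under an exponential distribution, and recalls the first two moments of a chi-square distribution for comparison with simulated moments. The rate parameterization (density rate * exp(-rate * x)), the function names, and the numbers used are illustrative assumptions; the values used in the paper are not reproduced here.

```python
import math

def equiprobable_edges(k, rate):
    """Inner boundaries a_1 < ... < a_{k-1} such that the exponential CDF
    satisfies F(a_i) = i / k, so that each of the k cells has probability 1 / k."""
    return [-math.log(1.0 - i / k) / rate for i in range(1, k)]

def chi2_mean_and_variance(df):
    """A chi-square distribution with df degrees of freedom has mean df and variance 2 * df."""
    return df, 2 * df

edges = equiprobable_edges(4, 1.0)  # hypothetical: 4 cells, rate 1
print(edges)
print(chi2_mean_and_variance(2))
```

Comparing the output of `chi2_mean_and_variance` with the sample mean and variance of simulated statistics gives a quick check of how far a finite-sample distribution is from its limit.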
Case A(ii): The sample size for the underlying process is ; the remaining parameters are the same as for Case A(i). For Case A(ii), we found in the experiments with Monte-Carlo trials that and .
Table I shows sample quantiles for the Pearson statistic for Cases A(i)-(ii). This example shows that the use of quantiles of the limit distribution as a substitute for quantiles of the Pearson statistic for finite samples could lead to a bias. Approximation by the Gamma distribution helps to reduce the bias, as is shown in the examples described below.
We have also considered cases where the parameters of the Gamma distribution have been fitted to the sample simulated according to the procedure described above.
Case B: For this case, we consider the family of exponential distributions with the density . We assumed that . For step (i) of this procedure, we have used uniformly distributed on the domain . Further, the sample size for the underlying process for this case is , the number of intervals is , and , , and .
For this case, we have , , and the corresponding parameters for are
[TABLE]
Table II(i) shows quantiles for the Pearson statistic, for the fitted Gamma distribution, and for the limit distribution.
Case C: For this case, we consider , , and . All other parameters are the same as for Case B.
For this case, we have , , and the corresponding parameters for are
[TABLE]
Table II(ii) shows quantiles for the Pearson statistic, for the fitted Gamma distribution, and for the limit distribution.
Case D: For this case, we consider . All other parameters are the same as for Case C.
For this case, we have , , and the corresponding parameters for are
[TABLE]
Table II(iii) shows quantiles for the Pearson statistic, for the fitted Gamma distribution, and for the limit distribution.
Case E: For this case, we consider the family of normal distributions with and and . The random parameter is a random vector with independent components distributed uniformly on and respectively. We used such that is the sample mean of and is the sample variance of . The number of intervals is , and the intervals are , , , and .
For this case, we have , , and the corresponding parameters for are
[TABLE]
Table II(iv) shows quantiles for the Pearson statistic and for the fitted Gamma distribution.
Case F: For this case, we consider the family of normal distributions with and and . The random parameter is a random vector with independent components distributed uniformly on and respectively. The intervals and the estimators are the same as in Case E. , , , and .
For this case, we have , , and the corresponding parameters for are
[TABLE]
Table II(v) shows quantiles for the Pearson statistic and for the fitted Gamma distribution.
Figures 1-2 show smoothed histograms for the Pearson statistic and for the fitted Gamma distribution for Cases D, E, and F, respectively, constructed from the histograms for Monte-Carlo samples of the size using the standard command density in the R programming language. These figures demonstrate a quite close approximation.
We have used the R programming language for calculations; calculation of for and takes less than a minute on a standard desktop computer. For and , it takes about 10 minutes.
V Conclusion
The paper suggests approximating the distribution of the Pearson statistic by Gamma distributions with parameters fitted to simulated statistics for a given configuration of cells where the sample occurrences are being counted. The feasibility of this approach is demonstrated with some numerical experiments. So far, the range of the parameters for these experiments was quite limited. It would be interesting to extend these experiments to more general choices of the parameters, especially and . We leave this for future research.
References
- [1] Balakrishnan, N., Voinov, V., Nikulin, M. S. (2013). Chi-Squared Goodness of Fit Tests With Applications. Academic Press.
- [2] Birch, M. W. (1964). A New Proof of the Pearson-Fisher Theorem. Ann. Math. Statist. 35, No. 2, 817-824.
- [3] Chernoff, H., Lehmann, E. L. (1954). The use of maximum likelihood estimates in tests for goodness of fit. The Annals of Mathematical Statistics 25, 579-589.
- [4] Greenwood, C., Nikulin, M. S. (1996). A guide to chi-squared testing. New York: Wiley.
- [5] Plackett, R. L. (1983). Karl Pearson and the Chi-Squared Test. International Statistical Review 51, 59-72.
