Goodness of fit statistics for sparse contingency tables
Audrey Finkler (IRMA)

TL;DR
This paper proposes simple corrections for goodness of fit statistics such as Pearson's chi-square and the G statistic in sparse contingency tables, improving their accuracy and applicability in fields like genetics and epidemiology.
Contribution
It introduces corrections for classical goodness of fit tests to handle sparse multinomial data, extending existing procedures and maintaining asymptotic properties.
Findings
The corrected statistics have the same asymptotic distribution as the original ones.
Monte Carlo simulations on a toy example show improved type I and type II error rates.
The corrected statistics are applied to real epidemiologic and ecological data.
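To make the Monte Carlo finding concrete, the following is a minimal sketch of how a type I error rate could be estimated by simulation for the classical, uncorrected Pearson statistic on a sparse two-way table. It does not implement the paper's corrected statistics, and the 2 x 3 design, the sample size and the function names (`pearson_q`, `simulate_table`, `estimate_type_i_error`) are hypothetical choices made for illustration, not the paper's toy example. The critical value 5.991 is the upper 5% point of the chi-square distribution with 2 degrees of freedom.

```python
import random

# Upper 5% point of the chi-square distribution with 2 degrees of freedom,
# i.e. (2 - 1) * (3 - 1) for a 2 x 3 table.
CRITICAL_5PCT_DF2 = 5.991


def pearson_q(table):
    """Classical (uncorrected) Pearson chi-square statistic for independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    q = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Assumption of this sketch: a cell lying in an empty row or
            # column (expected count 0) is skipped instead of dividing by 0.
            if expected > 0:
                q += (observed - expected) ** 2 / expected
    return q


def simulate_table(n, row_probs, col_probs, rng):
    """Draw n observations with independent row and column labels."""
    table = [[0] * len(col_probs) for _ in row_probs]
    rows = rng.choices(range(len(row_probs)), weights=row_probs, k=n)
    cols = rng.choices(range(len(col_probs)), weights=col_probs, k=n)
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table


def estimate_type_i_error(n, row_probs, col_probs, replications, seed=0):
    """Share of simulated independent tables rejected at the nominal 5% level."""
    rng = random.Random(seed)
    rejections = sum(
        pearson_q(simulate_table(n, row_probs, col_probs, rng)) > CRITICAL_5PCT_DF2
        for _ in range(replications)
    )
    return rejections / replications


if __name__ == "__main__":
    # Hypothetical sparse design: 20 observations, one rare column category.
    rate = estimate_type_i_error(20, [0.5, 0.5], [0.8, 0.15, 0.05], 5000)
    print(f"estimated type I error at nominal 5%: {rate:.3f}")
```

Comparing the estimated rate with the nominal 5% level indicates how far the chi-square approximation drifts when expected counts are small. The same loop could be rerun with any corrected statistic in place of `pearson_q`.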
Abstract
Statistical data are often analyzed as a contingency table, sometimes with empty cells called zeros. Such sparse tables can be due to scarce observations classified in numerous categories, as for example in genetic association studies. In that case, classical independence tests involving Pearson's chi-square statistic Q or Kullback's minimum discrimination information statistic G cannot be applied because some of the expected frequencies are too small. More generally, we consider goodness of fit tests with composite hypotheses for sparse multinomial vectors and suggest simple corrections for Q and G that improve and generalize known procedures such as Ku's. We show that the corrected statistics share the same asymptotic distribution as the initial statistics. We produce Monte Carlo estimations for the type I and type II errors on a toy example. Finally, we apply the corrected statistics to…
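For readers unfamiliar with the two statistics named in the abstract, here is a minimal sketch of the classical, uncorrected Q and G for a test of independence in a two-way table, together with a count of cells whose expected frequency is small. It does not reproduce the corrections proposed in the paper or Ku's procedure. The function names, the example table and the threshold of 5 (a common rule of thumb) are assumptions made for illustration, and the sketch assumes every row and every column contains at least one observation.

```python
import math


def expected_under_independence(table):
    """Expected counts (row total * column total / n) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_q(table):
    """Pearson's chi-square statistic Q = sum (O - E)^2 / E."""
    expected = expected_under_independence(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def g_statistic(table):
    """G = 2 * sum O * log(O / E), using the convention 0 * log(0) = 0."""
    expected = expected_under_independence(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # empty cells contribute nothing to G
    )


def small_expected_cells(table, threshold=5.0):
    """Number of cells whose expected count falls below the threshold."""
    expected = expected_under_independence(table)
    return sum(e < threshold for row in expected for e in row)


if __name__ == "__main__":
    sparse = [[10, 0, 2], [3, 1, 0]]  # hypothetical table with two zeros
    print("Q =", pearson_q(sparse))
    print("G =", g_statistic(sparse))
    print("cells with expected count < 5:", small_expected_cells(sparse))
```

In the example table, most cells have an expected count well below 5, which is the situation in which the chi-square approximation for Q and G becomes unreliable.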
Taxonomy
Topics: Sensory Analysis and Statistical Methods · Data Management and Algorithms · Bayesian Modeling and Causal Inference
