Goodness of fit statistics for sparse contingency tables
Audrey Finkler (IRMA)

TL;DR
This paper introduces corrected goodness of fit statistics for sparse contingency tables. The corrections improve on classical tests such as Pearson's chi-square statistic Q and Kullback's G, especially when many cells are empty, and are evaluated through Monte Carlo simulations and applications to real data.
Contribution
The paper proposes simple corrections to the Q and G statistics that extend their applicability to sparse multinomial data while preserving their asymptotic distribution.
Findings
Corrected statistics perform better with sparse data.
Asymptotic distribution remains unchanged after correction.
Effective in epidemiologic and ecological data analysis.
Abstract
Statistical data is often analyzed as a contingency table, sometimes with empty cells called zeros. Such sparse tables can be due to scarce observations classified in numerous categories, as, for example, in genetic association studies. Thus, classical independence tests involving Pearson's chi-square statistic Q or Kullback's minimum discrimination information statistic G cannot be applied because some of the expected frequencies are too small. More generally, we consider goodness of fit tests with composite hypotheses for sparse multinomial vectors and suggest simple corrections for Q and G that improve and generalize known procedures such as Ku's. We show that the corrected statistics share the same asymptotic distribution as the initial statistics. We produce Monte Carlo estimations for the type I and type II errors on a toy example. Finally, we apply the corrected statistics to…
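To make the sparsity problem concrete, here is a minimal sketch that computes the classical, uncorrected Q and G statistics for an independence test on a small two-way table with many zeros. It does not implement the paper's corrections, which this summary does not spell out. The table values and the function names are illustrative assumptions, and G is taken here as 2 Σ O ln(O/E) with the usual convention that empty cells contribute zero.

```python
import numpy as np

def expected_under_independence(table):
    """Expected counts E_ij = (row_i total * col_j total) / n."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)   # shape (r, 1)
    col = table.sum(axis=0, keepdims=True)   # shape (1, c)
    return row @ col / n

def pearson_q(table):
    """Classical Pearson chi-square statistic Q = sum (O - E)^2 / E."""
    table = np.asarray(table, dtype=float)
    e = expected_under_independence(table)
    return float(((table - e) ** 2 / e).sum())

def g_statistic(table):
    """Classical G = 2 * sum O * ln(O / E); zero cells contribute 0."""
    table = np.asarray(table, dtype=float)
    e = expected_under_independence(table)
    mask = table > 0                          # skip empty cells: 0 * ln(0) := 0
    return float(2.0 * (table[mask] * np.log(table[mask] / e[mask])).sum())

# Hypothetical sparse table: 10 observations spread over 12 cells.
sparse_table = np.array([[3, 0, 1, 0],
                         [0, 2, 0, 1],
                         [1, 0, 0, 2]])

expected = expected_under_independence(sparse_table)
n_small = int((expected < 5).sum())           # cells below the common "E >= 5" rule of thumb

print("Q =", pearson_q(sparse_table))
print("G =", g_statistic(sparse_table))
print("cells with expected count < 5:", n_small, "of", expected.size)
```

In this toy table every expected count falls below 5, which is the situation in which the chi-square approximation for the uncorrected Q and G is usually considered unreliable.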
