General hypergeometric distribution: A basic statistical distribution for the number of overlapped elements in multiple subsets drawn from a finite population
Xing-gang Mao, Xiao-yan Xue

TL;DR
This paper introduces the general hypergeometric distribution (GHGD), which extends the classical hypergeometric distribution from two subsets to multiple subsets, provides algorithms and formulas for its statistics, and applies it to gene set inference.
Contribution
The paper develops algorithms and formulas for GHGD, a novel distribution for multiple subset overlaps, and establishes a statistical framework for gene set analysis.
Findings
Derived formulas for expectation, variance, and moments of GHGD.
Developed algorithms to compute GHGD probabilities.
Applied Chebyshev's inequalities for gene set inference.
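As a rough illustration of how a Chebyshev-type bound could support inference on an observed overlap, the sketch below handles only the special case of elements shared by all T subsets. It assumes the subsets are drawn independently and uniformly without replacement. The mean and variance come from a standard indicator-variable argument, not from the paper's own formulas, which are not reproduced on this page. The function names are hypothetical, and the one-sided Chebyshev (Cantelli) inequality is used as one possible choice of bound.

```python
from math import prod


def full_overlap_mean_var(N, sizes):
    """Mean and variance of the number of elements present in ALL subsets.

    Assumption (not from the paper): each subset of size m is drawn
    independently and uniformly without replacement from N >= 2 elements.
    """
    # P(a given element lies in every subset)
    p = prod(m / N for m in sizes)
    # P(two given distinct elements both lie in every subset)
    q = prod(m * (m - 1) / (N * (N - 1)) for m in sizes)
    mean = N * p
    # Var(X) = E[X] + N(N-1) * q - E[X]^2 for a sum of N indicators
    var = N * p + N * (N - 1) * q - mean ** 2
    return mean, var


def cantelli_upper_tail(mean, var, x_obs):
    """One-sided Chebyshev (Cantelli) bound on P(X >= x_obs)."""
    k = x_obs - mean
    if k <= 0:
        return 1.0  # the bound is uninformative at or below the mean
    return var / (var + k * k)


if __name__ == "__main__":
    mean, var = full_overlap_mean_var(20, [8, 5])
    print(mean, var, cantelli_upper_tail(mean, var, 5))
```

Such a bound is conservative: it is distribution-free, so it is much looser than an exact tail probability, but it needs only the first two moments.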
Abstract
The general hypergeometric distribution (GHGD) describes the following problem: from a finite space containing N elements, select T subsets, with subset i containing M[i] elements (0 <= i <= T-1). What is the probability that exactly x elements are overlapped exactly t times or at least t times (X_{LO=t} or X_{LO>=t}, 0 <= t <= T, where LO is the level of overlap)? The classical hypergeometric distribution (HGD) describes the case of two subsets, whereas the general case has not been resolved, even though overlapping elements have been visualized with the Venn diagram method for about 140 years. GHGD describes not only the distribution of the number of elements overlapped in all of the subsets (LO = T), but also the distribution of the number of elements overlapped in a portion of the subsets (LO = t or LO >= t, 0 <= t <= T). Here, we developed algorithms to calculate the GHGD and discovered graceful…
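To make the setup concrete, here is a minimal Monte Carlo sketch of the quantity the abstract defines. It is not the authors' algorithm. It assumes each subset is drawn independently and uniformly without replacement, and the function names are hypothetical. For T = 2 and t = 2 the estimate can be checked against the classical hypergeometric probability mass function.

```python
import random
from collections import Counter
from math import comb


def simulate_overlap_pmf(N, sizes, t, trials=5000, at_least=False, seed=0):
    """Estimate P(X = x), where X is the number of elements contained in
    exactly t (or at least t, if at_least=True) of the drawn subsets."""
    rng = random.Random(seed)
    outcomes = Counter()
    for _ in range(trials):
        hits = [0] * N  # hits[e] = number of subsets containing element e
        for m in sizes:
            for e in rng.sample(range(N), m):
                hits[e] += 1
        if at_least:
            x = sum(1 for h in hits if h >= t)
        else:
            x = sum(1 for h in hits if h == t)
        outcomes[x] += 1
    return {x: c / trials for x, c in sorted(outcomes.items())}


def hypergeom_pmf(N, M0, M1, x):
    """Classical HGD: probability that two subsets of sizes M0 and M1
    drawn from N elements share exactly x elements."""
    if x < 0 or x > M0 or x > M1:
        return 0.0
    return comb(M0, x) * comb(N - M0, M1 - x) / comb(N, M1)


if __name__ == "__main__":
    est = simulate_overlap_pmf(20, [8, 5], t=2)
    for x in range(6):
        print(x, est.get(x, 0.0), round(hypergeom_pmf(20, 8, 5, x), 4))
```

Simulation only approximates the probabilities and scales poorly for rare events, which is one reason exact algorithms and closed-form moments are useful.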
Taxonomy
Topics: Bayesian Methods and Mixture Models
