Testing High-dimensional Multinomials with Applications to Text Analysis
T. Tony Cai, Zheng Tracy Ke, Paxton Turner

TL;DR
This paper develops a test for equality of high-dimensional multinomial distributions that attains the optimal detection boundary. The test is motivated by text analysis and is demonstrated in simulations and on two real-world datasets: Amazon movie reviews and statistical paper abstracts.
Contribution
The paper introduces a new test statistic for comparing groups of high-dimensional multinomials and shows that it achieves the optimal detection boundary across the entire parameter space of interest, which supports inference in text mining applications.
Findings
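The test statistic has an asymptotic standard normal distribution under the null hypothesis.
The proposed test achieves the optimal detection boundary across the entire parameter space of interest.
The method is validated in simulation studies and on two real-world datasets (Amazon movie reviews and statistical paper abstracts).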
Abstract
Motivated by applications in text mining and discrete distribution inference, we investigate the testing for equality of probability mass functions of groups of high-dimensional multinomial distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null, is proposed. The optimal detection boundary is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyze two real-world datasets to examine variation among consumer reviews of Amazon movies and diversity of statistical paper abstracts.
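Illustration
The testing problem in the abstract can be made concrete with a small sketch. The code below is not the statistic proposed in the paper, whose form is not given on this page. It is a simple stand-in, under stated assumptions, that shows what "testing equality of probability mass functions across groups" means for word-count data. The function names (`group_difference_stat`, `permutation_pvalue`), the weighted squared-distance statistic, and the permutation calibration are all hypothetical choices made for illustration; the paper instead calibrates its statistic with an asymptotic standard normal distribution.

```python
import numpy as np


def group_difference_stat(counts, labels):
    """Illustrative statistic (an assumption, not the paper's): for each group,
    the squared distance between the group's word frequencies and the pooled
    word frequencies, weighted by the group's total word count."""
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    pooled = counts.sum(axis=0) / counts.sum()
    stat = 0.0
    for g in np.unique(labels):
        block = counts[labels == g]
        n_g = block.sum()                      # total words in group g
        freq_g = block.sum(axis=0) / n_g       # empirical PMF of group g
        stat += n_g * np.sum((freq_g - pooled) ** 2)
    return stat


def permutation_pvalue(counts, labels, n_perm=500, seed=0):
    """Calibrate the statistic by shuffling group labels across documents.
    Under the null that all documents share one PMF, labels are exchangeable."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = group_difference_stat(counts, labels)
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)
        if group_difference_stat(counts, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    vocab_size = 50
    pmf_a = np.full(vocab_size, 1 / vocab_size)
    # Group B concentrates half of its mass on the first five words.
    pmf_b = np.r_[np.full(5, 0.1), np.full(45, 0.5 / 45)]
    # Each row is one document: a multinomial draw of 100 words.
    counts = np.vstack([
        rng.multinomial(100, pmf_a, size=10),
        rng.multinomial(100, pmf_b, size=10),
    ])
    labels = np.array([0] * 10 + [1] * 10)
    print("permutation p-value:", permutation_pvalue(counts, labels))
```

Each row of `counts` plays the role of one document (a multinomial draw over a vocabulary), and `labels` assigns documents to groups. A permutation calibration is used here only because it needs no distributional theory; it does not reproduce the optimality guarantees described in the abstract.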
Taxonomy
Topics: Bayesian Methods and Mixture Models · Data-Driven Disease Surveillance
Methods: Test
