A Fisher's exact test justification of the TF-IDF term-weighting scheme
Paul Sheridan, Zeyad Ahmed, Aitazaz A. Farooque

TL;DR
This paper gives a statistical justification for TF-IDF by linking it to Fisher's exact test, showing that the weighting scheme can be read as a significance measure in text analysis.
Contribution
It demonstrates that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the p-value from a one-tailed Fisher's exact test, which gives the heuristic a theoretical foundation.
Findings
TF-ICF is closely related to the negative log of a one-tailed Fisher's exact test p-value
The negative log p-value approximates TF-IDF under regularity conditions
As collection size grows, the p-value-based measure converges to TF-IDF
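To make the two quantities in these findings concrete, here is a minimal Python sketch that computes both for a single term-document pair. It rests on assumptions not stated in this summary: TF-ICF is taken to be the term's count in the document times the log of (total tokens in the collection / total occurrences of the term), and the one-tailed Fisher's exact test p-value is taken to be the upper tail of the matching hypergeometric distribution. The function names and the example counts are hypothetical, and the sketch does not attempt to reproduce the paper's regularity conditions or its convergence result.

```python
import math


def tf_icf(k, term_total, collection_total):
    """TF-ICF under the ASSUMED definition k * log(N / K).

    k                -- occurrences of the term in the document
    term_total       -- occurrences of the term in the whole collection (K)
    collection_total -- total number of tokens in the collection (N)
    """
    return k * math.log(collection_total / term_total)


def fisher_right_tail_p(k, doc_len, term_total, collection_total):
    """One-tailed Fisher's exact test p-value, computed as
    P(X >= k) for X ~ Hypergeometric(N, K, n), with n = doc_len.
    """
    upper = min(doc_len, term_total)
    # Exact integer arithmetic for the tail sum, then one division.
    numer = sum(
        math.comb(term_total, x)
        * math.comb(collection_total - term_total, doc_len - x)
        for x in range(k, upper + 1)
    )
    return numer / math.comb(collection_total, doc_len)


# Hypothetical counts: a 100-token document in a 10,000-token collection,
# where the term occurs 5 times in the document and 20 times overall.
k, doc_len, term_total, collection_total = 5, 100, 20, 10_000
p = fisher_right_tail_p(k, doc_len, term_total, collection_total)
print("TF-ICF:        ", tf_icf(k, term_total, collection_total))
print("-log(p-value): ", -math.log(p))
```

The two printed numbers need not agree for any one choice of counts. The sketch only shows which quantities are being compared. For very large collections the p-value may underflow to zero in floating point, so a log-space computation would be needed there.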
Abstract
Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term's occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the p-value from a one-tailed version of Fisher's…
