A Fisher's exact test justification of the TF-IDF term-weighting scheme

Paul Sheridan; Zeyad Ahmed; Aitazaz A. Farooque

arXiv:2507.15742 · cs.CL · July 31, 2025

TL;DR

This paper gives a statistical justification for TF-IDF by relating it to Fisher's exact test, so that the term-weighting scheme can be read as a measure of statistical significance in text analysis.

Contribution

It demonstrates that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the p-value from a one-tailed Fisher's exact test, which gives the weighting scheme a significance-testing foundation.

Findings

01 TF-ICF is closely related to the negative logarithm of a one-tailed Fisher's exact test p-value

02 The negative log p-value approximates TF-IDF under regularity conditions

03 As the collection size grows, the negative log p-value converges to TF-IDF
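The relationship summarized above can be explored numerically. The following is a minimal sketch, not the authors' implementation: it assumes the one-tailed test is the upper tail of a hypergeometric distribution over token counts, and it assumes one common definition of TF-ICF (term count in the document times the log of collection size over the term's collection count). The function names and the counts `k`, `n`, `K`, `N` are illustrative choices, and the paper's exact definitions and regularity conditions may differ.

```python
import math


def neg_log_fisher_p(k: int, n: int, K: int, N: int) -> float:
    """Negative log of the one-tailed p-value P(X >= k).

    Assumed setup (hypothetical notation, not taken from the paper):
      N = total tokens in the collection
      K = occurrences of the term in the collection
      n = tokens in the document
      k = occurrences of the term in the document
    X is hypergeometric: the term count in n tokens drawn without replacement.
    """
    upper = min(n, K)
    if not 0 <= k <= upper:
        raise ValueError("k must lie between 0 and min(n, K)")
    # Exact integer arithmetic avoids floating-point underflow in the tail sum.
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1))
    if tail == 0:
        raise ValueError("the requested table is impossible for these counts")
    total = math.comb(N, n)
    # math.log accepts arbitrarily large Python integers.
    return math.log(total) - math.log(tail)


def tf_icf(k: int, K: int, N: int) -> float:
    """One common TF-ICF form (an assumption here): k * log(N / K)."""
    return k * math.log(N / K)


# Illustrative counts only; they are not data from the paper.
N, K, n = 10_000, 20, 100
for k in (1, 2, 4):
    print(k, round(neg_log_fisher_p(k, n, K, N), 3), round(tf_icf(k, K, N), 3))
```

Both scores grow as the term becomes more concentrated in the document. How closely they agree depends on the exact TF-ICF form and the conditions the paper states, which this sketch does not attempt to reproduce.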

Abstract

Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term's occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher's…
