What the F-measure doesn't measure: Features, Flaws, Fallacies and Fixes
David M. W. Powers

TL;DR
This paper critically examines the limitations of the F-measure in information retrieval and machine learning, highlighting its flawed assumptions and proposing better alternative metrics for evaluation.
Contribution
It reveals fundamental flaws in the F-measure's assumptions and offers improved evaluation metrics for more accurate performance assessment.
Findings
F-measure's assumptions are flawed and lead to misleading evaluations
Alternative metrics provide more reliable performance measurement
The paper clarifies common misconceptions about F-measure usage
Abstract
The F-measure or F-score is one of the most commonly used single number measures in Information Retrieval, Natural Language Processing and Machine Learning, but it is based on a mistake, and the flawed assumptions render it unsuitable for use in most contexts! Fortunately, there are better alternatives.
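As a concrete illustration of the kind of limitation at issue, the sketch below shows one widely noted property of the F-measure: it is computed from true positives, false positives and false negatives only, so the number of true negatives never affects it. The abstract does not say which assumptions it has in mind, so this choice of example is an assumption on our part, and the function names and counts are hypothetical.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta from confusion-matrix counts. True negatives are not an input."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of all decisions that were correct (uses true negatives)."""
    return (tp + tn) / (tp + fp + fn + tn)


# Two hypothetical systems with identical positive-class behaviour
# but very different numbers of correctly rejected negatives.
small = dict(tp=40, fp=10, fn=10, tn=40)
large = dict(tp=40, fp=10, fn=10, tn=9940)

for name, c in (("small", small), ("large", large)):
    f1 = f_measure(c["tp"], c["fp"], c["fn"])
    acc = accuracy(c["tp"], c["fp"], c["fn"], c["tn"])
    print(f"{name}: F1={f1:.3f} accuracy={acc:.3f}")
```

Both cases give the same F1 even though accuracy differs sharply, because the extra true negatives in the second case are invisible to the F-measure.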
Taxonomy
Topics: Information Retrieval and Search Behavior · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
