What the F-measure doesn't measure: Features, Flaws, Fallacies and Fixes

David M. W. Powers

arXiv:1503.06410·cs.IR·September 13, 2019·63 cites

Open Access

TL;DR

This paper critically examines the limitations of the F-measure in information retrieval and machine learning, highlighting its flawed assumptions and proposing better alternative metrics for evaluation.

Contribution

It reveals fundamental flaws in the F-measure's assumptions and offers improved evaluation metrics for more accurate performance assessment.

Findings

01

F-measure's assumptions are flawed and lead to misleading evaluations

02

Alternative metrics provide more reliable performance measurement

03

The paper clarifies common misconceptions about F-measure usage

Abstract

The F-measure or F-score is one of the most commonly used single number measures in Information Retrieval, Natural Language Processing and Machine Learning, but it is based on a mistake, and the flawed assumptions render it unsuitable for use in most contexts! Fortunately, there are better alternatives.
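As a minimal illustration (not taken from the paper itself), the sketch below uses the standard definition F1 = 2TP / (2TP + FP + FN) to show one widely noted property of the measure: it never looks at true negatives. The abstract does not say which alternatives it has in mind, so Youden's J (also called informedness, recall plus inverse recall minus one) appears here only as an assumed example of a measure that does depend on true negatives. The function names and the counts are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from confusion counts; note that TN is not an input."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def informedness(tp: int, fp: int, fn: int, tn: int) -> float:
    """Youden's J: recall + inverse recall - 1 (assumed comparison measure)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    inverse_recall = tn / (tn + fp) if (tn + fp) else 0.0
    return recall + inverse_recall - 1.0


# Two hypothetical systems with identical TP, FP, FN but very different TN.
few_negatives = dict(tp=90, fp=10, fn=10, tn=0)
many_negatives = dict(tp=90, fp=10, fn=10, tn=890)

for name, c in [("few_negatives", few_negatives), ("many_negatives", many_negatives)]:
    f1 = f1_score(c["tp"], c["fp"], c["fn"])
    j = informedness(**c)
    print(f"{name}: F1={f1:.3f}  informedness={j:.3f}")
```

Both hypothetical systems receive the same F1, while the measure that uses true negatives separates them, which is one concrete way a single F-score can hide differences in behaviour on the negative class.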

Taxonomy

Topics: Information Retrieval and Search Behavior · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing