Macro F1 and Macro F1
Juri Opitz, Sebastian Burst

TL;DR
This paper examines two different formulas for calculating the macro F1 score, showing that they are rarely equivalent and that the choice between them can change both the reported score and the resulting classifier ranking.
Contribution
It identifies and analyzes the differences between two formulas for macro F1, highlighting their implications for classifier evaluation.
Findings
The two formulas for macro F1 can differ by as much as 0.5.
The formulas may lead to different classifier rankings.
Differences are especially pronounced with skewed error distributions.
Abstract
The 'macro F1' metric is frequently used to evaluate binary, multi-class and multi-label classification problems. Yet, we find that there exist two different formulas to calculate this quantity. In this note, we show that the two computations can be considered equivalent only under rare circumstances. More specifically, one formula strongly 'rewards' classifiers which produce a skewed error type distribution. In fact, the difference in outcome of the two computations can be as high as 0.5. The two computations may not only diverge in their scalar result but can also lead to different classifier rankings.
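The text above does not spell out the two formulas. The sketch below assumes they are (a) the arithmetic mean of the per-class F1 scores and (b) the harmonic mean of macro-averaged precision and macro-averaged recall. The function names, the zero-division convention, and the two confusion matrices are hypothetical and chosen only for illustration.

```python
def per_class_precision_recall(conf):
    """conf[i][j] = number of items with true class i predicted as class j."""
    n = len(conf)
    precisions, recalls = [], []
    for k in range(n):
        tp = conf[k][k]
        predicted_k = sum(conf[i][k] for i in range(n))  # column sum
        actual_k = sum(conf[k])                          # row sum
        # Assumed convention: an undefined ratio counts as 0.
        precisions.append(tp / predicted_k if predicted_k else 0.0)
        recalls.append(tp / actual_k if actual_k else 0.0)
    return precisions, recalls


def harmonic_mean(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0


def averaged_f1(conf):
    """Assumed formula (a): mean of the per-class F1 scores."""
    ps, rs = per_class_precision_recall(conf)
    return sum(harmonic_mean(p, r) for p, r in zip(ps, rs)) / len(ps)


def f1_of_averages(conf):
    """Assumed formula (b): F1 of macro precision and macro recall."""
    ps, rs = per_class_precision_recall(conf)
    return harmonic_mean(sum(ps) / len(ps), sum(rs) / len(rs))


# Hypothetical classifier A: errors spread evenly over both classes.
conf_a = [[6, 4],
          [4, 6]]
# Hypothetical classifier B: all errors are class-0 items labelled as class 1.
conf_b = [[1, 9],
          [0, 10]]

for name, conf in (("A", conf_a), ("B", conf_b)):
    print(name, round(averaged_f1(conf), 4), round(f1_of_averages(conf), 4))
```

In this toy setup both formulas give 0.6 for classifier A, while classifier B scores roughly 0.44 under (a) and roughly 0.64 under (b). Formula (a) therefore ranks A above B and formula (b) ranks B above A, which illustrates how a skewed error distribution can flip a ranking.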
Taxonomy
Topics: Text and Document Classification Technologies · Imbalanced Data Classification Techniques · Spam and Phishing Detection
