Macro F1 and Macro F1

Juri Opitz; Sebastian Burst

arXiv:1911.03347·cs.LG·February 9, 2021·24 cites

TL;DR

This paper examines two formulas for the macro F1 score and shows that they are rarely equivalent, so the choice of formula can change both a classifier's score and its ranking against other classifiers.

Contribution

It identifies and analyzes the differences between two formulas for macro F1, highlighting their implications for classifier evaluation.

Findings

01 The two formulas for macro F1 can differ by as much as 0.5.

02 The formulas may lead to different classifier rankings.

03 Differences are especially pronounced with skewed error distributions.
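
This page does not spell out the two formulas. A reasonable reading, stated here as an assumption, is that they are the arithmetic mean of per-class F1 scores ("averaged F1") and the harmonic mean of macro-averaged precision and recall ("F1 of averages"). The symbols below ($n$ classes, per-class precision $P_i$ and recall $R_i$) are illustrative notation, not taken from this page:

```latex
% Averaged F1: arithmetic mean of the per-class harmonic means
\text{averaged F1} \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{2\,P_i R_i}{P_i + R_i}

% F1 of averages: harmonic mean of the macro-averaged precision and recall
\text{F1 of averages} \;=\; \frac{2\,\bar{P}\,\bar{R}}{\bar{P} + \bar{R}},
\qquad \bar{P} = \frac{1}{n}\sum_{i=1}^{n} P_i,
\quad \bar{R} = \frac{1}{n}\sum_{i=1}^{n} R_i
```

Under this reading, the two values coincide when $P_i = R_i$ for every class, and they drift apart as some classes trade precision for recall while others do the opposite.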

Abstract

The 'macro F1' metric is frequently used to evaluate binary, multi-class and multi-label classification problems. Yet, we find that there exist two different formulas to calculate this quantity. In this note, we show that only under rare circumstances the two computations can be considered equivalent. More specifically, one formula well 'rewards' classifiers which produce a skewed error type distribution. In fact, the difference in outcome of the two computations can be as high as 0.5. The two computations may not only diverge in their scalar result but can also lead to different classifier rankings.
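
A minimal sketch of how the two computations can diverge, assuming they are the mean of per-class F1 scores and the harmonic mean of macro precision and macro recall. The function names, the confusion-matrix layout (rows are true classes, columns are predicted classes), and the example counts are all hypothetical and chosen only for illustration:

```python
def per_class_precision_recall(confusion):
    """Return (precisions, recalls) from a square confusion matrix.

    Assumed layout: confusion[t][p] counts items of true class t
    that were predicted as class p.
    """
    n = len(confusion)
    precisions, recalls = [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[t][k] for t in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        # Convention assumed here: an undefined ratio counts as 0.0.
        precisions.append(tp / predicted_k if predicted_k else 0.0)
        recalls.append(tp / actual_k if actual_k else 0.0)
    return precisions, recalls


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0.0 if both are zero)."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


def averaged_f1(confusion):
    """Arithmetic mean of the per-class F1 scores."""
    precisions, recalls = per_class_precision_recall(confusion)
    f1s = [harmonic_mean(p, r) for p, r in zip(precisions, recalls)]
    return sum(f1s) / len(f1s)


def f1_of_averages(confusion):
    """Harmonic mean of macro-averaged precision and macro-averaged recall."""
    precisions, recalls = per_class_precision_recall(confusion)
    macro_p = sum(precisions) / len(precisions)
    macro_r = sum(recalls) / len(recalls)
    return harmonic_mean(macro_p, macro_r)


# Hypothetical binary example with a skewed error distribution:
# class 0 is almost never predicted (high precision, low recall),
# class 1 absorbs the errors (low precision, high recall).
skewed = [[1, 9],
          [0, 10]]

# Hypothetical balanced example: precision equals recall in each class.
balanced = [[8, 2],
            [2, 8]]

for name, matrix in [("skewed", skewed), ("balanced", balanced)]:
    print(name, averaged_f1(matrix), f1_of_averages(matrix))
```

On the skewed matrix the two numbers differ noticeably, while on the balanced matrix they agree. This is consistent with the abstract's remark that one formula rewards a skewed error type distribution.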

Code & Models

Repositories

Aitslab/BioNLP
pytorch

Taxonomy

Topics: Text and Document Classification Technologies · Imbalanced Data Classification Techniques · Spam and Phishing Detection