TL;DR
This paper shows that F1-score and AVPR are unreliable metrics for evaluating anomaly detection: both are highly sensitive to the contamination rate and are not comparable across datasets. It advocates a more robust evaluation based on metrics such as AUC.
Contribution
The paper reveals the bias introduced by certain evaluation protocols in anomaly detection and proposes a more robust, standardized evaluation procedure using metrics like AUC.
Findings
F1-score and AVPR are highly sensitive to contamination rates.
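Modifying the train-test split procedure can artificially inflate F1-score and AVPR values.
F1-score and AVPR are not suitable for comparing performance across datasets, because they do not reflect the intrinsic difficulty of modeling the data.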
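The contamination effect can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the paper's protocol: anomaly scores are taken as evenly spaced quantiles of two unit-variance Gaussians (normal points centered at 0, anomalies at 2), AVPR is computed as average precision, and F1 is measured after flagging the top-k scores with k equal to the number of anomalies. The function names `quantile_scores` and `evaluate` are hypothetical. The detector's scores are identical in both runs; only the proportion of anomalies in the test set changes.

```python
from statistics import NormalDist


def quantile_scores(mean, n):
    """Deterministic stand-in for n detector scores: quantiles of N(mean, 1)."""
    dist = NormalDist(mu=mean, sigma=1.0)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


def evaluate(n_normal, n_anomaly):
    """Return (contamination, AUC, F1, AVPR) for a synthetic test set."""
    # Label 1 = anomaly, 0 = normal; a higher score means "more anomalous".
    # The means 0.0 and 2.0 are illustrative assumptions.
    scored = [(s, 0) for s in quantile_scores(0.0, n_normal)]
    scored += [(s, 1) for s in quantile_scores(2.0, n_anomaly)]

    # AUC: fraction of (anomaly, normal) pairs where the anomaly scores higher.
    # Exact ties are counted in favor of the anomaly, which is negligible here.
    scored.sort()
    normals_below, correct_pairs = 0, 0
    for _, label in scored:
        if label == 0:
            normals_below += 1
        else:
            correct_pairs += normals_below
    auc = correct_pairs / (n_normal * n_anomaly)

    # Walk the ranking from the highest score down.
    scored.reverse()
    tp, precision_sum, f1 = 0, 0.0, 0.0
    for rank, (_, label) in enumerate(scored, start=1):
        if label == 1:
            tp += 1
            precision_sum += tp / rank  # precision at each retrieved anomaly
        if rank == n_anomaly:
            # Flagging exactly n_anomaly points makes precision == recall == F1.
            f1 = tp / n_anomaly
    avpr = precision_sum / n_anomaly  # average precision

    contamination = n_anomaly / (n_normal + n_anomaly)
    return contamination, auc, f1, avpr


low = evaluate(n_normal=5000, n_anomaly=250)
high = evaluate(n_normal=5000, n_anomaly=5000)
for name, (c, auc, f1, avpr) in (("low", low), ("high", high)):
    print(f"{name}: contamination={c:.3f} AUC={auc:.3f} F1={f1:.3f} AVPR={avpr:.3f}")
```

Under these assumptions, AUC should stay nearly constant across the two runs, while F1 and AVPR rise markedly with the contamination rate. The reason is that precision depends on the class proportions, whereas the true and false positive rates underlying AUC do not. A split that moves more anomalies into the test set therefore raises F1 and AVPR without any change to the model.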
Abstract
Anomaly detection is a widely explored domain in machine learning. Many models are proposed in the literature, and compared through different metrics measured on various datasets. The most popular metrics used to compare performances are F1-score, AUC and AVPR. In this paper, we show that F1-score and AVPR are highly sensitive to the contamination rate. One consequence is that it is possible to artificially increase their values by modifying the train-test split procedure. This leads to misleading comparisons between algorithms in the literature, especially when the evaluation protocol is not well detailed. Moreover, we show that the F1-score and the AVPR cannot be used to compare performances on different datasets as they do not reflect the intrinsic difficulty of modeling such data. Based on these observations, we claim that F1-score and AVPR should not be used as metrics for anomaly…
