Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
Brian Thompson, Nitika Mathur, Daniel Deutsch, Huda Khayrallah

TL;DR
The paper introduces Soft Pairwise Accuracy (SPA), a new meta-metric for evaluating automatic metrics against human judgments, which improves stability, discrimination, and statistical significance over existing methods.
Contribution
SPA extends Pairwise Accuracy by incorporating statistical significance, addressing stability and discrimination issues in metric evaluation.
Findings
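SPA is more stable than PA when the number of systems or segments used for evaluation changes.
PA can only take a small set of distinct values, so many metrics receive exactly the same score; SPA fixes this.
SPA is more discriminative than PA, producing more statistically significant comparisons between metrics.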
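The limited set of PA output values can be checked by brute force. The sketch below assumes PA is the fraction of system pairs that a metric orders the same way as the human scores, and it uses hypothetical, tie-free scores. The names `pairwise_accuracy` and `distinct_pa_values` are illustrative and do not come from the paper.

```python
from itertools import combinations, permutations


def pairwise_accuracy(human, metric):
    """Fraction of system pairs ordered the same way by humans and by the metric."""
    pairs = list(combinations(range(len(human)), 2))
    agree = sum(
        (human[i] > human[j]) == (metric[i] > metric[j]) for i, j in pairs
    )
    return agree / len(pairs)


def distinct_pa_values(n_systems):
    """Enumerate every PA value reachable with n_systems (assumes no tied scores)."""
    human = list(range(n_systems))  # hypothetical system-level human scores
    return sorted(
        {pairwise_accuracy(human, list(p)) for p in permutations(human)}
    )


values = distinct_pa_values(4)
print(len(values))  # 7 distinct values for 4 systems: C(4, 2) + 1
```

Under these assumptions, N systems give only C(N, 2) + 1 possible PA scores, so with few systems and many candidate metrics, ties between metrics are unavoidable.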
Abstract
Selecting an automatic metric that best emulates human annotators is often non-trivial, because there is no clear definition of "best emulates." A meta-metric is required to compare the human judgments to the automatic metric scores, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric scores. We show that SPA is more stable than PA with respect to changes in the number of systems/segments used for evaluation. We also show that PA can only assign a small set of distinct output values to metrics, and this results in many metrics being artificially assigned the exact same PA score. We demonstrate that SPA fixes this issue. Finally, we show that SPA is more discriminative than PA, producing more…
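One way to read "incorporates the statistical significance of both the human judgments and the metric scores" is sketched below. It rests on two assumptions that the summary above does not spell out: that significance for each system pair is estimated with a paired sign-flip permutation test over segment-level scores, and that SPA averages 1 - |p_human - p_metric| over all system pairs. The function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def paired_permutation_pvalue(a, b, n_resamples=1000, rng=None):
    """One-sided p-value for 'system a beats system b' from segment scores.

    Assumption: a sign-flip test on per-segment score differences.
    """
    rng = rng or random.Random(0)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs)
    hits = 0
    for _ in range(n_resamples):
        # Randomly swap the two systems' scores on each segment.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if permuted >= observed:
            hits += 1
    return hits / n_resamples


def soft_pairwise_accuracy(human, metric, n_resamples=1000, seed=0):
    """human, metric: one list of segment-level scores per system."""
    rng = random.Random(seed)
    pairs = list(combinations(range(len(human)), 2))
    total = 0.0
    for i, j in pairs:
        p_h = paired_permutation_pvalue(human[i], human[j], n_resamples, rng)
        p_m = paired_permutation_pvalue(metric[i], metric[j], n_resamples, rng)
        # Full credit when humans and metric are equally confident in the ordering.
        total += 1.0 - abs(p_h - p_m)
    return total / len(pairs)


# Hypothetical toy data: 3 systems, 12 segments each.
human = [[base + 0.01 * k for k in range(12)] for base in (0.9, 0.5, 0.1)]
reversed_metric = [[-score for score in system] for system in human]

print(soft_pairwise_accuracy(human, human))            # close to 1
print(soft_pairwise_accuracy(human, reversed_metric))  # close to 0
```

Because each pair contributes a continuous value rather than a 0 or 1, a score built this way is not restricted to a handful of discrete outputs.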
Taxonomy
Topics: Sensor Technology and Measurement Systems · Industrial Vision Systems and Defect Detection
Methods: Sparse Evolutionary Training
