An Empirical Study into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation
Thomas A. Lampert, André Stumpf, Pierre Gançarski

TL;DR
This study investigates how annotator disagreement affects the evaluation of object detection algorithms in computer vision, and finds that ground truth variability significantly influences both performance assessments and the ranking of detectors.
Contribution
It introduces a methodology to quantify inter-annotator variance, analyzes the effect of that variance on ground truth (GT) estimation, and examines how different GT estimation methods influence algorithm evaluation.
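One simple way to picture "quantifying inter-annotator variance" is to compare every pair of annotators' binary masks and average an overlap score. The sketch below uses the Jaccard index on flattened masks. The choice of measure, the function names (`jaccard`, `pairwise_agreement`), and the toy data are all assumptions made for illustration; the summary does not state which agreement measure the paper uses.

```python
from itertools import combinations

def jaccard(mask_a, mask_b):
    """Jaccard overlap between two flattened binary masks of equal length."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks agree perfectly by convention (an assumption of this sketch).
    return inter / union if union else 1.0

def pairwise_agreement(masks):
    """Jaccard overlap for every unordered pair of annotators."""
    return [jaccard(a, b) for a, b in combinations(masks, 2)]

# Hypothetical toy data: three annotators marking a 4-pixel strip.
annotations = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]

scores = pairwise_agreement(annotations)
mean_agreement = sum(scores) / len(scores)
print(scores, mean_agreement)
```

A low mean overlap indicates that the annotators mark noticeably different regions, which is the kind of disagreement the study sets out to measure.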
Findings
Annotator agreement is very low for linear object detection.
Ground truth estimation methods significantly affect detector ranking.
Consensus voting can overestimate algorithm performance.
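To make the dependence on the GT estimate concrete, the following minimal sketch builds a consensus-voting GT at several vote thresholds and scores the same detector output against each one. The F1 score, the helper names (`consensus_gt`, `f1_score`), and the toy masks are assumptions for illustration only; they are not taken from the paper and do not reproduce its results.

```python
def consensus_gt(masks, min_votes):
    """Mark a pixel as object if at least `min_votes` annotators marked it."""
    return [1 if sum(votes) >= min_votes else 0 for votes in zip(*masks)]

def f1_score(detection, gt):
    """Pixel-wise F1 between a binary detection and a binary ground truth."""
    tp = sum(1 for d, g in zip(detection, gt) if d and g)
    fp = sum(1 for d, g in zip(detection, gt) if d and not g)
    fn = sum(1 for d, g in zip(detection, gt) if g and not d)
    denom = 2 * tp + fp + fn
    # Empty detection against empty GT counts as a perfect match here.
    return 2 * tp / denom if denom else 1.0

# Hypothetical toy data: three annotators and one detector on a 6-pixel strip.
annotations = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
]
detection = [0, 1, 1, 1, 0, 0]

gt_by_threshold = {k: consensus_gt(annotations, k) for k in (1, 2, 3)}
f1_by_threshold = {k: f1_score(detection, gt) for k, gt in gt_by_threshold.items()}

for k in (1, 2, 3):
    print(k, gt_by_threshold[k], round(f1_by_threshold[k], 3))
```

In this toy case the same detection scores 0.75, 1.0, and 0.5 as the required number of votes rises from one to three, so the measured performance changes with the GT estimation rule even though the detector output is fixed.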
Abstract
Although agreement between annotators has been studied in the past from a statistical viewpoint, little work has attempted to quantify the extent to which this phenomenon affects the evaluation of computer vision (CV) object detection algorithms. Many researchers utilise ground truth (GT) in experiments and more often than not this GT is derived from one annotator's opinion. How does the difference in opinion affect an algorithm's evaluation? Four examples of typical CV problems are chosen, and a methodology is applied to each to quantify the inter-annotator variance and to offer insight into the mechanisms behind agreement and the use of GT. It is found that when detecting linear objects annotator agreement is very low. The agreement in object position, linear or otherwise, can be partially explained through basic image properties. Automatic object detectors are compared to annotator…
