TL;DR
This paper highlights the prevalence of invalid comparison practices in end-to-end relation extraction research, quantifies their impact, and advocates for standardized evaluation protocols to ensure reliable performance assessment.
Contribution
It identifies common invalid comparison patterns, empirically measures their effect on reported results, and promotes unified evaluation standards in end-to-end RE research.
Findings
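The most common invalid comparison leads to overestimating final RE performance by around 5% on ACE05.
The paper ablates two recent developments, BERT pretraining and span-level NER, to measure their effect on RE results.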
Standardized evaluation is essential for reliable progress tracking.
Abstract
Despite efforts to distinguish three different evaluation setups (Bekoulis et al., 2018), numerous end-to-end Relation Extraction (RE) articles present unreliable performance comparisons to previous work. In this paper, we first identify several patterns of invalid comparisons in published papers and describe them to avoid their propagation. We then propose a small empirical study to quantify the impact of the most common mistake and find that it leads to overestimating the final RE performance by around 5% on ACE05. We also seize this opportunity to study the unexplored ablations of two recent developments: the use of language model pretraining (specifically BERT) and span-level NER. This meta-analysis emphasizes the need for rigor in reporting both the evaluation setting and the dataset statistics, and we call for unifying the evaluation setting in end-to-end RE.
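To make the evaluation-setup issue concrete, here is a minimal sketch of how the same predictions can score differently under two matching criteria. It assumes the usual reading of two of the Bekoulis et al. (2018) setups: "Strict" requires the relation type plus the boundaries and entity types of both arguments to match, while "Boundaries" ignores the argument entity types. The `Relation` tuple, the helper names, and the toy data are hypothetical and not taken from the paper.

```python
from typing import Callable, Hashable, List, NamedTuple, Tuple


class Relation(NamedTuple):
    """Hypothetical relation record: token spans are (start, end) offsets."""
    head_span: Tuple[int, int]
    head_type: str
    tail_span: Tuple[int, int]
    tail_type: str
    label: str


def strict_key(r: Relation) -> Hashable:
    # Assumed "Strict" criterion: spans, entity types and relation label must all match.
    return (r.head_span, r.head_type, r.tail_span, r.tail_type, r.label)


def boundaries_key(r: Relation) -> Hashable:
    # Assumed "Boundaries" criterion: entity types of the arguments are ignored.
    return (r.head_span, r.tail_span, r.label)


def f1(gold: List[Relation], pred: List[Relation],
       key: Callable[[Relation], Hashable]) -> float:
    """F1 over relations, where two relations match if their keys are equal."""
    gold_keys = {key(r) for r in gold}
    pred_keys = {key(r) for r in pred}
    true_positives = len(gold_keys & pred_keys)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_keys)
    recall = true_positives / len(gold_keys)
    return 2 * precision * recall / (precision + recall)


# Toy data: the second prediction has correct spans and label but a wrong entity type.
gold = [
    Relation((0, 2), "PER", (5, 6), "ORG", "ORG-AFF"),
    Relation((8, 9), "PER", (11, 12), "GPE", "PHYS"),
]
pred = [
    Relation((0, 2), "PER", (5, 6), "ORG", "ORG-AFF"),
    Relation((8, 9), "PER", (11, 12), "LOC", "PHYS"),
]

strict_f1 = f1(gold, pred, strict_key)
boundaries_f1 = f1(gold, pred, boundaries_key)
print(f"Strict F1: {strict_f1:.2f}, Boundaries F1: {boundaries_f1:.2f}")
```

Because the more lenient criterion can only accept more predictions, reporting a score computed one way next to a baseline computed the other way inflates the apparent gain. The gap in this toy example is illustrative only and is unrelated to the roughly 5% figure measured on ACE05.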
