Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
Varvara Arzt, Allan Hanbury

TL;DR
This paper critically examines the transparency issues in relation extraction benchmarks and leaderboards, highlighting their limitations and advocating for better documentation and evaluation practices to genuinely measure progress.
Contribution
The paper identifies key transparency shortcomings in RE benchmarks and leaderboards and proposes improvements to documentation and evaluation practices so that model performance can be assessed more faithfully.
Findings
Widely used RE benchmarks such as TACRED and NYT are highly imbalanced and contain noisy labels.
Current leaderboards rank systems mainly on aggregate metrics such as F1-score.
Class-based performance metrics are often missing, which obscures the true generalisation capabilities of models.
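The gap between aggregate and class-based metrics noted in the findings can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the gold and predicted labels are invented toy data (only the label strings follow TACRED naming), the helper names `per_class_scores`, `micro_f1`, and `macro_f1` are hypothetical, and the micro-F1 here counts every class, whereas benchmark-specific scorers may differ (for example by excluding a negative class). It is not the authors' evaluation code.

```python
from collections import Counter

def per_class_scores(gold, pred):
    """Return {label: (precision, recall, f1)} for every label in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label is a false positive
            fn[g] += 1  # gold label is a false negative
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def micro_f1(gold, pred):
    """Micro-F1 over all classes; for single-label data this equals accuracy."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Invented, heavily imbalanced toy data: 90 / 8 / 2 instances per class.
gold = ["no_relation"] * 90 + ["per:title"] * 8 + ["org:founded_by"] * 2
# A hypothetical system that defaults to the majority class when unsure.
pred = ["no_relation"] * 90 + ["per:title"] * 4 + ["no_relation"] * 6

scores = per_class_scores(gold, pred)
print(f"micro-F1: {micro_f1(gold, pred):.3f}")
print(f"macro-F1: {macro_f1(scores):.3f}")
for label, (p, r, f1) in scores.items():
    print(f"{label:16s} P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

On this toy data the aggregate score looks strong while the rarest class is never predicted correctly, which is the kind of behaviour a leaderboard reporting a single number would hide.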
Abstract
This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP, with a focus on the relation extraction (RE) task. Existing RE benchmarks often suffer from insufficient documentation, lacking crucial details such as data sources, inter-annotator agreement, the algorithms used for the selection of instances for datasets, and information on potential biases like dataset imbalance. Progress in RE is frequently measured by leaderboards that rank systems based on evaluation methods, typically limited to aggregate metrics like F1-score. However, the absence of detailed performance analysis beyond these metrics can obscure the true generalisation capabilities of models. Our analysis reveals that widely used RE benchmarks, such as TACRED and NYT, tend to be highly imbalanced and contain noisy labels. Moreover, the lack of…
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Focus
