QRA++: Quantified Reproducibility Assessment for Common Types of Results in Natural Language Processing

Anya Belz

arXiv:2505.17043·cs.CL·May 26, 2025

TL;DR

QRA++ is a quantitative approach to reproducibility assessment in NLP that produces continuous-valued degree-of-reproducibility scores, so that results can be compared across reproduction studies and the factors that influence reproducibility can be examined.

Contribution

QRA++ provides a standardized way to quantify reproducibility across studies, taking into account experiment similarity, system type, and evaluation method.

Findings

01. Reproducibility varies with experiment similarity.

02. System type influences reproducibility.

03. Evaluation method impacts reproducibility.

Abstract

Reproduction studies reported in NLP provide individual data points which in combination indicate worryingly low levels of reproducibility in the field. Because each reproduction study reports quantitative conclusions based on its own, often not explicitly stated, criteria for reproduction success/failure, the conclusions drawn are hard to interpret, compare, and learn from. In this paper, we present QRA++, a quantitative approach to reproducibility assessment that (i) produces continuous-valued degree of reproducibility assessments at three levels of granularity; (ii) utilises reproducibility measures that are directly comparable across different studies; and (iii) grounds expectations about degree of reproducibility in degree of similarity between experiments. QRA++ enables more informative reproducibility assessments to be conducted, and conclusions to be drawn about what causes…
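
To make the idea of a continuous-valued, cross-study-comparable degree of reproducibility more concrete, here is a minimal sketch that computes a small-sample-corrected coefficient of variation over the scores from an original experiment and its reproductions. The choice of this measure, the function name `cv_star`, and the example scores are assumptions made for illustration only; this summary does not state which measures QRA++ actually uses. The sketch also assumes the scores are on a scale whose minimum is 0.

```python
import statistics


def cv_star(scores):
    """Illustrative degree-of-reproducibility score (hypothetical helper).

    Returns the coefficient of variation of `scores`, in percent, with a
    small-sample correction factor of (1 + 1/(4n)). Lower values mean the
    original and reproduction scores agree more closely; 0 means identical.
    Assumes scores lie on a scale that starts at 0.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need the original score plus at least one reproduction")
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    sd = statistics.stdev(scores)      # sample standard deviation (n - 1)
    cv = 100.0 * sd / abs(mean)        # unadjusted CV, in percent
    return (1.0 + 1.0 / (4.0 * n)) * cv


# Made-up example: one original score and one reproduction score.
example = cv_star([20.0, 22.0])
print(f"{example:.2f}")
```

Because the result is expressed as a percentage of the mean rather than in the units of the underlying metric, values of this kind can be compared across studies that use different evaluation measures, which is the property the abstract emphasises.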
