Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Michela Lorandi, Anya Belz

TL;DR
This paper investigates the reproducibility of metric-based evaluations in controllable text generation, revealing inconsistencies and errors in original reports despite the availability of code and models.
Contribution
The paper demonstrates that rerunning metric-based evaluations often yields results that differ from those originally reported, and it uncovers errors in previous studies, highlighting reproducibility challenges in controllable text generation (CTG) research.
Findings
Reruns often differ from original results
Evaluation inconsistencies can reveal reporting errors
Reproducibility issues persist despite available resources
Abstract
Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows, however, such reruns of evaluations do not always produce the same results as the original work, and can reveal errors in the original reporting.
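To make the idea of comparing a rerun against originally reported scores concrete, here is a minimal sketch. It is not the authors' procedure: the `ScorePair` class, the helper functions, the metric labels, the 1% tolerance, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScorePair:
    metric: str        # hypothetical metric label
    original: float    # value as reported in the original work
    rerun: float       # value obtained when rerunning the evaluation


def relative_difference(original: float, rerun: float) -> float:
    """Signed deviation of the rerun from the original, as a fraction of the original."""
    if original == 0:
        raise ValueError("relative difference is undefined for an original score of 0")
    return (rerun - original) / abs(original)


def flag_discrepancies(pairs, tolerance=0.01):
    """Return the pairs whose rerun deviates from the original by more than `tolerance`."""
    return [p for p in pairs
            if abs(relative_difference(p.original, p.rerun)) > tolerance]


# Made-up numbers purely for illustration; they are not results from the paper.
pairs = [
    ScorePair("metric_a", original=50.0, rerun=50.0),
    ScorePair("metric_b", original=40.0, rerun=42.0),
    ScorePair("metric_c", original=80.0, rerun=60.0),
]

flagged = flag_discrepancies(pairs)
for p in flagged:
    diff = relative_difference(p.original, p.rerun)
    print(f"{p.metric}: original={p.original}, rerun={p.rerun}, "
          f"relative difference={diff:+.1%}")
```

A small deviation might plausibly come from differences in random seeds or library versions, whereas a large one is the kind of gap that could prompt a check for a reporting error. The tolerance separating the two is a judgment call and is assumed here, not taken from the paper.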
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
