Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques

Michela Lorandi; Anya Belz

arXiv:2405.07875 · cs.CL · May 14, 2024

TL;DR

This paper investigates the reproducibility of metric-based evaluations in controllable text generation, revealing inconsistencies and errors in original reports despite the availability of code and models.

Contribution

The paper demonstrates that rerunning a metric-based evaluation does not always reproduce the originally reported results and can uncover errors in the original reporting, highlighting reproducibility challenges in controllable text generation (CTG) research.

Findings

01 Reruns do not always match the original results

02 Inconsistencies between evaluations can reveal reporting errors

03 Reproducibility issues persist even when code and model checkpoints are available

Abstract

Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows, however, such reruns of evaluations do not always produce results that are the same as the original results, and can reveal errors in the reporting of the original work.
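To make "results that are the same as the original results" concrete, the sketch below shows one simple way a reader might quantify the gap between an originally reported metric score and a rerun score. It is an illustration only, not the procedure used in the paper: the function names are hypothetical, the scores are placeholder values rather than results from any study, and the two statistics (signed percentage change and a sample coefficient of variation) are assumptions about what a reasonable comparison could look like.

```python
from statistics import mean, stdev


def relative_difference(original: float, rerun: float) -> float:
    """Signed percentage change of the rerun score relative to the original."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return 100.0 * (rerun - original) / abs(original)


def coefficient_of_variation(scores: list[float]) -> float:
    """Sample standard deviation expressed as a percentage of the mean."""
    m = mean(scores)
    if m == 0:
        raise ValueError("mean score must be non-zero")
    return 100.0 * stdev(scores) / abs(m)


# Hypothetical placeholder scores: (originally reported, rerun).
# These are NOT values from the paper.
example_scores = {
    "metric_a": (50.0, 45.0),
    "metric_b": (0.80, 0.80),
}

for name, (original, rerun) in example_scores.items():
    diff = relative_difference(original, rerun)
    cv = coefficient_of_variation([original, rerun])
    print(f"{name}: change = {diff:+.2f}%, CV = {cv:.2f}%")
```

A percentage change shows the direction of the discrepancy, while a coefficient of variation gives a scale-free measure that can be compared across metrics with different ranges; which statistic is appropriate depends on the metric and on how many reruns are available.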

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training