Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Lavinia Dunagan, Jacob Morrison, Alexander R. Fabbri, Yejin Choi, Noah A. Smith

TL;DR
This paper introduces bidimensional leaderboards (Billboards) that simultaneously evaluate language generation models and metrics, enabling the development of better evaluation tools that correlate more closely with human judgments.
Contribution
The paper proposes a novel bidimensional leaderboard framework that tracks progress in both language models and evaluation metrics, fostering mutual improvement and more accurate assessment.
Findings
Ensemble metrics can outperform individual metrics in evaluation.
Most automatic metrics tend to overrate machine-generated text relative to human-written text.
Updating metrics is crucial as models become more human-like.
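As a rough illustration of the first finding, the sketch below shows one way an ensemble could beat its individual members: standardize each metric, greedily add metrics to an equal-weight average, and keep a metric only if it raises the Pearson correlation with human scores. This is not the paper's procedure. The function names (`pearson`, `standardize`, `greedy_ensemble`), the equal weighting, the greedy selection rule, and all toy scores are assumptions made up for this example.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def standardize(xs):
    """Z-score a list of scores (assumes the scores are not all identical)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def greedy_ensemble(metric_scores, human_scores):
    """Greedily pick metrics whose equal-weight average of z-scores
    correlates best with human scores. Hypothetical selection rule."""
    z = {name: standardize(vals) for name, vals in metric_scores.items()}
    selected, best_corr = [], float("-inf")
    while len(selected) < len(z):
        candidates = []
        for name in z:
            if name in selected:
                continue
            members = selected + [name]
            # Equal-weight linear combination of the standardized metrics.
            combo = [mean(col) for col in zip(*(z[m] for m in members))]
            candidates.append((pearson(combo, human_scores), name))
        corr, name = max(candidates)
        if corr <= best_corr:
            break  # no remaining metric improves the correlation
        selected.append(name)
        best_corr = corr
    return selected, best_corr


# Toy data, invented for illustration: one score per generated output.
human = [1, 2, 3, 4, 5, 6]
metrics = {
    "metric_a": [2, 1, 3, 4, 4, 7],  # human score plus some noise
    "metric_b": [0, 3, 3, 4, 6, 5],  # human score minus the same noise
    "metric_c": [3, 1, 4, 1, 5, 2],  # mostly unrelated to human score
}

individual = {name: pearson(vals, human) for name, vals in metrics.items()}
selected, ensemble_corr = greedy_ensemble(metrics, human)
print("individual correlations:", individual)
print("selected metrics:", selected)
print("ensemble correlation:", ensemble_corr)
```

In this toy setup the two noisy metrics err in opposite directions, so averaging them cancels the noise and the ensemble correlates with the human scores more strongly than either metric alone, while the unrelated metric is left out.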
Abstract
Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation models and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
