Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Jungo Kasai; Keisuke Sakaguchi; Ronan Le Bras; Lavinia Dunagan; Jacob Morrison; Alexander R. Fabbri; Yejin Choi; Noah A. Smith

arXiv:2112.04139 · cs.CL · May 20, 2022

TL;DR

This paper introduces bidimensional leaderboards (Billboards) that simultaneously evaluate language generation models and metrics, enabling the development of better evaluation tools that correlate more closely with human judgments.

Contribution

The paper proposes a novel bidimensional leaderboard framework that tracks progress in both language models and evaluation metrics, fostering mutual improvement and more accurate assessment.

Findings

01

Ensemble metrics can outperform individual metrics in evaluation.

02

Most automatic metrics tend to overrate machine-generated text relative to human-written text.

03

Updating metrics is crucial as models become more human-like.

Abstract

Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation models and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly…
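The two axes described in the abstract can be illustrated with a minimal sketch: metrics are ranked by how well their scores track human judgments, and generators are ranked by a chosen metric. Everything here is an assumption for illustration only: the generator and metric names, the toy scores, and the use of Pearson correlation as the agreement measure are hypothetical and are not taken from the paper. The ensemble step is not shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one score per generator, in the same order.
generators = ["gen_a", "gen_b", "gen_c", "gen_d"]
human = [0.2, 0.5, 0.6, 0.9]  # assumed human judgment per generator
metric_scores = {
    "metric_x": [0.1, 0.4, 0.7, 0.8],
    "metric_y": [0.9, 0.3, 0.5, 0.4],
}


def rank_metrics(metric_scores, human):
    """Metric axis: sort metrics by correlation with human judgments."""
    return sorted(
        metric_scores,
        key=lambda name: pearson(metric_scores[name], human),
        reverse=True,
    )


def rank_generators(generators, scores):
    """Generator axis: sort generators by the scores of one metric."""
    pairs = sorted(zip(generators, scores), key=lambda p: p[1], reverse=True)
    return [name for name, _ in pairs]


metric_board = rank_metrics(metric_scores, human)
best_metric = metric_board[0]
generator_board = rank_generators(generators, metric_scores[best_metric])

print("metrics:", metric_board)
print("generators by", best_metric + ":", generator_board)
```

In this sketch a newly submitted metric changes the metric ranking, and a change in the top metric can in turn reorder the generators, which is the mutual dependence the abstract argues for.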

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification