TL;DR
This paper introduces RankME, a novel evaluation method for natural language generation that enhances the reliability and consistency of human ratings through a rank-based magnitude estimation approach, enabling multi-criteria assessment and cost-effective system ranking.
Contribution
The paper proposes RankME, a new evaluation technique combining continuous scales and relative assessments to improve human judgment quality in NLG evaluation.
Findings
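RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods.
It supports evaluating NLG systems on multiple, distinct criteria, which is important for error analysis.
Combined with Bayesian estimation of system quality, RankME is a cost-effective alternative for ranking multiple NLG systems.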
Abstract
Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.
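To make the idea of combining continuous scales with relative assessments concrete, here is a minimal sketch and not the paper's actual procedure. It assumes each rater scores several system outputs side by side against a reference item with a fixed value of 100. The reference value, the system names, the sample scores, and the simple averaging step are all illustrative assumptions. The Bayesian estimation of system quality mentioned in the abstract is not shown.

```python
from statistics import mean

REFERENCE = 100.0  # assumed value assigned to the reference item (hypothetical)

def normalize(ratings, reference=REFERENCE):
    """Scale one rater's raw magnitude scores by the reference value."""
    return {system: score / reference for system, score in ratings.items()}

def mean_scores(all_ratings, reference=REFERENCE):
    """Average the normalized scores of every rater, per system."""
    normalized = [normalize(r, reference) for r in all_ratings]
    systems = all_ratings[0].keys()
    return {s: mean(n[s] for n in normalized) for s in systems}

def implied_ranking(scores):
    """Order systems from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ratings: each dict is one rater scoring three outputs together.
raters = [
    {"sys_a": 150.0, "sys_b": 100.0, "sys_c": 50.0},
    {"sys_a": 120.0, "sys_b": 130.0, "sys_c": 80.0},
]

scores = mean_scores(raters)
ranking = implied_ranking(scores)
print(scores)
print(ranking)
```

Because the scores stay on a continuous scale, the same data yields both a ranking and a sense of how far apart the systems are, which a plain ordinal ranking would discard.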
