Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation
Mingqi Gao, Xinyu Hu, Li Lin, Xiaojun Wan

TL;DR
This paper systematically analyzes 12 correlation measures for NLG meta-evaluation, revealing how different measures affect evaluation outcomes and proposing perspectives to assess their effectiveness.
Contribution
It provides a comprehensive comparison of correlation measures in NLG meta-evaluation and introduces three perspectives to evaluate their capabilities.
Findings
Pearson correlation with global grouping performs best in discriminative power and ranking consistency.
Kendall correlation measures are least sensitive to score granularity.
Different correlation measures significantly impact meta-evaluation results.
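To illustrate why the choice of measure matters, here is a minimal sketch that computes two correlation coefficients under two grouping methods on the same scores. It rests on several assumptions not stated in this summary: the scores form a systems-by-inputs matrix, "global" grouping pools all scores into one correlation, the contrasting "per-input" grouping correlates within each input and averages, and the Kendall variant is tau-a. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            total += (prod > 0) - (prod < 0)  # tied pairs contribute 0
    return total / (n * (n - 1) / 2)


def global_grouping(metric, human, corr):
    """Pool every (system, input) score into one list, then correlate once."""
    flat_metric = [v for row in metric for v in row]
    flat_human = [v for row in human for v in row]
    return corr(flat_metric, flat_human)


def per_input_grouping(metric, human, corr):
    """Correlate across systems separately for each input, then average."""
    n_inputs = len(metric[0])
    per_input = [
        corr([row[j] for row in metric], [row[j] for row in human])
        for j in range(n_inputs)
    ]
    return mean(per_input)


# Hypothetical toy data: rows are systems, columns are inputs.
human = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 5]]
metric = [[0.1, 0.3, 0.2, 0.6], [0.2, 0.5, 0.4, 0.7], [0.4, 0.45, 0.6, 0.9]]

for group_name, group in [("global", global_grouping), ("per-input", per_input_grouping)]:
    for corr_name, corr in [("pearson", pearson), ("kendall", kendall_tau_a)]:
        print(f"{group_name:9s} {corr_name:8s} {group(metric, human, corr):.3f}")
```

The same metric and human scores yield a different number for each grouping and coefficient pair, which is the kind of variation the findings above refer to.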
Abstract
The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients result in various types of correlation measures used in meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics and differences between these measures have not received sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely used NLG evaluation datasets and 32 evaluation metrics, revealing that different measures indeed impact the meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of meta-evaluation: discriminative power, ranking consistency,…
Taxonomy
Topics: Meta-analysis and systematic reviews · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
