Language Model Augmented Relevance Score
Ruibo Liu, Jason Wei, Soroush Vosoughi

TL;DR
The paper introduces MARS, a context-aware NLG evaluation metric that uses language models and reinforcement learning to generate augmented references, improving correlation with human judgments over existing metrics.
Contribution
MARS is a novel, context-aware evaluation metric that leverages language models and reinforcement learning so that its scores for NLG outputs align more closely with human judgments.
Findings
MARS outperforms seven existing metrics in correlation with human judgments.
MARS better differentiates well-formed from adversarial NLG candidates.
MARS shows higher robustness across multiple NLG tasks.
Abstract
Although automated metrics are commonly used to evaluate NLG systems, they often correlate poorly with human judgements. Newer metrics such as BERTScore have addressed many weaknesses in prior metrics such as BLEU and ROUGE, which rely on n-gram matching. These newer methods, however, are still limited in that they do not consider the generation context, so they cannot properly reward generated text that is correct but deviates from the given reference. In this paper, we propose Language Model Augmented Relevance Score (MARS), a new context-aware metric for NLG evaluation. MARS leverages off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references, which are then used as additional references to score generated text. Compared with seven existing metrics in three common NLG…
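The idea of scoring a candidate against augmented references in addition to the human ones can be illustrated with a minimal sketch. This is not the paper's scoring function: `token_f1` is a simple token-overlap stand-in for the similarity measure, taking the maximum over references is an assumed aggregation, and the augmented reference is hand-written here rather than produced by a language model guided by reinforcement learning. All function names are hypothetical.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a stand-in similarity, not MARS itself)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(candidate, human_refs, augmented_refs=()):
    """Score a candidate against human references plus any augmented references.

    Assumption: the best-matching reference determines the score (max aggregation).
    """
    refs = list(human_refs) + list(augmented_refs)
    if not refs:
        raise ValueError("at least one reference is required")
    return max(token_f1(candidate, r) for r in refs)


candidate = "the cat sat on the mat"
human_refs = ["a cat is sitting on the mat"]
# Hand-written placeholder for a reference that an LM would generate from the context.
augmented_refs = ["the cat sat on a mat"]

print(multi_reference_score(candidate, human_refs))
print(multi_reference_score(candidate, human_refs, augmented_refs))
```

In this toy case the candidate is a valid paraphrase that differs in wording from the single human reference, so adding a reference closer to the candidate's phrasing raises its score. That is the effect the abstract describes for generated text that is correct but deviates from the given reference.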
