WRDScore: New Metric for Evaluation of Natural Language Generation Models
Ravil Mussabayev

TL;DR
This paper introduces WRDScore, a novel evaluation metric for natural language generation that uses optimal transport theory to capture semantic and syntactic variations that traditional overlap-based metrics such as ROUGE miss.
Contribution
We propose WRDScore, a lightweight, normalized, and effective metric based on optimal transport, addressing limitations of existing evaluation methods for language generation.
Findings
WRDScore correlates better with human judgments than existing metrics.
It balances precision and recall effectively in evaluation.
Experiments show WRDScore's superiority over traditional metrics.
Abstract
Evaluating natural language generation models, particularly for method name prediction, poses significant challenges. A robust metric must account for the versatility of method naming, considering both semantic and syntactic variations. Traditional overlap-based metrics, such as ROUGE, fail to capture these nuances. Existing embedding-based metrics often suffer from imbalanced precision and recall, lack normalized scores, or make unrealistic assumptions about sequences. To address these limitations, we leverage the theory of optimal transport and construct WRDScore, a novel metric that strikes a balance between simplicity and effectiveness. In the WRDScore framework, we define precision as the maximum degree to which the predicted sequence's tokens are included in the reference sequence, token by token. Recall is calculated as the total cost of the optimal transport plan that maps the…
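To make the two ingredients named in the abstract more concrete, here is a minimal sketch, not the paper's implementation. It assumes cosine similarity between token embeddings, uniform token weights, and a pure-Python Sinkhorn iteration (an entropy-regularized approximation of optimal transport, swapped in here to avoid a solver dependency). The embedding table `EMB` and the function names `soft_precision`, `sinkhorn_plan`, and `transport_cost` are hypothetical. How the transport cost is turned into the final recall value is not shown in the excerpt above, so the sketch stops at the raw cost.

```python
import math

# Hypothetical 2-D toy embeddings, for illustration only.
EMB = {
    "get": [1.0, 0.1],
    "fetch": [0.9, 0.2],
    "user": [0.1, 1.0],
    "name": [0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_precision(pred, ref):
    """Average, over predicted tokens, of the best cosine match in the reference."""
    best = [max(cosine(EMB[p], EMB[r]) for r in ref) for p in pred]
    return sum(best) / len(best)

def sinkhorn_plan(cost, reg=0.05, n_iters=500):
    """Approximate transport plan between two uniform distributions (assumption)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on the source tokens
    b = [1.0 / m] * m  # uniform mass on the target tokens
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(ref, pred):
    """Total cost of moving the reference tokens' mass onto the predicted tokens."""
    cost = [[1.0 - cosine(EMB[r], EMB[p]) for p in pred] for r in ref]
    plan = sinkhorn_plan(cost)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(ref)) for j in range(len(pred)))

if __name__ == "__main__":
    reference = ["get", "user"]
    prediction = ["fetch", "user"]
    print("soft precision:", soft_precision(prediction, reference))
    print("transport cost:", transport_cost(reference, prediction))
```

In this toy setup, a prediction identical to the reference gives a soft precision of about 1 and a transport cost close to 0, while a prediction with unrelated tokens raises the cost. An exact optimal transport solver could replace the Sinkhorn step without changing the rest of the sketch.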
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
