Are Some Words Worth More than Others?

Shiran Dudy; Steven Bedrick

arXiv:2010.06069 · cs.CL · October 15, 2020

TL;DR

This paper introduces two new intrinsic evaluation metrics for language models that better capture linguistic properties and performance variations across word frequencies, revealing differences hidden by traditional accuracy metrics.

Contribution

The paper proposes novel evaluation metrics within a word prediction framework to provide a more comprehensive assessment of language model performance beyond accuracy.

Findings

1. New metrics reveal functional differences between models
2. Traditional metrics are confounded by word frequency effects
3. Proposed measures offer a holistic view of language model behavior

Abstract

Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of the prediction attempts will occur with frequently-occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull generated text being produced by a downstream consumer of a language model. To address this, we propose two new intrinsic evaluation measures…
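The frequency confound described in the abstract can be illustrated with a small sketch. This is not one of the paper's proposed measures (the abstract is truncated before they are named); it only splits ordinary token accuracy by how often the target word appears in training data. The function name `accuracy_by_frequency`, the `threshold` parameter, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def accuracy_by_frequency(train_tokens, targets, predictions, threshold=2):
    """Token accuracy overall and split by training frequency of the target.

    Hypothetical helper: a target counts as "frequent" if it occurs at least
    `threshold` times in `train_tokens`, otherwise as "rare".
    """
    counts = Counter(train_tokens)
    hits = {"overall": 0, "frequent": 0, "rare": 0}
    totals = {"overall": 0, "frequent": 0, "rare": 0}
    for target, predicted in zip(targets, predictions):
        bucket = "frequent" if counts[target] >= threshold else "rare"
        correct = int(target == predicted)
        # Every prediction counts toward the overall score and its own bucket.
        for key in ("overall", bucket):
            hits[key] += correct
            totals[key] += 1
    return {
        k: (hits[k] / totals[k] if totals[k] else float("nan")) for k in hits
    }


# Toy data (assumed): a degenerate "model" that always predicts "the".
train_tokens = "the cat sat on the mat the dog sat on the rug".split()
targets = "the cat sat on the rug".split()
predictions = ["the"] * len(targets)

scores = accuracy_by_frequency(train_tokens, targets, predictions)
print(scores)
```

In this toy setup the overall score is carried entirely by one very common word, while accuracy on rare targets is zero. A single aggregate number hides that gap, which is the kind of blind spot the abstract attributes to accuracy-tied statistics.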

Code & Models

Repositories

shiranD/word_level_evaluation (PyTorch, official)
