BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi

TL;DR
BERTScore is an automatic evaluation metric for text generation that uses contextual embeddings to compute token similarities. It correlates better with human judgments and is more robust than existing metrics.
Contribution
The paper introduces BERTScore, an evaluation metric that uses BERT's contextual embeddings to assess text generation quality.
Findings
BERTScore correlates better with human judgments than existing metrics.
BERTScore provides stronger model selection performance than existing metrics.
BERTScore is more robust to adversarial paraphrases.
Abstract
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
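The abstract describes scoring each candidate token against each reference token by embedding similarity, but does not say how those pairwise scores are combined. The sketch below assumes cosine similarity and greedy matching (each token is paired with its most similar counterpart on the other side), aggregated into precision, recall, and F1. Toy 2-D vectors stand in for real BERT embeddings, and the function names `cosine` and `greedy_match_score` are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(cand_emb, ref_emb):
    """Return (precision, recall, f1) from greedy token matching.

    cand_emb, ref_emb: lists of token vectors (assumed non-empty, non-zero).
    """
    # Pairwise similarity: rows are candidate tokens, columns are reference tokens.
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]

    # Precision: each candidate token takes its best-matching reference token.
    precision = sum(max(row) for row in sim) / len(cand_emb)

    # Recall: each reference token takes its best-matching candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_emb)))
        for j in range(len(ref_emb))
    ) / len(ref_emb)

    total = precision + recall
    f1 = 2 * precision * recall / total if total != 0 else 0.0
    return precision, recall, f1


# Toy stand-ins for contextual embeddings (assumption: real use would
# take these vectors from a pretrained BERT model).
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = greedy_match_score(candidate, reference)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In this toy case every reference token has an exact counterpart in the candidate, so recall is perfect, while the extra candidate token only partially matches and pulls precision down.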
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
