CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Shuyan Zhou; Uri Alon; Sumit Agarwal; Graham Neubig

arXiv:2302.05527 · cs.SE · November 1, 2023 · 6 cites

Open Access · 1 Repo · 6 Models

TL;DR

CodeBERTScore is an evaluation metric for code generation that encodes the natural language input together with the generated code. Across four programming languages, it correlates more strongly with human preference and with functional correctness than existing metrics.

Contribution

We introduce CodeBERTScore, a metric that builds on BERTScore and models the consistency between the generated code and its natural language context, achieving higher correlation with human preference and functional correctness than existing metrics.

Findings

01

CodeBERTScore correlates better with human preferences than existing metrics.

02

It also correlates more strongly with the functional correctness of generated code than existing metrics.

03

Our models are widely adopted, with over 1 million downloads.

Abstract

Since the rise of neural natural-language-to-code models (NL->Code) that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the natural language input preceding the generated code, thus modeling the consistency between the generated code and its given natural language context as well. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by…
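The matching step that the abstract inherits from BERTScore can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token vectors are random toy arrays standing in for encoder output, `drop_context` is a hypothetical helper, and removing the context vectors before matching is an assumption that the abstract does not state. In a real contextual encoder the natural language context would also shape the code-token vectors, which this toy example does not reproduce.

```python
import numpy as np


def greedy_match_scores(ref_vecs, cand_vecs):
    """Return BERTScore-style (precision, recall, F1) for two sets of token vectors."""
    # Normalise rows so that a dot product equals cosine similarity.
    ref = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    cand = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sim = cand @ ref.T  # shape: (candidate tokens, reference tokens)
    # Precision: best reference match for each candidate token, averaged.
    precision = float(sim.max(axis=1).mean())
    # Recall: best candidate match for each reference token, averaged.
    recall = float(sim.max(axis=0).mean())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def drop_context(vectors, n_context_tokens):
    """Keep only the code-token vectors (hypothetical helper, assumed step)."""
    return vectors[n_context_tokens:]


# Toy stand-ins for encoder output: 3 context tokens followed by code tokens.
# Absolute values keep every cosine similarity positive in this example.
rng = np.random.default_rng(0)
context = np.abs(rng.normal(size=(3, 8)))
ref_code = np.abs(rng.normal(size=(5, 8)))
cand_code = np.abs(rng.normal(size=(4, 8)))

ref_encoded = np.vstack([context, ref_code])
cand_encoded = np.vstack([context, cand_code])

p, r, f1 = greedy_match_scores(
    drop_context(ref_encoded, 3), drop_context(cand_encoded, 3)
)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Scoring a set of vectors against itself yields a precision, recall, and F1 of 1, which is a quick sanity check on the matching logic.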

Code & Models

Repositories

neulab/code-bert-score
PyTorch · Official

Models

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Machine Learning in Materials Science

Methods: Balanced Selection · CodeBERT