CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
Shuyan Zhou, Uri Alon, Sumit Agarwal, Graham Neubig

TL;DR
CodeBERTScore is a new evaluation metric for code generation that considers the natural language input and code output, showing higher correlation with human preferences and correctness across multiple programming languages.
Contribution
We introduce CodeBERTScore, a metric that evaluates generated code by modeling its consistency with the natural language prompt, and that correlates more strongly with human preference and functional correctness than existing metrics.
Findings
CodeBERTScore correlates better with human preferences than existing metrics.
It also aligns more closely with the functional correctness of generated code than existing metrics do.
Our models are widely adopted, with over 1 million downloads.
Abstract
Since the rise of neural natural-language-to-code models (NL->Code) that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the natural language input preceding the generated code, thus modeling the consistency between the generated code and its given natural language context as well. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by…
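The scoring idea described in the abstract can be illustrated with a minimal sketch. It assumes BERTScore-style greedy matching: each candidate token vector is matched to its most similar reference token vector by cosine similarity to get precision, the reverse direction gives recall, and the two are combined into F1. The token vectors here are hand-written toy values; in practice they would come from a pretrained model of code that encodes the natural language context together with the code. Dropping the context vectors before matching is an assumption of this sketch (the abstract only says the context is encoded), and the names `greedy_match_f1` and `score_with_context` are hypothetical, not the authors' API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(cand_vecs, ref_vecs):
    """BERTScore-style precision, recall, and F1 over token vectors."""
    if not cand_vecs or not ref_vecs:
        return 0.0, 0.0, 0.0
    # sim[i][j] = similarity of candidate token i to reference token j
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token takes its best reference match.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token takes its best candidate match.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs)))
        for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def score_with_context(cand_full, ref_full, n_context):
    """Score code tokens only, given vectors for context + code.

    Assumption of this sketch: the first n_context vectors belong to the
    natural language context. They influenced the encoding upstream, but
    are excluded from the matching itself.
    """
    return greedy_match_f1(cand_full[n_context:], ref_full[n_context:])


# Toy vectors: one context token followed by two code tokens.
context = [[1.0, 0.0]]
candidate = context + [[0.0, 1.0], [1.0, 1.0]]
reference = context + [[0.0, 1.0], [1.0, 0.0]]

precision, recall, f1 = score_with_context(candidate, reference, n_context=1)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

With real contextual embeddings, the same code snippet would receive different vectors under different prompts, which is how encoding the context can affect the score even though only code tokens are matched.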
Code & Models
- 🤗 neulab/codebert-python · model · 96k downloads · ♡ 26
- 🤗 neulab/codebert-javascript · model · 201 downloads · ♡ 15
- 🤗 neulab/codebert-c · model · 18k downloads · ♡ 6
- 🤗 neulab/codebert-cpp · model · 1.1k downloads · ♡ 11
- 🤗 neulab/codebert-java · model · 55k downloads · ♡ 13
- 🤗 onnx-community/codebert-javascript-ONNX · model · 1 download
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Machine Learning in Materials Science
Methods: Balanced Selection · CodeBERT
