CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou

TL;DR
CodeBERT is a bimodal pre-trained Transformer model that learns representations for programming and natural languages, enabling improved performance on code search and documentation tasks, and demonstrating knowledge transfer in zero-shot settings.
Contribution
It introduces CodeBERT, a novel bimodal pre-trained model for programming and natural languages, with a hybrid training objective that leverages both bimodal and unimodal data.
Findings
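Achieves state-of-the-art results on natural language code search and code documentation generation after fine-tuning.
Outperforms previous pre-trained models on NL-PL probing in a zero-shot setting, where model parameters are kept fixed.
Trains with a hybrid objective that includes replaced token detection, so both bimodal NL-PL pairs and unimodal data can be used.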
Abstract
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code…
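The replaced token detection task mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`corrupt`, `rtd_labels`, `rtd_loss`), the toy lookup-table generator, and the hand-written discriminator probabilities are all assumptions made for illustration. The sketch only shows the labeling idea: some positions are filled with generator samples, each token is labeled as original or replaced, and a discriminator is scored with binary cross-entropy over all positions.

```python
import math

def corrupt(tokens, masked_positions, generator_sample):
    """Replace the tokens at the masked positions with generator samples."""
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = generator_sample(tokens, i)
    return corrupted

def rtd_labels(original, corrupted):
    """1 = token matches the original, 0 = token was replaced.

    A sampled token that happens to equal the original counts as original.
    """
    return [1 if o == c else 0 for o, c in zip(original, corrupted)]

def rtd_loss(labels, p_original):
    """Binary cross-entropy averaged over every position.

    p_original[i] is the (hypothetical) discriminator's probability that
    token i is the original one.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for y, p in zip(labels, p_original):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)

# Toy NL-PL pair: a short description followed by code tokens.
tokens = ["return", "the", "max", "def", "max", "(", "a", ",", "b", ")"]

# Hypothetical generator: a fixed lookup table instead of a learned model.
replacements = {2: "min", 6: "a"}
sample = lambda toks, i: replacements[i]

corrupted = corrupt(tokens, masked_positions=[2, 6], generator_sample=sample)
labels = rtd_labels(tokens, corrupted)

# A discriminator that is confident and correct gets a lower loss
# than one that outputs 0.5 everywhere.
confident = [0.9 if y == 1 else 0.1 for y in labels]
uninformed = [0.5] * len(labels)
print(labels)
print(rtd_loss(labels, confident) < rtd_loss(labels, uninformed))
```

Position 6 shows why labels are computed by comparing tokens rather than by tracking which positions were masked: the generator sampled the original token there, so it is still labeled as original.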
Code & Models
- 🤗 Salesforce/codet5-base-multi-sum · model · 720 dl · ♡ 32
- 🤗 microsoft/codebert-base-mlm · model · 12k dl · ♡ 47
- 🤗 microsoft/codebert-base · model · 246k dl · ♡ 283
- 🤗 mrm8488/codebert-base-finetuned-detect-insecure-code · model · 861 dl · ♡ 33
- 🤗 claudios/codebert-base · model · 22 dl
- 🤗 claudios/codebert-base-mlm · model · 4 dl
- 🤗 Santiago-ampudia/codet5-base-multi-sum · model · 1 dl
- 🤗 TheFatBlue/codebert-finetuned-poisoned · model · 13 dl
- 🤗 onnx-community/codebert-base-ONNX · model · 28 dl
- 🤗 AfricaKing/codeBERT · model · 12 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
Methods: CodeBERT
