Learning and Evaluating Contextual Embedding of Source Code
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

TL;DR
This paper introduces CuBERT, a pre-trained contextual embedding model for source code, trained on a large Python corpus, and evaluates it across multiple program-understanding tasks, demonstrating superior performance over existing models.
Contribution
The paper develops CuBERT, a high-quality pre-trained code embedding, and provides a comprehensive benchmark for evaluating source code understanding models.
Findings
CuBERT outperforms existing models on multiple tasks.
Pre-training on a large code corpus improves accuracy on program-understanding tasks.
CuBERT needs fewer labeled examples for fine-tuning.
Abstract
Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises…
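The abstract mentions a deduplicated corpus but this excerpt does not say how duplicates were detected. As a minimal sketch, assuming the simplest case of exact duplicates after whitespace normalization (an assumption for illustration, not the authors' procedure), deduplication can be done by hashing each file's normalized content. The names `normalize`, `deduplicate`, and the sample `corpus` are hypothetical.

```python
import hashlib
from typing import Dict


def normalize(source: str) -> str:
    """Drop trailing whitespace and blank lines so trivial formatting
    differences do not make two files look distinct (assumed policy)."""
    lines = [line.rstrip() for line in source.splitlines()]
    return "\n".join(line for line in lines if line)


def deduplicate(files: Dict[str, str]) -> Dict[str, str]:
    """Keep the first file seen for each distinct normalized content."""
    seen = set()
    unique = {}
    for path, source in files.items():
        digest = hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[path] = source
    return unique


# Hypothetical three-file corpus: b.py differs from a.py only in whitespace.
corpus = {
    "a.py": "x = 1\n",
    "b.py": "x = 1   \n\n",
    "c.py": "y = 2\n",
}
kept = deduplicate(corpus)
```

Exact hashing only catches identical files after normalization. Near-duplicates, such as forks with small edits, would need a similarity measure instead, which this sketch does not attempt.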
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Advanced Malware Detection Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · CuBERT · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Label Smoothing · Bidirectional LSTM · Byte Pair Encoding
