Learning and Evaluating Contextual Embedding of Source Code

Aditya Kanade; Petros Maniatis; Gogul Balakrishnan; Kensen Shi

arXiv:2001.00059 · cs.SE · August 19, 2020 · 156 citations

TL;DR

This paper introduces CuBERT, a BERT-based contextual embedding model for source code, pre-trained on a deduplicated corpus of 7.4M Python files from GitHub. It evaluates CuBERT on a benchmark of program-understanding tasks, where it outperforms the models it is compared against.

Contribution

The paper contributes CuBERT, an open-sourced pre-trained contextual embedding of Python source code, and an open-sourced benchmark for evaluating source-code understanding models on multiple tasks.

Findings

1. CuBERT outperforms existing models on multiple program-understanding tasks.

2. Pre-training on a large code corpus improves accuracy on downstream program-understanding tasks.

3. Fine-tuning CuBERT requires fewer labeled examples.

Abstract

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises…
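
As a rough illustration of what BERT-style pre-training on code involves, the following sketch builds one masked-language-modeling example from a Python snippet. It assumes the standard BERT masking recipe (select about 15% of tokens; of those, replace 80% with a mask symbol, 10% with a random token, and leave 10% unchanged), and it uses Python's standard-library `tokenize` module as a stand-in for CuBERT's own tokenizer and subword vocabulary. The names `python_tokens`, `mask_tokens`, and `MASK` are hypothetical helpers for this sketch, not part of the released CuBERT code.

```python
import io
import random
import tokenize

MASK = "[MASK]"  # hypothetical mask symbol for this sketch


def python_tokens(source: str) -> list[str]:
    """Split Python source into token strings, dropping layout-only tokens."""
    skip = {
        tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
    }
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for one masked-language-modeling example.

    labels[i] holds the original token at positions the model must predict,
    and None everywhere else.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    toks = python_tokens(snippet)
    inputs, labels = mask_tokens(toks, vocab=sorted(set(toks)), mask_prob=0.3)
    print(inputs)
    print(labels)
```

A model pre-trained this way is asked to recover the non-`None` entries of `labels` from `inputs`; fine-tuning then reuses the learned representation for a downstream task with a small task-specific output layer.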

Code & Models

Datasets

claudios/cubert_ETHPy150Open
dataset · 59 downloads

Videos

Learning and Evaluating Contextual Embedding of Source Code · SlidesLive

Taxonomy

Topics: Software Engineering Research · Software Reliability and Analysis Research · Advanced Malware Detection Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · CuBERT · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Label Smoothing · Bidirectional LSTM · Byte Pair Encoding