SCELMo: Source Code Embeddings from Language Models

Rafael-Michael Karampatsis; Charles Sutton

arXiv:2004.13214·cs.SE·April 29, 2020·34 cites

TL;DR

This paper introduces SCELMo, a set of deep contextualized embeddings for source code trained with the ELMo language-model framework, and shows that they improve a state-of-the-art machine learning system for bug detection.

Contribution

It applies ELMo-based contextual embeddings to source code and shows that, when fine-tuned for bug detection, they improve an existing machine learning bug detector.

Findings

01

Low-dimensional embeddings improve bug detection accuracy

02

Contextual embeddings outperform non-contextual ones

03

Effective even with small training corpora

Abstract

Continuous embeddings of tokens in computer programs have been used to support a variety of software development tools, including readability, code search, and program repair. Contextual embeddings are common in natural language processing but have not been previously applied in software engineering. We introduce a new set of deep contextualized word representations for computer programs based on language models. We train a set of embeddings using the ELMo (embeddings from language models) framework of Peters et al. (2018). We investigate whether these embeddings are effective when fine-tuned for the downstream task of bug detection. We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.
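
The abstract does not describe how an ELMo-style model turns its language-model layers into a single token embedding. As a rough illustration only, the sketch below follows the task-specific weighting described by Peters et al. (2018): each token's per-layer vectors are combined with softmax-normalized layer weights and scaled by a scalar. The function names (`softmax`, `scalar_mix`), the parameter names, and the toy 2-dimensional vectors are assumptions made for this example; they are not taken from the SCELMo implementation.

```python
import math


def softmax(scores):
    """Normalize raw layer scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_vectors, layer_scores, gamma=1.0):
    """Combine the per-layer vectors of ONE token into a single embedding.

    layer_vectors: one vector per language-model layer (all the same length).
    layer_scores:  one learnable raw score per layer (hypothetical values here).
    gamma:         a learnable scalar that rescales the mixed vector.
    """
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    mixed = [0.0] * dim
    for w, vec in zip(weights, layer_vectors):
        for i, x in enumerate(vec):
            mixed[i] += w * x
    return [gamma * x for x in mixed]


# Toy layer outputs for a single code token (made-up numbers).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give each layer weight 1/3, so both components are about 0.667.
embedding = scalar_mix(layers, layer_scores=[0.0, 0.0, 0.0])
print(embedding)
```

Because the per-layer vectors come from a language model that reads the surrounding tokens, the same identifier can receive a different mixed vector in each place it appears, which is what distinguishes a contextual embedding from a fixed lookup table. In a fine-tuning setup, `layer_scores` and `gamma` would be learned together with the downstream bug detector.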

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Natural Language Processing Techniques

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM · Softmax · ELMo