Using LSTMs to Model the Java Programming Language

Brendon Boldt

arXiv:1908.11685·cs.SE·September 2, 2019

TL;DR

This paper shows that LSTMs can model Java source code effectively, predicting the next token more accurately than the same LSTMs predict the next word of English text, with potential applications in code synthesis and bug fixing.

Contribution

The study shows that LSTMs trained on Java code predict the next element of a sequence more accurately than LSTMs trained on natural language, highlighting their potential for programming language modeling.

Findings

01

LSTMs achieved perplexity under 22 on Java code

02

LSTMs achieved accuracy above 0.47 on Java code

03

Both results improve on the same LSTMs' performance on English text (perplexity 85, accuracy 0.27)
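
For readers unfamiliar with the two metrics in these findings, the following is a minimal sketch of how perplexity and top-1 accuracy are conventionally computed for next-token prediction. It is not the paper's evaluation code: the function name `perplexity_and_accuracy` and the dictionary-based input format are assumptions made for illustration.

```python
import math


def perplexity_and_accuracy(predictions, targets):
    """Compute perplexity and top-1 accuracy for next-token prediction.

    predictions: list of dicts mapping each candidate token to its
                 predicted probability (hypothetical input format).
    targets:     list of the tokens that actually came next.
    """
    total_nll = 0.0  # summed negative log-likelihood of the true tokens
    correct = 0      # number of positions where the top guess was right
    for dist, target in zip(predictions, targets):
        # Floor the probability so a missing token does not give log(0).
        p = max(dist.get(target, 0.0), 1e-12)
        total_nll += -math.log(p)
        if max(dist, key=dist.get) == target:
            correct += 1
    n = len(targets)
    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(total_nll / n), correct / n
```

Lower perplexity is better: a model that spreads its probability evenly over k candidates at every step has a perplexity of exactly k, so a perplexity under 22 roughly corresponds to choosing among fewer than 22 equally likely options.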

Abstract

Recurrent neural networks (RNNs), specifically long short-term memory networks (LSTMs), can model natural language effectively. This research investigates the ability of these same LSTMs to perform next "word" prediction on the Java programming language. Java source code from four different repositories undergoes a transformation that preserves the logical structure of the source code and removes the code's various specificities such as variable names and literal values. Such datasets and an additional English language corpus are used to train and test standard LSTMs' ability to predict the next element in a sequence. Results suggest that LSTMs can effectively model Java code, achieving perplexities under 22 and accuracies above 0.47, an improvement over the LSTMs' performance on the English language, where they reached a perplexity of 85 and an accuracy of 0.27. This research can…
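
The abstract states only that the transformation keeps the code's logical structure while removing variable names and literal values. One way such a normalization could look is sketched below. Everything specific here is an assumption rather than the paper's method: the regex tokenizer, the placeholder tokens `<ID>`, `<NUM>`, and `<STR>`, and the deliberately small keyword subset.

```python
import re

# Small illustrative subset of Java keywords (assumption, not exhaustive).
JAVA_KEYWORDS = {
    "class", "public", "private", "static", "void", "int",
    "return", "if", "else", "for", "while", "new",
}

# Order matters: string literals first, then numbers, words, single symbols.
TOKEN_RE = re.compile(r'''
    (?P<STR>"(?:\\.|[^"\\])*")
  | (?P<NUM>\d+(?:\.\d+)?)
  | (?P<WORD>[A-Za-z_]\w*)
  | (?P<SYM>[^\sA-Za-z_0-9])
''', re.VERBOSE)


def normalize(source):
    """Turn Java source text into a token list with specifics removed."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "STR":
            tokens.append("<STR>")   # string literal value dropped
        elif kind == "NUM":
            tokens.append("<NUM>")   # numeric literal value dropped
        elif kind == "WORD":
            # Keywords carry structure; every other word is an identifier.
            tokens.append(text if text in JAVA_KEYWORDS else "<ID>")
        else:
            tokens.append(text)      # operators and punctuation kept as-is
    return tokens
```

Under these assumptions, `normalize('int count = 42;')` yields `['int', '<ID>', '=', '<NUM>', ';']`. Collapsing names and literals this way shrinks the vocabulary considerably, which is worth keeping in mind when comparing perplexity on the transformed Java data against perplexity on English text.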
