Exploring the Naturalness of Buggy Code with Recurrent Neural Networks

Jack Lanchantin; Ji Gao

arXiv:1803.08793 · cs.SE · March 26, 2018 · 1 citation

TL;DR

This paper investigates the use of LSTM recurrent neural networks to model source code and detect buggy lines by measuring entropy, showing slight improvements over traditional n-gram models.

Contribution

It applies LSTM language models to entropy-based detection of buggy lines in source code and reports a slight improvement over an n-gram model on the buggy line classification task, measured by AUC.

Findings

01

LSTM models slightly outperform n-gram models in bug detection.

02

Buggy lines are classified by their entropy under the language model: high-entropy ("unnatural") lines are the ones flagged.

03

LSTM captures longer-range dependencies in code.
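The entropy-based scoring named in the findings can be illustrated with a small sketch. It is not the authors' implementation: it uses a bigram model with add-one smoothing as a stand-in for the language model (the paper's proposed model is an LSTM), assumes whitespace tokenization, and the names `train_bigram` and `line_entropy` are hypothetical. A line's score is the average negative log2 probability of its tokens, so a higher score means a less "natural" line.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical boundary markers


def train_bigram(lines):
    """Count unigrams and bigrams over whitespace-tokenized lines (an assumption)."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        tokens = [START] + line.split() + [END]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def line_entropy(line, unigrams, bigrams):
    """Average negative log2 probability per token under the bigram model."""
    tokens = [START] + line.split() + [END]
    vocab_size = len(unigrams) + 1  # one extra slot for unseen tokens
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing so unseen bigrams get nonzero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        total -= math.log2(p)
    return total / (len(tokens) - 1)


corpus = [
    "for i in range ( n ) :",
    "x = x + 1",
    "for j in range ( m ) :",
]
unigrams, bigrams = train_bigram(corpus)

natural = line_entropy("for i in range ( n ) :", unigrams, bigrams)
shuffled = line_entropy("range for ) ( : in i n", unigrams, bigrams)
print(natural < shuffled)  # the shuffled line scores as less natural
```

Swapping in a different language model only changes how the per-token probability `p` is computed; the line-level score is built the same way.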

Abstract

Statistical language models are powerful tools which have been used for many tasks within natural language processing. Recently, they have been used for other sequential data such as source code. (Ray et al., 2015) showed that it is possible to train an n-gram source code language model, and use it to predict buggy lines in code by determining "unnatural" lines via entropy with respect to the language model. In this work, we propose using a more advanced language modeling technique, Long Short-term Memory recurrent neural networks, to model source code and classify buggy lines based on entropy. We show that our method slightly outperforms an n-gram model in the buggy line classification task using AUC.
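The abstract compares models by AUC on the buggy line classification task. As a minimal sketch of what that metric measures, AUC can be computed as the probability that a randomly chosen buggy line receives a higher entropy score than a randomly chosen clean line, with ties counted as one half. The function name `auc` and the toy scores and labels below are illustrative assumptions, not data or code from the paper.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a buggy line > score of a clean line).

    scores: per-line entropy scores (higher = more suspicious).
    labels: 1 for buggy lines, 0 for clean lines.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one buggy and one clean line")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example with made-up entropy scores.
scores = [3.2, 1.1, 2.5, 0.9]
labels = [1, 0, 0, 1]
print(auc(scores, labels))
```

An AUC of 0.5 corresponds to a ranking no better than chance, and 1.0 to every buggy line scoring above every clean line. The pairwise loop is quadratic and chosen for clarity; a sort-based computation would be preferable on large data.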

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling