Splitting source code identifiers using Bidirectional LSTM Recurrent Neural Network
Vadim Markovtsev, Waren Long, Egor Bulychev, Romain Keramitas, Konstantin Slavnov, Gabor Markowski

TL;DR
This paper presents a bidirectional LSTM recurrent neural network, trained on 41.7 million source code identifiers, that splits identifiers into subtokens and can be used to improve upstream models built on those identifiers.
Contribution
The paper introduces a bidirectional LSTM recurrent neural network that detects subtokens in source code identifiers, trains it on a large dataset drawn from Public Git Archive, and shows that it outperforms several other machine learning models.
Findings
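Outperforms several other machine learning models at splitting identifiers into subtokens.
Trained on 41.7 million distinct splittable identifiers collected from 182,014 open source projects in Public Git Archive.
Multi-part identifiers account for 97% of all naming tokens in Public Git Archive.
Can be used to improve upstream models that are based on source code identifiers.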
Abstract
Programmers make rich use of natural language in the source code they write through identifiers and comments. Source code identifiers are selected from a pool of tokens which are strongly related to the meaning, naming conventions, and context. These tokens are often combined to produce more precise and obvious designations. Such multi-part identifiers account for 97% of all naming tokens in the Public Git Archive - the largest dataset of Git repositories to date. We introduce a bidirectional LSTM recurrent neural network to detect subtokens in source code identifiers. We trained that network on 41.7 million distinct splittable identifiers collected from 182,014 open source projects in Public Git Archive, and show that it outperforms several other machine learning models. The proposed network can be used to improve the upstream models which are based on source code identifiers, as well as…
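To make the task concrete, here is a minimal sketch of one way "detecting subtokens" can be framed as per-character labeling, which is the kind of target a sequence model such as a bidirectional LSTM could be trained to predict. The framing itself (a 1 on the last character of every subtoken that is followed by another subtoken) and the helper names `boundary_labels` and `split_by_labels` are assumptions made for illustration; the abstract does not state the exact encoding the authors use.

```python
from typing import List


def boundary_labels(subtokens: List[str]) -> List[int]:
    """Encode a known split as one label per character.

    Assumed encoding (not confirmed by the abstract): 1 marks the last
    character of a subtoken that is followed by another subtoken, 0 otherwise.
    """
    if any(len(tok) == 0 for tok in subtokens):
        raise ValueError("subtokens must be non-empty")
    labels: List[int] = []
    for index, tok in enumerate(subtokens):
        labels.extend([0] * len(tok))
        if index < len(subtokens) - 1:
            labels[-1] = 1  # a split follows this character
    return labels


def split_by_labels(identifier: str, labels: List[int]) -> List[str]:
    """Decode per-character labels (e.g. thresholded model outputs) into subtokens."""
    if len(identifier) != len(labels):
        raise ValueError("need exactly one label per character")
    parts: List[str] = []
    start = 0
    for position, flag in enumerate(labels):
        if flag == 1:
            parts.append(identifier[start:position + 1])
            start = position + 1
    if start < len(identifier):
        parts.append(identifier[start:])  # trailing subtoken
    return parts


# An identifier with no case or underscore hints, where heuristics fail.
labels = boundary_labels(["foo", "bar"])
print(labels)                             # [0, 0, 1, 0, 0, 0]
print(split_by_labels("foobar", labels))  # ['foo', 'bar']
```

Under this assumed encoding, training pairs could be derived from identifiers whose splits are already known, and a trained model would only need to emit one probability per character, which is then thresholded and decoded with `split_by_labels`.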
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Web Data Mining and Analysis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
