Semantic Source Code Models Using Identifier Embeddings
Vasiliki Efstathiou, Diomidis Spinellis

TL;DR
This paper introduces pretrained vector space models of source code for six programming languages (Java, Python, PHP, C, C++, and C#). The models are trained with fastText on large-scale open source data and capture the semantics of identifiers to support code search and reuse.
Contribution
It presents the first set of multilingual pretrained code models based on identifier embeddings, trained on extensive open source repositories, with detailed analysis of language-specific differences.
Findings
Models trained on code mined from over 13,000 open source projects in total, with one model per language
Identified differences between natural language and code semantics
Discussed potential applications and limitations of the models
Abstract
The emergence of online open source repositories in recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As a result, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code that can assist code search and reuse. To this end, we deliver distributed code representations, in the form of pretrained vector space models, for six popular programming languages: Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data…
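The excerpt above does not describe the preprocessing pipeline, so the following is only a rough sketch of the general idea behind subword-aware identifier embeddings, not the authors' method. It pulls identifier-like tokens from a snippet and lists the character n-grams that a fastText-style model would associate with each token (fastText wraps a word in `<` and `>` and, by default, uses n-grams of length 3 to 6). The helper names `extract_identifiers` and `char_ngrams` and the regular expression are assumptions made for illustration; the regex is deliberately naive and also matches language keywords such as `int`.

```python
import re

# Hypothetical, deliberately simple pattern for identifier-like tokens.
# It also matches keywords (e.g. "int"); a real pipeline would use a lexer.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def extract_identifiers(code):
    """Return identifier-like tokens in order of appearance."""
    return IDENTIFIER.findall(code)


def char_ngrams(token, min_n=3, max_n=6):
    """List character n-grams of a token wrapped in '<' and '>' markers."""
    wrapped = f"<{token}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


snippet = "int fileCount = readFile(path);"
tokens = extract_identifiers(snippet)

# Identifiers that share subwords share n-grams, which is what lets a
# subword model relate tokens it has rarely or never seen as whole words.
shared = set(char_ngrams("fileCount")) & set(char_ngrams("readFile"))
```

In this sketch `fileCount` and `readFile` overlap only on case-sensitive n-grams, which suggests why normalizing case or splitting camelCase before training can matter; whether the published models do so is not stated in the excerpt.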
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Topic Modeling
