A Literature Study of Embeddings on Source Code
Zimin Chen, Martin Monperrus

TL;DR
This survey reviews the application of word embedding techniques to source code across various granularities, highlighting successful uses and future potential in code analysis and understanding.
Contribution
It categorizes existing work on code embeddings, provides experimental data links, and visualizes code embeddings, offering a comprehensive overview of current research.
Findings
Word embeddings are successfully applied to different code granularities.
Code embeddings enable visualization and analysis of source code.
NLP techniques hold further potential for source code understanding.
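To make the token-level case concrete, here is a minimal sketch of a count-based token embedding. It is not word2vec and not any method from the surveyed papers: each token is represented by counts of its neighbouring tokens, and tokens are compared by cosine similarity. The toy corpus `SNIPPETS`, the window size, and the helper names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, already split into space-separated tokens.
# Real studies train on large open-source code bases.
SNIPPETS = [
    "for i in range ( n ) : total += i",
    "for j in range ( m ) : count += j",
    "while i < n : total += i",
    "while j < m : count += j",
]


def cooccurrence_vectors(snippets, window=2):
    """Map each token to a Counter of tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for snippet in snippets:
        tokens = snippet.split()
        for pos, token in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for k in range(lo, hi):
                if k != pos:
                    vectors[token][tokens[k]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = cooccurrence_vectors(SNIPPETS)

# Tokens used in similar contexts should score higher than unrelated ones.
sim_accumulators = cosine(vectors["total"], vectors["count"])
sim_unrelated = cosine(vectors["total"], vectors["for"])
print(sim_accumulators > sim_unrelated)
```

In this toy corpus the two accumulator variables share most of their context and so end up closer to each other than to the `for` keyword. Learned embeddings such as word2vec replace the raw counts with dense vectors trained to predict context, but the underlying intuition is the same.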
Abstract
Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied to source code with encouraging results. In this survey, we aim to collect and discuss the usage of word embedding techniques on programs and source code. The articles in this survey have been collected by asking authors of related work and through an extensive search on Google Scholar. Each article is assigned to one of five categories: (1) embedding of tokens, (2) embedding of functions or methods, (3) embedding of sequences or sets of method calls, (4) embedding of binary code, and (5) other embeddings. We also provide links to experimental data and show some remarkable visualizations of code embeddings. In summary, word embedding has been successfully applied to different granularities of source code. With access to countless open-source…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques
