Bug Prediction Using Source Code Embedding Based on Doc2Vec
Tamás Aladics, Judit Jász, Rudolf Ferenc

TL;DR
This paper introduces a source code embedding method based on Doc2Vec and ASTs, demonstrating improved bug prediction accuracy over traditional code metrics across various machine learning models.
Contribution
It presents a novel source code representation using AST-based Doc2Vec embeddings for bug prediction, outperforming metric-based features.
Findings
Embedding improves bug prediction accuracy in most cases
Embedding is at least as effective as code metrics alone
Various machine learning models benefit from the embedding
Abstract
Bug prediction is a resource-demanding task that is hard to automate using static source code analysis. In many fields of computer science, machine learning has proven extremely useful for tasks like this; however, for it to work we need a way to use source code as input. We propose a simple but meaningful representation for source code based on its abstract syntax tree and the Doc2Vec embedding algorithm. This representation maps the source code to a fixed-length vector that can be used for various downstream tasks, one of which is bug prediction. We measured this approach's validity by itself and its effectiveness compared to bug prediction based solely on code metrics. We also experimented with numerous machine learning approaches to check the connection between different embedding parameters and different machine learning models. Our results show that this representation…
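The first step the abstract describes, turning source code into something a document-embedding algorithm can consume, can be illustrated with a minimal sketch. It assumes Python's built-in `ast` module as a stand-in parser and a depth-first listing of node type names as the "document". The function name `ast_token_sequence` and the traversal order are illustrative assumptions, not the authors' pipeline. The Doc2Vec training step itself is not shown.

```python
import ast
from typing import List


def ast_token_sequence(source: str) -> List[str]:
    """Flatten a Python AST into a pre-order list of node type names.

    Hypothetical helper: the resulting token list plays the role of a
    'document' that an embedding algorithm such as Doc2Vec could consume.
    """
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        # Record the node type, then descend into children in field order.
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


snippet = "def add(a, b):\n    return a + b\n"
tokens = ast_token_sequence(snippet)
print(tokens)
```

Each source unit then becomes one token list. A Doc2Vec implementation trained on many such lists would map each one to a fixed-length vector, which can serve as the feature input to a bug-prediction classifier.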
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Advanced Malware Detection Techniques
