Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
Rhys Compton, Eibe Frank, Panos Patros, Abigail Koay

TL;DR
This paper trains code2vec on Java code whose variable names have been obfuscated, which makes the resulting embeddings more robust to naming variations and better at capturing code semantics. It also extends code2vec from individual methods to whole classes, enabling class-level analysis.
Contribution
It introduces variable obfuscation during training and a method to create class-level embeddings, addressing two key limitations of code2vec: its reliance on variable names and its restriction to individual methods.
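To make the obfuscation idea concrete, here is a minimal sketch of renaming variables in a Java snippet before it is fed to an embedding model. It is not the authors' tooling: the function name `obfuscate_variables`, the regex-based detection of declarations, the short list of types, and the `var0`, `var1`, ... naming scheme are all assumptions made for illustration. A real pipeline would use a Java parser rather than regular expressions.

```python
import re

# Assumption: a tiny, hypothetical list of Java types; a real tool would parse the code.
DECLARATION = re.compile(
    r"\b(?:int|long|double|float|boolean|char|String)\s+([A-Za-z_]\w*)\s*(?==|;|,|\))"
)


def obfuscate_variables(java_source: str) -> str:
    """Rename declared variables and parameters to var0, var1, ... (illustrative only)."""
    mapping = {}
    for match in DECLARATION.finditer(java_source):
        name = match.group(1)
        if name not in mapping:
            mapping[name] = f"var{len(mapping)}"
    if not mapping:
        return java_source
    # Replace whole-word occurrences only. Limitation of this sketch: identical
    # words inside string literals or comments are renamed too.
    names = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return names.sub(lambda m: mapping[m.group(1)], java_source)


snippet = "int sum(int first, int second) { int total = first + second; return total; }"
obfuscated = obfuscate_variables(snippet)
```

The method name `sum` is left untouched because it is followed by `(`, so only the parameters and the local variable are renamed. The structure of the code is unchanged, which is the information an obfuscation-trained model has to rely on.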
Findings
Obfuscating variable names enhances model robustness against naming variations.
Class-level embeddings can be effectively created by aggregating method embeddings.
Models trained on obfuscated code better reflect true code semantics and resist adversarial attacks.
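The class-level finding above can be illustrated with a short sketch that combines per-method vectors into one vector per class. The summary does not say which aggregation the paper uses, so the element-wise mean and max shown here are assumptions, as is the function name `class_embedding`. The method vectors are made-up placeholders, not model output.

```python
from typing import List, Sequence


def class_embedding(method_vectors: Sequence[Sequence[float]], how: str = "mean") -> List[float]:
    """Aggregate method embeddings element-wise into one class-level vector."""
    if not method_vectors:
        raise ValueError("need at least one method embedding")
    dim = len(method_vectors[0])
    if any(len(vec) != dim for vec in method_vectors):
        raise ValueError("all method embeddings must have the same length")
    columns = list(zip(*method_vectors))  # one tuple per embedding dimension
    if how == "mean":
        return [sum(col) / len(method_vectors) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown aggregation: {how}")


# Placeholder 2-dimensional vectors standing in for two method embeddings.
methods = [[1.0, 2.0], [3.0, 6.0]]
mean_vec = class_embedding(methods)          # [2.0, 4.0]
max_vec = class_embedding(methods, "max")    # [3.0, 6.0]
```

Either aggregation yields a fixed-length vector regardless of how many methods a class has, so standard ML models can consume it directly.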
Abstract
Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java methods to feature vectors. However, experimentation with code2vec shows that it learns to rely on variable names for prediction, causing it to be easily fooled by typos or adversarial attacks. Moreover, it is only able to embed individual Java methods and cannot embed an entire collection of methods such as those present in a typical Java class, making it difficult to perform…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research
