Complexity-based code embeddings
Rares Folea, Radu Iacob, Emil Slusanschi, Traian Rebedea

TL;DR
This paper introduces a method to convert source code into numerical embeddings using complexity analysis, enabling improved machine learning performance on code classification tasks.
Contribution
It proposes a novel complexity-based code embedding technique and demonstrates its effectiveness with an XGBoost classifier on real-world programming competition data.
Findings
Achieved a high F1-score on multi-label code classification
Demonstrated the effectiveness of complexity-based embeddings
Provided a general framework for code representation
Abstract
This paper presents a generic method for transforming the source code of various algorithms into numerical embeddings, by dynamically analysing the behaviour of computer programs against different inputs and by tailoring multiple generic complexity functions to the analysed metrics. The resulting algorithm embeddings are based on r-Complexity. Using the proposed code embeddings, we present an implementation of the XGBoost algorithm that achieves an average F1-score on a multi-label dataset with 11 classes, built using real-world code snippets submitted for programming competitions on the Codeforces platform.
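The idea in the abstract (run a program on inputs of growing size, record a metric, fit generic complexity functions to it) can be illustrated with a small sketch. This is not the authors' implementation: the candidate function set, the use of a simple operation counter as the measured metric, and the names `CANDIDATES`, `fit_constant`, `embed`, and `best_fit` are all assumptions made for illustration. Each candidate g(n) gets a least-squares constant r such that metric ~= r * g(n), and the embedding is the list of fitted constants and fit errors.

```python
import math

# Hypothetical candidate complexity functions g(n); the paper's actual set
# is not given in this summary.
CANDIDATES = {
    "log n": lambda n: math.log2(n),
    "n": lambda n: float(n),
    "n log n": lambda n: n * math.log2(n),
    "n^2": lambda n: float(n * n),
}

def bubble_sort_ops(n):
    """Count comparisons made by bubble sort on a reversed list of length n."""
    data = list(range(n, 0, -1))
    ops = 0
    for i in range(n):
        for j in range(n - 1 - i):
            ops += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return ops

def binary_search_ops(n):
    """Count loop iterations of a binary search for a key absent from range(n)."""
    lo, hi, ops = 0, n - 1, 0
    target = n  # larger than every element, so the search always fails
    while lo <= hi:
        ops += 1
        mid = (lo + hi) // 2
        if mid < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return ops

def fit_constant(sizes, values, g):
    """Return (r, err): least-squares r for values ~= r * g(n), and relative RMS error."""
    gs = [g(n) for n in sizes]
    r = sum(v * x for v, x in zip(values, gs)) / sum(x * x for x in gs)
    residual = sum((v - r * x) ** 2 for v, x in zip(values, gs))
    err = math.sqrt(residual / sum(v * v for v in values))
    return r, err

def embed(program, sizes=(16, 32, 64, 128, 256)):
    """Embedding sketch: [r, err] for every candidate, concatenated in order."""
    values = [program(n) for n in sizes]
    vec = []
    for g in CANDIDATES.values():
        vec.extend(fit_constant(sizes, values, g))
    return vec

def best_fit(program, sizes=(16, 32, 64, 128, 256)):
    """Name of the candidate with the smallest relative fit error."""
    values = [program(n) for n in sizes]
    return min(CANDIDATES, key=lambda name: fit_constant(sizes, values, CANDIDATES[name])[1])

if __name__ == "__main__":
    for prog in (bubble_sort_ops, binary_search_ops):
        print(prog.__name__, best_fit(prog), embed(prog))
```

Vectors like these could then be passed as feature rows to a classifier such as XGBoost. A real pipeline would presumably measure richer runtime metrics than a single counter and fit more candidate functions per metric.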
Taxonomy
Topics: Machine Learning and Data Classification · Software Engineering Research · Parallel Computing and Optimization Techniques
