SCC: Automatic Classification of Code Snippets
Kamel Alreshedy, Dhanush Dharmaretnam, Daniel M. German, Venkatesh Srinivasan, T. Aaron Gulliver

TL;DR
This paper introduces SCC, a machine learning-based classifier that accurately identifies the programming language of code snippets across 21 languages, outperforming existing online classifiers and distinguishing language variants.
Contribution
The paper presents an ML approach to snippet classification: a Multinomial Naive Bayes (MNB) classifier trained on Stack Overflow posts, which achieves higher accuracy than the proprietary PLI classifier.
Findings
Achieved 75% accuracy in classifying 21 programming languages.
Outperformed the proprietary PLI classifier, which achieved only 55.5% accuracy.
Can distinguish between language variants, such as different versions of C#.
Abstract
Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that can identify the programming language of code snippets written in 21 different programming languages. A Multinomial Naive Bayes (MNB) classifier is employed which is trained using Stack Overflow posts. It is shown to achieve an accuracy of 75% which is higher than that with Programming Languages…
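To make the idea in the abstract concrete, the following is a minimal sketch of a Multinomial Naive Bayes classifier over code tokens, written with only the Python standard library. It is not the authors' implementation: the tokenizer, the `TinyMNB` class, the smoothing parameter `alpha`, and the six toy training snippets are all assumptions made for illustration, whereas SCC itself is trained on Stack Overflow posts covering 21 languages.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers/keywords, or single punctuation symbols.
# Digits and whitespace are dropped. This is illustrative, not the paper's.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")


def tokenize(snippet):
    """Split a code snippet into word-like tokens and punctuation symbols."""
    return TOKEN_RE.findall(snippet)


class TinyMNB:
    """Hypothetical minimal Multinomial Naive Bayes with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing strength (assumed)
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()               # label -> number of snippets
        self.vocab = set()

    def fit(self, snippets, labels):
        for snippet, label in zip(snippets, labels):
            tokens = tokenize(snippet)
            self.token_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, snippet):
        # Tokens never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(snippet) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            # log P(label) + sum over tokens of log P(token | label)
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data invented for this sketch (not from Stack Overflow).
train = [
    ("def add(a, b):\n    return a + b", "python"),
    ("for x in range(10):\n    print(x)", "python"),
    ("import os\nprint(os.getcwd())", "python"),
    ("public static void main(String[] args) { System.out.println(1); }", "java"),
    ("private int count = 0;\npublic int getCount() { return count; }", "java"),
    ("List<String> xs = new ArrayList<>();", "java"),
]
model = TinyMNB().fit([s for s, _ in train], [l for _, l in train])

print(model.predict("def greet(name):\n    print(name)"))
print(model.predict("public void run() { System.out.println(x); }"))
```

Working in log space avoids numerical underflow when many token probabilities are multiplied, and the smoothing term keeps a token that is unseen for one language from driving that language's probability to zero. With only a few lines per snippet, punctuation and keywords carry most of the signal, which is one plausible reason short snippets are harder to classify than whole files.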
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Topic Modeling
