AuthAttLyzer-V2: Unveiling Code Authorship Attribution using Enhanced Ensemble Learning Models & Generating Benchmark Dataset
Bhaskar Joshi, Sepideh HajiHossein Khani, Arash HabibiLashkari

TL;DR
This paper introduces AuthAttLyzer-V2, an advanced ensemble learning framework for source code authorship attribution that leverages diverse features and a new benchmark dataset to improve accuracy and interpretability in cybersecurity applications.
Contribution
AuthAttLyzer-V2 presents a novel feature extraction method and combines ensemble models with SHAP-based interpretability, advancing the state of the art in code authorship attribution for cybersecurity.
Findings
Ensemble models effectively identify individual coding styles.
The new dataset enables robust evaluation of authorship attribution methods.
SHAP enhances interpretability of model decisions.
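To make the interpretability finding concrete, here is a minimal sketch of the Shapley value idea that SHAP builds on. It is not the paper's pipeline and does not use the SHAP library: it enumerates every feature ordering exactly, which is only feasible for a handful of features. The scoring function `toy_author_score` and the feature names (`tab_indent`, `brace_same_line`, `snake_case`) are hypothetical, chosen only for illustration.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings in which features switch from baseline to instance."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)       # start with every feature at its baseline
        previous = predict(current)
        for name in order:
            current[name] = instance[name]   # reveal one feature at a time
            score = predict(current)
            totals[name] += score - previous
            previous = score
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_author_score(x):
    """Hypothetical score for 'author A' (an assumption, not a trained model)."""
    return (0.5 * x["tab_indent"]
            + 0.25 * x["brace_same_line"]
            + 0.25 * x["tab_indent"] * x["snake_case"])


instance = {"tab_indent": 1.0, "brace_same_line": 1.0, "snake_case": 1.0}
baseline = {"tab_indent": 0.0, "brace_same_line": 0.0, "snake_case": 0.0}
phi = shapley_values(toy_author_score, instance, baseline)
```

The attributions sum to the difference between the score for the instance and the score for the baseline, which is the property that lets a per-feature breakdown be read as an explanation of one prediction. The interaction term is split evenly between the two features involved.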
Abstract
Source Code Authorship Attribution (SCAA) is crucial for software classification because it provides insights into the origin and behavior of software. By accurately identifying the author or group behind a piece of code, experts can better understand the motivations and techniques of developers. In the cybersecurity era, this attribution helps trace the source of malicious software, identify patterns in the code that may indicate specific threat actors or groups, and ultimately enhance threat intelligence and mitigation strategies. This paper presents AuthAttLyzer-V2, a new source code feature extractor for SCAA, focusing on lexical, semantic, syntactic, and N-gram features. Our research explores author identification in C++ by examining 24,000 source code samples from 3,000 authors. Our methodology integrates Random Forest, Gradient Boosting, and XGBoost models, enhanced with SHAP for…
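As a rough illustration of what lexical and N-gram features for C++ source might look like, the sketch below computes a few layout statistics and character trigram counts using only the Python standard library. It is not the AuthAttLyzer-V2 extractor: the function names, the chosen statistics, and the sample snippet are all assumptions made for this example.

```python
import re
from collections import Counter


def char_ngrams(source: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the source text."""
    return Counter(source[i:i + n] for i in range(len(source) - n + 1))


def lexical_features(source: str) -> dict:
    """Compute a few simple layout and lexical statistics (illustrative only)."""
    lines = source.splitlines()
    non_empty = [ln for ln in lines if ln.strip()]
    tokens = re.findall(r"[A-Za-z_]\w*", source)  # identifiers and keywords
    return {
        "num_lines": len(lines),
        "avg_line_length": sum(len(ln) for ln in non_empty) / max(len(non_empty), 1),
        "comment_lines": sum(1 for ln in lines if ln.strip().startswith("//")),
        "tab_indented_lines": sum(1 for ln in lines if ln.startswith("\t")),
        "unique_identifier_ratio": len(set(tokens)) / max(len(tokens), 1),
    }


# A tiny hypothetical C++ sample; real inputs would be whole source files.
sample = "int main() {\n    // entry point\n    int x = 1;\n    return x;\n}\n"
features = lexical_features(sample)
trigrams = char_ngrams(sample, n=3)
```

Feature dictionaries like these could be flattened into fixed-length vectors and passed to tree-based classifiers such as the Random Forest, Gradient Boosting, and XGBoost models named in the abstract; syntactic and semantic features would need a real C++ parser, which is out of scope for this sketch.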
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection
Methods: Shapley Additive Explanations
