AuthAttLyzer-V2: Unveiling Code Authorship Attribution using Enhanced Ensemble Learning Models & Generating Benchmark Dataset

Bhaskar Joshi; Sepideh HajiHossein Khani; Arash HabibiLashkari

arXiv:2406.19896 · cs.SE · July 1, 2024

Open Access

TL;DR

This paper introduces AuthAttLyzer-V2, an advanced ensemble learning framework for source code authorship attribution that leverages diverse features and a new benchmark dataset to improve accuracy and interpretability in cybersecurity applications.

Contribution

AuthAttLyzer-V2 presents a novel feature extraction method and combines ensemble models with SHAP interpretability, advancing the state-of-the-art in code authorship attribution for cybersecurity.

Findings

01 Ensemble models effectively identify individual coding styles.

02 The new dataset enables robust evaluation of authorship attribution methods.

03 SHAP enhances interpretability of model decisions.

Abstract

Source Code Authorship Attribution (SCAA) is crucial for software classification because it provides insights into the origin and behavior of software. By accurately identifying the author or group behind a piece of code, experts can better understand the motivations and techniques of developers. In the cybersecurity era, this attribution helps trace the source of malicious software, identify patterns in the code that may indicate specific threat actors or groups, and ultimately enhance threat intelligence and mitigation strategies. This paper presents AuthAttLyzer-V2, a new source code feature extractor for SCAA, focusing on lexical, semantic, syntactic, and N-gram features. Our research explores author identification in C++ by examining 24,000 source code samples from 3,000 authors. Our methodology integrates Random Forest, Gradient Boosting, and XGBoost models, enhanced with SHAP for…
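To make the lexical and N-gram feature families mentioned in the abstract concrete, here is a minimal sketch in Python. It is not the AuthAttLyzer-V2 extractor: the function names, the keyword subset, and the chosen statistics are all hypothetical and stand in for the kind of stylistic signals such an extractor might compute from C++ source.

```python
import re
from collections import Counter

# Hypothetical keyword subset, chosen for illustration only.
CPP_KEYWORDS = {"for", "while", "if", "else", "return", "int", "class", "const"}


def char_ngrams(source: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the raw source text."""
    return Counter(source[i:i + n] for i in range(len(source) - n + 1))


def lexical_features(source: str) -> dict:
    """Compute a few simple layout and token statistics (assumed feature set)."""
    lines = source.splitlines()
    # Identifier-like tokens: a letter or underscore followed by word characters.
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    keyword_hits = sum(1 for t in tokens if t in CPP_KEYWORDS)
    return {
        "num_lines": len(lines),
        "avg_line_length": sum(len(line) for line in lines) / max(len(lines), 1),
        "keyword_ratio": keyword_hits / max(len(tokens), 1),
        "tab_indented_lines": sum(1 for line in lines if line.startswith("\t")),
    }


# A tiny C++ snippet used as input.
sample = "int main() {\n\tfor (int i = 0; i < 3; ++i) {}\n\treturn 0;\n}\n"
features = lexical_features(sample)
trigrams = char_ngrams(sample, 3)
```

In a full pipeline, counts like these would typically be assembled into one fixed-length vector per sample before being passed to a classifier. Syntactic and semantic features would require a real C++ parser and are left out of this sketch.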

Taxonomy

Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection

Methods: Shapley Additive Explanations
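
As a rough illustration of the idea behind Shapley Additive Explanations, the sketch below computes exact Shapley values for a toy three-feature function by enumerating every coalition. The toy model, the zero baseline, and the helper name `shapley_values` are assumptions made for this example. They are not taken from the paper, and practical SHAP tooling relies on faster model-specific or sampling-based approximations instead of brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in the coalition keep their real value; the rest use the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical scoring function: one additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the model output at `x` and at the baseline, and the interaction term is split evenly between the two features that take part in it. This additivity is what lets per-feature contributions be read as an explanation of a single prediction.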