Reassessing Code Authorship Attribution in the Era of Language Models
Atish Kumar Dipongkor, Ziyu Yao, Kevin Moran

TL;DR
This paper evaluates the effectiveness of large language models for code authorship attribution, revealing their strengths and limitations in identifying coding styles across diverse datasets.
Contribution
It provides the first comprehensive empirical analysis of transformer-based language models for code authorship attribution on multiple datasets.
Findings
Large language models show promising results in code authorship attribution (CAA) tasks.
Model interpretability reveals how LMs understand coding styles.
Insights suggest future directions for improving CAA accuracy.
Abstract
The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA is crucial in cybersecurity and software forensics for detecting plagiarism and supporting criminal prosecutions. However, CAA is a complex and error-prone task because it requires recognizing nuanced relationships between coding patterns. This challenge is compounded in large software systems with numerous authors due to the subtle variability of patterns that signify the coding style of one author among many. Given the challenges related to this task, researchers have proposed and studied automated approaches that rely upon classical Machine Learning and Deep Learning techniques. However, such techniques have historically relied upon hand-crafted features, and due to the often intricate interaction of different…
Taxonomy
Topics: Authorship Attribution and Profiling · Software Engineering Research · Spam and Phishing Detection
