Detecting Multiple Semantic Concerns in Tangled Code Commits
Beomsu Koh, Neil Walkinshaw, Donghwan Shin

TL;DR
This paper investigates using small language models to detect multiple semantic concerns in tangled code commits, demonstrating that fine-tuned models can effectively identify up to three concerns, especially when commit messages are included.
Contribution
The paper formulates multi-concern detection as a multi-label classification problem and evaluates small language models, combined with various techniques, on a controlled dataset.
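To make the multi-label framing concrete, here is a minimal sketch of how the concerns in one tangled commit could be encoded as a multi-hot vector and how a predicted label set could be scored. The label list uses commonly seen Conventional Commits types and is an assumption, not necessarily the label set used in the paper; the helper names `to_multi_hot`, `exact_match`, and `jaccard` are hypothetical.

```python
from typing import Iterable, List

# Assumed label set: commonly used Conventional Commits types.
# The paper's actual label set may differ.
CONCERN_TYPES: List[str] = [
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore",
]


def to_multi_hot(concerns: Iterable[str]) -> List[int]:
    """Encode a set of concern labels as a 0/1 vector over CONCERN_TYPES."""
    present = set(concerns)
    unknown = present - set(CONCERN_TYPES)
    if unknown:
        raise ValueError(f"unknown concern types: {sorted(unknown)}")
    return [1 if label in present else 0 for label in CONCERN_TYPES]


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """True only when the predicted label set equals the gold label set."""
    return set(predicted) == set(gold)


def jaccard(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Overlap between predicted and gold label sets (1.0 if both are empty)."""
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)
```

Unlike single-label classification, a commit here may carry any number of active labels, so a prediction can be partially right; a set-overlap score such as Jaccard captures that, whereas exact match does not.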
Findings
Fine-tuned 14B-parameter SLMs are competitive with state-of-the-art LLMs for single concerns.
SLMs can detect up to three concerns effectively.
Including commit messages significantly improves detection accuracy.
Abstract
Code commits in a version control system (e.g., Git) should be atomic, i.e., focused on a single goal, such as adding a feature or fixing a bug. In practice, however, developers often bundle multiple concerns into tangled commits, obscuring intent and complicating maintenance. Recent studies have used Conventional Commits Specification (CCS) and Language Models (LMs) to capture commit intent, demonstrating that Small Language Models (SLMs) can approach the performance of Large Language Models (LLMs) while maintaining efficiency and privacy. However, they do not address tangled commits involving multiple concerns, leaving the feasibility of using LMs for multi-concern detection unresolved. In this paper, we frame multi-concern detection in tangled commits as a multi-label classification problem and construct a controlled dataset of artificially tangled commits based on real-world data.…
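One simple way to picture an artificially tangled commit is to merge several single-concern commits into one sample whose label set is the union of their concerns. The sketch below assumes exactly that and nothing more; the `AtomicCommit` type, the `tangle` function, and the choice to join diffs and messages by plain concatenation are all hypothetical illustrations, not the construction procedure described in the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class AtomicCommit:
    """A single-concern commit (hypothetical structure for illustration)."""
    concern: str   # e.g. "feat" or "fix"
    message: str   # commit message text
    diff: str      # unified diff text


def tangle(commits: Sequence[AtomicCommit]) -> Tuple[str, str, List[str]]:
    """Combine atomic commits into one synthetic tangled sample.

    Returns (combined diff, combined message, sorted unique concern labels).
    Concatenation is an assumed, simplistic merging strategy.
    """
    if not commits:
        raise ValueError("need at least one atomic commit")
    combined_diff = "\n".join(c.diff for c in commits)
    combined_message = "; ".join(c.message for c in commits)
    labels = sorted({c.concern for c in commits})
    return combined_diff, combined_message, labels
```

For example, tangling one `feat` commit and one `fix` commit yields a sample labeled `["feat", "fix"]`; the message can be withheld from the model input to compare detection with and without it.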
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques
