High-quality data augmentation for code comment classification
Thomas Borsani, Andrea Rosani, Giuseppe Di Fatta

TL;DR
This paper introduces data augmentation methods that enlarge and rebalance datasets for code comment classification, improving classifier accuracy on this software engineering NLP task (by 2.56% with Q-SYNTH).
Contribution
The paper proposes synthetic oversampling and augmentation techniques, based on high-quality data generation, to address the small size and class imbalance of existing code comment classification datasets.
Findings
Q-SYNTH improves classifier accuracy by 2.56%
Enhanced datasets better represent real-world comment distributions
Synthetic data generation boosts NLP performance in software engineering
Abstract
Code comments serve a crucial role in software development for documenting functionality, clarifying design choices, and assisting with issue tracking. They capture developers' insights about the surrounding source code, serving as an essential resource for both human comprehension and automated analysis. Nevertheless, since comments are in natural language, they present challenges for machine-based code understanding. To address this, recent studies have applied natural language processing (NLP) and deep learning techniques to classify comments according to developers' intentions. However, existing datasets for this task suffer from size limitations and class imbalance, as they rely on manual annotations and may not accurately represent the distribution of comments in real-world codebases. To overcome this issue, we introduce new synthetic oversampling and augmentation techniques based…
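To make the class imbalance problem concrete, the following is a minimal sketch of plain random oversampling, the simplest baseline that synthetic oversampling methods aim to improve on. It is not the paper's Q-SYNTH method, whose details are not given in this summary. The function name `random_oversample`, the labels, and the toy comments are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the largest one.

    `samples` is a list of (comment_text, label) pairs. This is a naive
    baseline, NOT the paper's method: it only repeats existing comments.
    """
    rng = random.Random(seed)

    # Group comment texts by their label.
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(texts) for texts in by_label.values())

    balanced = []
    for label, texts in by_label.items():
        balanced.extend((t, label) for t in texts)
        # Draw extra copies (with replacement) to close the gap.
        extra = [rng.choice(texts) for _ in range(target - len(texts))]
        balanced.extend((t, label) for t in extra)
    return balanced


# Hypothetical toy data: comments and labels are invented for illustration.
toy = [
    ("Returns the user id.", "summary"),
    ("Computes the checksum of the buffer.", "summary"),
    ("Opens the connection.", "summary"),
    ("TODO: handle timeouts.", "todo"),
]

balanced = random_oversample(toy)
counts = Counter(label for _, label in balanced)
```

Because this baseline only repeats existing comments, it balances label counts without adding any new wording, which is the gap that generating new synthetic samples is meant to fill.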
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Advanced Malware Detection Techniques
