Boosting Source Code Learning with Text-Oriented Data Augmentation: An Empirical Study
Zeming Dong, Qiang Hu, Yuejun Guo, Zhenya Zhang, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao

TL;DR
This paper investigates the effectiveness of text-oriented data augmentation methods, originally designed for natural language, in improving source code learning tasks, demonstrating their benefits even when syntax is slightly broken.
Contribution
It is the first comprehensive empirical study applying natural language data augmentation techniques to source code learning tasks.
Findings
Certain text-oriented data augmentation methods improve the accuracy and robustness of source code models.
These benefits persist even when the augmentation slightly breaks the syntax of the source code.
The study covers four code problems and four neural network architectures.
Abstract
Recent studies have demonstrated remarkable advancements in source code learning, which applies deep neural networks (DNNs) to tackle various software engineering tasks. Similar to other DNN-based domains, source code learning also requires massive high-quality training data to achieve the success of these applications. Data augmentation, a technique used to produce additional training data, is widely adopted in other domains (e.g. computer vision). However, the existing practice of data augmentation in source code learning is limited to simple syntax-preserved methods, such as code refactoring. In this paper, considering that source code can also be represented as text data, we take an early step to investigate the effectiveness of data augmentation methods originally designed for natural language texts in the context of source code learning. To this end, we focus on code…
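To make the idea of treating source code as text more concrete, here is a minimal Python sketch of two generic text-oriented augmentations, random token swap and random token deletion, applied to a whitespace-tokenized snippet. The function names (`random_swap`, `random_deletion`, `augment`), the whitespace tokenizer, and the parameter values are illustrative assumptions; the summary above does not say which operators or settings the paper uses.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(code, rng):
    """Hypothetical helper: produce one swapped and one deleted variant of code."""
    tokens = code.split()  # assumption: naive whitespace tokenization
    swapped = " ".join(random_swap(tokens, 1, rng))
    deleted = " ".join(random_deletion(tokens, 0.1, rng))
    return swapped, deleted


rng = random.Random(0)  # fixed seed so runs are repeatable
snippet = "def add ( a , b ) : return a + b"
swapped, deleted = augment(snippet, rng)
print(swapped)
print(deleted)
```

Neither operator checks that its output still parses, so the augmented snippet may no longer be valid code. This is the kind of slightly broken syntax the findings refer to.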
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability
Methods: Mixup
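Mixup, the method listed above, is commonly formulated as a convex combination of two training examples and their labels. The sketch below shows that generic formulation on plain Python lists; the function name `mixup`, the `alpha` value, and the choice to mix fixed-length feature vectors with one-hot labels are assumptions for illustration, not details confirmed by this summary.

```python
import random


def mixup(x1, y1, x2, y2, alpha, rng):
    """Blend two (features, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # fixed seed so runs are repeatable
# Assumed toy inputs: two 3-dimensional feature vectors with one-hot labels.
x_mixed, y_mixed, lam = mixup(
    [1.0, 0.0, 2.0], [1.0, 0.0],
    [0.0, 4.0, 2.0], [0.0, 1.0],
    alpha=0.2, rng=rng,
)
print(lam, x_mixed, y_mixed)
```

Because the labels are mixed with the same weight as the features, the resulting soft label still sums to one when both inputs are one-hot.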
