MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation
Zeming Dong, Qiang Hu, Yuejun Guo, Maxime Cordy, Mike Papadakis, Zhenya Zhang, Yves Le Traon, and Jianjun Zhao

TL;DR
MIXCODE introduces a novel data augmentation method for code classification that combines code refactoring with Mixup, significantly improving model accuracy and robustness across multiple datasets and architectures.
Contribution
The paper proposes MIXCODE, a new data augmentation technique for source code analysis that leverages code refactoring and Mixup, outperforming existing simple augmentation methods.
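To make the Mixup half of the idea concrete, the following is a minimal sketch of the standard Mixup interpolation, not the authors' implementation. It assumes each program has already been mapped to a fixed-length numeric vector and that labels are one-hot; the function name `mixup`, the parameter `alpha`, and the toy vectors are illustrative choices that do not come from the paper. Pairing an original program with a refactored variant is likewise an assumption based on the summary above.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (vector, one-hot label) pairs with a Beta-distributed weight.

    x_a, x_b: equal-length lists of floats (hypothetical code embeddings).
    y_a, y_b: equal-length one-hot (or soft) label lists.
    alpha:    Beta(alpha, alpha) shape parameter; an assumed default.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for repeatability
    # Toy stand-ins: an "original" program vector and a "refactored" one.
    original, refactored = [0.9, 0.1, 0.4], [0.7, 0.3, 0.5]
    label = [1.0, 0.0]  # refactoring is assumed to preserve the label
    x_mix, y_mix, lam = mixup(original, label, refactored, label, rng=rng)
    print(lam, x_mix, y_mix)
```

When both inputs share a label, as in the usage above, the mixed label equals that label and only the input vector changes; when the labels differ, the result is a soft label weighted by `lam`.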
Findings
MIXCODE improves accuracy by up to 6.24%.
MIXCODE enhances robustness by up to 26.06%.
MIXCODE is effective across multiple programming languages and model architectures.
Abstract
Inspired by the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied in source code analysis and attracted significant attention from the software engineering community. Due to its data-driven nature, a DNN model requires massive and high-quality labeled training data to achieve expert-level performance. Collecting such data is often not hard, but the labeling process is notoriously laborious. The task of DNN-based code analysis even worsens the situation because source code labeling also demands sophisticated expertise. Data augmentation has been a popular approach to supplement training data in domains such as computer vision and NLP. However, existing data augmentation approaches in code analysis adopt simple methods, such as data transformation and adversarial example generation, thus bringing limited performance…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Reliability and Analysis Research
Methods: CodeBERT · Mixup
