Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora
Anh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi D. Q. Bui

TL;DR
This paper explores adapting data-influence methods to identify noisy samples in source code datasets, aiming to improve the quality of neural code models for practical software engineering applications.
Contribution
It introduces a novel application of data-influence techniques to detect noise in source code corpora for neural models, enhancing data quality for better model performance.
Findings
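Data-influence methods can identify noisy samples in the source code datasets used to train neural code models, as evaluated on classification-based tasks.
Flagging such samples offers a way to improve the quality of training data for neural source code models.
Noise detection supports the broader, data-centric goal of building source code models that are reliable enough for practical use.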
Abstract
Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. In this paper, we adapt data-influence methods to detect such noise. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy training samples for neural code models in classification-based tasks. This approach will contribute to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.
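To make the idea in the abstract concrete, here is a minimal sketch of scoring a target sample by its similarity to trusted samples. The abstract does not say which data-influence method the authors use, so this sketch assumes a simple gradient dot-product score (in the spirit of gradient-tracing methods such as TracIn) and a toy logistic-regression classifier in place of a neural code model. All function names, the synthetic data, and the 10% label-flip rate are hypothetical choices for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(w, X, y):
    """Gradient of the logistic loss w.r.t. w for each sample, shape (n, d)."""
    return (sigmoid(X @ w) - y)[:, None] * X

def train_logreg(X, y, lr=0.5, steps=300):
    """Plain full-batch gradient descent on the mean logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * per_sample_grads(w, X, y).mean(axis=0)
    return w

def influence_scores(w, X_train, y_train, X_ref, y_ref):
    """Assumed score: dot product between each training sample's gradient
    and the mean gradient of a trusted reference set. A strongly negative
    score means the sample pulls the model away from the trusted samples."""
    g_train = per_sample_grads(w, X_train, y_train)
    g_ref = per_sample_grads(w, X_ref, y_ref).mean(axis=0)
    return g_train @ g_ref

def make_toy_data(n, rng):
    """Two Gaussian clusters with a bias column; a stand-in for code features."""
    y = rng.integers(0, 2, size=n).astype(float)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)
    return np.hstack([X, np.ones((n, 1))]), y

rng = np.random.default_rng(0)
X_train, y_clean = make_toy_data(200, rng)
X_ref, y_ref = make_toy_data(50, rng)  # small set assumed to be correctly labeled

# Inject label noise into 10% of the training set (hypothetical noise model).
noisy_idx = rng.choice(200, size=20, replace=False)
y_train = y_clean.copy()
y_train[noisy_idx] = 1.0 - y_train[noisy_idx]

w = train_logreg(X_train, y_train)
scores = influence_scores(w, X_train, y_train, X_ref, y_ref)

# Flag the lowest-scoring samples as suspected noise and measure recall.
flagged = np.argsort(scores)[:20].tolist()
recall = len(set(flagged) & set(noisy_idx.tolist())) / len(noisy_idx)
print(f"recall of injected noisy labels among 20 flagged samples: {recall:.2f}")
```

In this toy setting, mislabeled samples produce large gradients that point against the trusted set, so they sink to the bottom of the ranking. A real neural code model would need per-sample gradients from the trained network (often restricted to the last layer to keep the cost manageable), and the trusted reference set would have to come from manual verification.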
