On the Importance of Building High-quality Training Datasets for Neural Code Search
Zhensu Sun, Li Li, Yan Liu, Xiaoning Du, Li Li

TL;DR
This paper emphasizes the critical role of high-quality datasets in neural code search, revealing noise issues in existing datasets and proposing a novel semantic data cleaning framework that significantly enhances model performance.
Contribution
It introduces the first semantic query cleaning framework for code search datasets, improving data quality and neural model effectiveness.
Findings
Filtering improves DeepCS model performance by 19.2% MRR
Semantic cleaning reduces query noise significantly
Enhanced datasets lead to better real-world code search results
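The first finding is reported in MRR (mean reciprocal rank), a standard ranking metric for code search. As a minimal sketch of how that metric is usually computed: for each query, take the 1-based rank of the first correct code snippet in the returned list, invert it, and average over all queries. The function name and the convention of using `None` for a query whose correct snippet was not retrieved are assumptions for illustration; this is not the paper's evaluation script.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    ranks[i] is the 1-based position of the first correct result for
    query i, or None if no correct result was retrieved (assumed
    convention; such queries contribute 0 to the average).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # a miss adds 0 to the sum
        if rank < 1:
            raise ValueError("ranks are 1-based and must be >= 1")
        total += 1.0 / rank
    return total / len(ranks)


# Four queries: hits at ranks 1, 2 and 4, plus one miss.
example_mrr = mean_reciprocal_rank([1, 2, None, 4])  # (1 + 0.5 + 0 + 0.25) / 4
```

A higher MRR means the correct snippet tends to appear nearer the top of the results. The summary above does not state whether the 19.2% figure is a relative or an absolute change in MRR.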
Abstract
The performance of neural code search depends heavily on the quality of the training data from which the neural models are derived. A large corpus of high-quality query-code pairs is needed to establish a precise mapping from natural language to programming language. Because such data is scarce, most widely used code search datasets are built with compromises, such as using code comments in place of queries. Our empirical study on a well-known code search dataset reveals that over one-third of its queries contain noise that makes them deviate from natural user queries. Models trained on noisy data suffer severe performance degradation when applied in real-world scenarios. Improving dataset quality and making the queries of its samples semantically identical to real user queries is critical for the practical usability of neural code…
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Topic Modeling
