On the Importance of Building High-quality Training Datasets for Neural Code Search

Zhensu Sun; Li Li; Yan Liu; Xiaoning Du; Li Li

arXiv:2202.06649 · cs.SE · February 15, 2022 · 1 citation

TL;DR

This paper emphasizes the critical role of high-quality datasets in neural code search, revealing noise issues in existing datasets and proposing a novel semantic data cleaning framework that significantly enhances model performance.

Contribution

It introduces the first semantic query cleaning framework for code search datasets, improving data quality and neural model effectiveness.

Findings

01 Filtering improves DeepCS model performance by 19.2% MRR

02 Semantic cleaning reduces query noise significantly

03 Enhanced datasets lead to better real-world code search results
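For readers unfamiliar with the metric in the first finding, MRR (mean reciprocal rank) averages 1/rank of the correct code snippet over all test queries. The following is a minimal sketch of the standard definition, not the paper's evaluation script; the function name `mean_reciprocal_rank` and the convention of passing `None` for a query whose correct snippet was not retrieved are assumptions made for illustration.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Compute MRR from the 1-based rank of the correct snippet per query.

    A rank of None means the correct snippet was not retrieved for that
    query; it contributes 0 to the sum (an assumed convention).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # missed query: reciprocal rank counted as 0
        if rank < 1:
            raise ValueError("ranks are 1-based and must be >= 1")
        total += 1.0 / rank
    return total / len(ranks)


if __name__ == "__main__":
    # Four queries: correct snippet at rank 1, rank 2, not found, rank 4.
    print(mean_reciprocal_rank([1, 2, None, 4]))
```

A higher MRR means the correct snippet tends to appear nearer the top of the result list. This page does not say whether the 19.2% figure is a relative or an absolute change in MRR.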

Abstract

The performance of neural code search is significantly influenced by the quality of the training data from which the neural models are derived. A large corpus of high-quality query and code pairs is needed to establish a precise mapping from natural language to programming language. Because such pairs are in limited supply, most widely used code search datasets are built with compromises, such as using code comments as a replacement for queries. Our empirical study on a well-known code search dataset reveals that over one-third of its queries contain noise that makes them deviate from natural user queries. Models trained on noisy data suffer severe performance degradation when applied in real-world scenarios. Improving the dataset quality and making the queries of its samples semantically identical to real user queries is critical for the practical usability of neural code…
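To make the idea of noisy comment-derived queries concrete, here is a small illustrative sketch of a rule-based pre-filter over (comment, code) pairs. It is not the paper's framework, which this page describes as semantic. The specific rules (HTML tags, URLs, documentation tags, a minimum word count) and the names `looks_like_query` and `filter_pairs` are hypothetical choices made for this example.

```python
import re
from typing import List, Tuple

# Hypothetical noise patterns; these rules are assumptions, not taken from the paper.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
DOC_TAG = re.compile(r"@(param|return|returns|throws|see|link)\b")


def looks_like_query(comment: str, min_words: int = 3) -> bool:
    """Return True if the comment passes all (assumed) noise rules."""
    if HTML_TAG.search(comment) or URL.search(comment) or DOC_TAG.search(comment):
        return False  # markup, links, and doc tags rarely appear in user queries
    return len(comment.split()) >= min_words  # very short comments carry little intent


def filter_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the (comment, code) pairs whose comment resembles a query."""
    return [(comment, code) for comment, code in pairs if looks_like_query(comment)]


if __name__ == "__main__":
    samples = [
        ("convert a string to an integer", "int(s)"),
        ("<p>Returns the value</p>", "return self.value"),
        ("@param x the input value", "def f(x): ..."),
    ]
    print(filter_pairs(samples))
```

Surface rules like these only catch syntactic noise. Judging whether a comment is semantically close to a real user query requires a learned model, which is outside the scope of this sketch.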

Code & Models

Repositories

v587su/nlqf (PyTorch, official)

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Topic Modeling