Search4Code: Code Search Intent Classification Using Weak Supervision
Nikitha Rao, Chetan Bansal, Joe Guan

TL;DR
This paper introduces Search4Code, a large-scale dataset of web search queries for code search intent classification in C# and Java, and proposes a weak supervision approach with which a CNN-based classifier reaches 77% accuracy on C# queries and 76% on Java queries.
Contribution
The paper presents the first large-scale real-world dataset of code search queries and a weak supervision method for classifying search intent in programming language queries.
Findings
CNN model achieves 77% accuracy for C#
CNN model achieves 76% accuracy for Java
Release of Search4Code dataset for future research
Abstract
Developers use search for various tasks such as finding code, documentation, debugging information, etc. In particular, web search is heavily used by developers for finding code examples and snippets during the coding process. Recently, natural language based code search has been an active area of research. However, the lack of real-world large-scale datasets is a significant bottleneck. In this work, we propose a weak supervision based approach for detecting code search intent in search queries for C# and Java programming languages. We evaluate the approach against several baselines on a real-world dataset comprised of over 1 million queries mined from Bing web search engine and show that the CNN based model can achieve an accuracy of 77% and 76% for C# and Java respectively. Furthermore, we are also releasing Search4Code, the first large-scale real-world dataset of code search queries…
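To make the idea of weak supervision concrete, here is a minimal sketch of how noisy heuristic rules ("labeling functions") could assign code-intent labels to unlabeled queries. The rules, keyword lists, and the majority-vote combiner are all hypothetical illustrations; the abstract does not specify the heuristics or label model the authors used.

```python
import re
from collections import Counter

# Label values; ABSTAIN means a rule has no opinion on the query.
CODE_INTENT, NOT_CODE, ABSTAIN = 1, 0, -1

# Hypothetical labeling functions (not taken from the paper).
def lf_example_keyword(query):
    # Votes "code intent" when the query asks for an example or snippet.
    pattern = r"\b(example|snippet|sample code)\b"
    return CODE_INTENT if re.search(pattern, query.lower()) else ABSTAIN

def lf_how_to(query):
    # Votes "code intent" for "how to ..." phrasing.
    return CODE_INTENT if query.lower().startswith("how to ") else ABSTAIN

def lf_non_code_keyword(query):
    # Votes "not code" for queries about tooling, jobs, and similar topics.
    pattern = r"\b(download|install|salary|jobs|interview)\b"
    return NOT_CODE if re.search(pattern, query.lower()) else ABSTAIN

LABELING_FUNCTIONS = [lf_example_keyword, lf_how_to, lf_non_code_keyword]

def weak_label(query):
    """Combine rule votes by simple majority; abstain on no votes or a tie."""
    votes = [lf(query) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return ranked[0][0]

if __name__ == "__main__":
    for q in ["c# read file example", "java developer salary", "java string"]:
        print(q, "->", weak_label(q))
```

Queries that receive a non-abstain label could then serve as training data for a downstream classifier such as the CNN mentioned in the abstract. Majority voting is used here only for simplicity and stands in for whatever label-combination method the authors used.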
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Topic Modeling
