CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Hamel Husain; Ho-Hsiang Wu; Tiferet Gazit; Miltiadis Allamanis; Marc Brockschmidt

arXiv:1909.09436·cs.LG·June 9, 2020·420 cites

TL;DR

This paper introduces the CodeSearchNet Challenge, a large-scale benchmark for evaluating semantic code search methods across multiple programming languages, aiming to advance research in bridging natural language and code retrieval.

Contribution

It provides a comprehensive dataset, evaluation methodology, and baseline solutions for semantic code search, fostering progress through a public challenge and leaderboard.

Findings

01 Released a corpus of about 6 million functions across six languages (Go, Java, JavaScript, PHP, Python, and Ruby)

02 Annotated 99 natural language queries with about 4k expert relevance labels

03 Launched a benchmark with baseline solutions for the task

Abstract

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping…
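To make the role of the expert relevance annotations concrete, here is a minimal sketch of scoring one query's ranked results against graded labels. It uses normalized discounted cumulative gain (NDCG), a standard information retrieval metric; the summary above does not say which metric the challenge uses, so the metric choice here is an assumption. The function names, result ids, and the 0 to 3 relevance scale are hypothetical illustrations and are not taken from the paper.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of graded relevances given in rank order."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(ranked_ids, labels):
    """NDCG of a ranking, given a dict of result id -> graded relevance.

    Results with no annotation are treated as relevance 0 (an assumption).
    """
    gains = [labels.get(result_id, 0) for result_id in ranked_ids]
    # The ideal ranking places the highest-graded results first.
    ideal = sorted(labels.values(), reverse=True)[: len(ranked_ids)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical annotations for a single query (higher means more relevant).
labels = {"f1": 3, "f2": 0, "f3": 2}

perfect = ndcg(["f1", "f3", "f2"], labels)  # most relevant results first
worse = ndcg(["f2", "f3", "f1"], labels)    # irrelevant result ranked first
```

A ranking that lists results in decreasing order of annotated relevance scores highest, and any ranking that pushes relevant functions down the list scores lower. Averaging such per-query scores over all annotated queries would give a single number for comparing search methods.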

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Reliability and Analysis Research