Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus
Chen Wu, Ming Yan

TL;DR
This paper introduces a novel deep semantic model for code search that effectively bridges natural language and code semantics, achieving top performance on the CodeSearchNet benchmark.
Contribution
The paper proposes a new deep semantic model utilizing multi-modal sources, self-attention, and combined representations, advancing neural code search techniques.
Findings
Achieved 0.384 NDCG on the CodeSearchNet benchmark
Won first place in the CodeSearchNet challenge
Enhanced representation learning through multi-modal and cross-lingual alignment
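As a rough illustration of the NDCG metric reported above, the following is a minimal sketch of one common formulation (exponential gain with a log2 rank discount). The function names `dcg` and `ndcg` are hypothetical, and the exact variant used by the CodeSearchNet benchmark is not stated here, so it may differ in detail.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance grades listed in ranked order."""
    # Assumed formulation: gain 2^rel - 1, discount log2(rank + 1) with 1-based ranks.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned snippets, in the order the system ranked them.
perfect = ndcg([3, 2, 1, 0])
imperfect = ndcg([0, 1, 3])
```

A ranking that already lists results in descending relevance scores 1.0; placing the most relevant snippet lower reduces the score.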
Abstract
Semantic code search is the task of retrieving a relevant code snippet given a natural language query. Unlike typical information retrieval tasks, code search requires bridging the semantic gap between programming language and natural language to better describe intrinsic concepts and semantics. Recently, deep neural networks for code search have become a hot research topic. Typical methods for neural code search first represent the code snippet and the query text as separate embeddings, and then use a vector similarity measure (e.g. dot product or cosine) to calculate the semantic similarity between them. There are many ways to aggregate a variable-length sequence of code or query tokens into a learnable embedding, including bi-encoder, cross-encoder, and poly-encoder. The goal of the query encoder and code encoder is to produce embeddings that are close to each other for a…
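The embed-then-compare step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' model: it assumes the query and code embeddings have already been produced by separate encoders, and the vectors, snippet ids, and function names (`cosine`, `rank_snippets`) are hypothetical placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def rank_snippets(query_vec, code_vecs):
    """Return snippet ids sorted by descending cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), sid) for sid, vec in code_vecs.items()]
    return [sid for _, sid in sorted(scored, key=lambda p: p[0], reverse=True)]

# Hand-made placeholder embeddings; a real system would obtain these from
# a trained query encoder and a trained code encoder.
query_vec = [1.0, 0.0]
code_vecs = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
ranking = rank_snippets(query_vec, code_vecs)
```

Because the code embeddings do not depend on the query, they can be computed once and stored, which is the usual reason for preferring this setup when searching a large corpus.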
Taxonomy
Topics: Software Engineering Research · Web Data Mining and Analysis · Software Testing and Debugging Techniques
