CodeMatcher: Searching Code Based on Sequential Semantics of Important Query Words
Chao Liu, Xin Xia, David Lo, Zhiwei Liu, Ahmed E. Hassan, and Shanping Li

TL;DR
CodeMatcher is a novel code search model that combines IR techniques with sequential semantic analysis to significantly improve search accuracy and speed over existing models and search engines.
Contribution
It introduces an IR-based code search approach that leverages sequential semantics of important query words, outperforming prior deep learning models in accuracy and efficiency.
Findings
Achieves an MRR of 0.60, outperforming DeepCS, CodeHow, and UNIF.
Over 1,200 times faster than DeepCS.
Outperforms GitHub and Google search in code retrieval accuracy.
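For readers unfamiliar with the metric, MRR (mean reciprocal rank) averages, over all test queries, the reciprocal of the rank at which the first relevant result appears. The sketch below is a minimal illustration, not the paper's evaluation script: the function name, the example ranks, and the convention that a query with no relevant result contributes 0 are all assumptions made here.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    Each entry is the 1-based rank of the first relevant result for one
    query, or None when no relevant result was returned (assumed here to
    contribute 0 to the sum).
    """
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is None:
            continue  # no relevant result: reciprocal rank counted as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(first_relevant_ranks)


# Hypothetical ranks for five queries, invented for illustration only.
example_ranks = [1, 2, None, 1, 4]
print(round(mean_reciprocal_rank(example_ranks), 2))  # (1 + 0.5 + 0 + 1 + 0.25) / 5 = 0.55
```

Under this definition, an MRR of 0.60 means the first relevant result sits, on average, between the first and second positions.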
Abstract
To accelerate software development, developers frequently search for and reuse existing code snippets from large-scale codebases, e.g., GitHub. Over the years, researchers have proposed many information retrieval (IR) based models for code search, but these fail to bridge the semantic gap between query and code. An early successful deep learning based model, DeepCS, addressed this issue by learning the relationship between pairs of code methods and their corresponding natural language descriptions. Two major advantages of DeepCS are its ability to handle irrelevant/noisy keywords and to capture sequential relationships between words in the query and the code. In this paper, we propose an IR-based model, CodeMatcher, that inherits the advantages of DeepCS while leveraging the indexing technique of IR-based models to substantially reduce search response time. CodeMatcher first collects metadata…
