Approaching Code Search for Python as a Translation Retrieval Problem with Dual Encoders
Monoshiz Mahbub Khan, Zhe Yu

TL;DR
This paper presents a code search approach for Python that treats the task as a translation retrieval problem, using dual encoders and a shared embedding space. The authors report improved performance and efficiency over previous models.
Contribution
It introduces a unified language model with dual encoders and a cosine similarity loss for code search, reported to outperform previous models in accuracy at a lower computational cost.
Findings
Achieves better performance than state-of-the-art models.
Reduces computational complexity and training time.
Effectively captures language overlap and key term patterns.
Abstract
Code search is vital in the maintenance and extension of software systems. Past works have used separate language models for the natural language and programming language artifacts on models with multiple encoders and different loss functions. Similarly, this work approaches code search for Python as a translation retrieval problem while the natural language queries and the programming language are treated as two types of languages. By using dual encoders, these two types of language sequences are projected onto a shared embedding space, in which the distance reflects the similarity between a given pair of query and code. However, in contrast to previous work, this approach uses a unified language model, and a dual encoder structure with a cosine similarity loss function. A unified language model helps the model take advantage of the considerable overlap of words between the artifacts,…
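To make the retrieval idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' model. The encoders are replaced by a toy bag-of-words embedding over a single shared vocabulary, standing in for the unified language model. The function names (`tokenize`, `embed`, `cosine`, `cosine_similarity_loss`, `search`) are hypothetical. The loss shown is one common form of a cosine similarity loss, the squared error between the cosine score and a 0/1 match label; whether it matches the paper's exact formulation is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Shared tokenizer for both queries and code (illustrative stand-in).

    Splits camelCase and snake_case identifiers so that a query word such as
    "file" can overlap with a code identifier such as read_file.
    """
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [tok.lower() for tok in re.findall(r"[A-Za-z]+", text)]


def embed(text):
    """Toy 'encoder': a sparse term-count vector in one shared space."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[tok] for tok, count in u.items() if tok in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(u, v, label):
    """Squared error between cosine score and target (1.0 match, 0.0 mismatch).

    Assumed form; a trained dual encoder would minimise this over many pairs.
    """
    return (cosine(u, v) - label) ** 2


def search(query, snippets):
    """Rank code snippets by cosine similarity to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    corpus = [
        "def read_file(path):\n    return open(path).read()",
        "def sort_list(items):\n    return sorted(items)",
    ]
    for score, snippet in search("read a file from path", corpus):
        print(round(score, 3), snippet.splitlines()[0])
```

Even in this toy version, the query and the matching snippet score well only because they share terms such as "read", "file", and "path", which is the kind of word overlap the abstract says a unified language model can exploit. A learned encoder would replace `embed` with dense vectors trained to minimise the loss.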
Taxonomy
Topics: Natural Language Processing Techniques · Computational Physics and Python Applications · Topic Modeling
