Approaching Code Search for Python as a Translation Retrieval Problem with Dual Encoders

Monoshiz Mahbub Khan; Zhe Yu

arXiv:2410.03431 · cs.SE · October 31, 2024

TL;DR

This paper treats code search for Python as a translation retrieval problem: dual encoders project natural language queries and code into a shared embedding space, which the authors report improves both performance and efficiency.

Contribution

It combines a unified language model with a dual-encoder structure and a cosine similarity loss for code search, reporting higher accuracy and lower computational cost than previous models.

Findings

01. Achieves better performance than state-of-the-art models.
02. Reduces computational complexity and training time.
03. Effectively captures language overlap and key term patterns.
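
The "language overlap" in the last finding can be illustrated with a small sketch. It is not taken from the paper or its repository: the `word_pieces` helper, the example query, and the example snippet are all hypothetical, and the tokenization (lowercased alphabetic runs) is an assumption chosen only to show how many word pieces a natural language query and a Python function can share.

```python
import re


def word_pieces(text):
    """Return lowercased alphabetic runs, so snake_case names split on '_'."""
    return {piece.lower() for piece in re.findall(r"[A-Za-z]+", text)}


# Hypothetical query/code pair, invented for illustration.
query = "Read a CSV file and return the rows as a list"
code = '''
def read_csv_rows(path):
    with open(path) as file:
        return list(csv.reader(file))
'''

query_pieces = word_pieces(query)
code_pieces = word_pieces(code)
shared = query_pieces & code_pieces

# Jaccard overlap: shared pieces divided by all distinct pieces in either text.
jaccard = len(shared) / len(query_pieces | code_pieces)

print(sorted(shared))
print(f"Jaccard overlap: {jaccard:.2f}")
```

A single vocabulary covering both artifacts lets shared pieces such as `csv` or `rows` map to the same token, which is the kind of overlap a unified language model can exploit.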

Abstract

Code search is vital in the maintenance and extension of software systems. Past works have used separate language models for the natural language and programming language artifacts on models with multiple encoders and different loss functions. Similarly, this work approaches code search for Python as a translation retrieval problem while the natural language queries and the programming language are treated as two types of languages. By using dual encoders, these two types of language sequences are projected onto a shared embedding space, in which the distance reflects the similarity between a given pair of query and code. However, in contrast to previous work, this approach uses a unified language model, and a dual encoder structure with a cosine similarity loss function. A unified language model helps the model take advantage of the considerable overlap of words between the artifacts,…
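
The setup the abstract describes (two encoders, one shared embedding space, cosine similarity as the score) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linked repository uses TensorFlow, while this sketch uses only the Python standard library; `ToyEncoder`, `build_shared_embeddings`, `cosine_loss`, and `rank` are hypothetical names; the encoders are untrained random projections; and the loss is assumed to be `1 - cosine` for a matching pair, which may differ from the exact form used in the paper.

```python
import math
import random
import re


def tokenize(text):
    """Lowercased alphabetic runs, so snake_case names split on underscores."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def build_shared_embeddings(texts, dim, rng):
    """One embedding table for queries AND code (the unified vocabulary)."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tok in vocab}


class ToyEncoder:
    """Hypothetical encoder: mean of token embeddings, then its own linear map."""

    def __init__(self, embeddings, dim, rng):
        self.embeddings = embeddings  # same table object in both encoders
        self.dim = dim
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(dim)]

    def encode(self, text):
        tokens = [t for t in tokenize(text) if t in self.embeddings]
        if not tokens:
            return [0.0] * self.dim
        mean = [sum(self.embeddings[t][i] for t in tokens) / len(tokens)
                for i in range(self.dim)]
        return [sum(w * x for w, x in zip(row, mean)) for row in self.weights]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def cosine_loss(query_vec, code_vec):
    """Assumed loss for a matching pair: 0 when aligned, up to 2 when opposite."""
    return 1.0 - cosine(query_vec, code_vec)


def rank(query, snippets, query_encoder, code_encoder):
    """Score every snippet against the query and sort by descending similarity."""
    q = query_encoder.encode(query)
    scored = [(cosine(q, code_encoder.encode(s)), s) for s in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


rng = random.Random(0)
queries = ["read a csv file into a list of rows"]
snippets = [
    "def read_csv_rows(path): return list(csv.reader(open(path)))",
    "def add(a, b): return a + b",
]
embeddings = build_shared_embeddings(queries + snippets, dim=8, rng=rng)
query_encoder = ToyEncoder(embeddings, dim=8, rng=rng)
code_encoder = ToyEncoder(embeddings, dim=8, rng=rng)

ranked = rank(queries[0], snippets, query_encoder, code_encoder)
loss = cosine_loss(query_encoder.encode(queries[0]),
                   code_encoder.encode(snippets[0]))
for score, snippet in ranked:
    print(f"{score:+.3f}  {snippet}")
```

Because the weights here are random, the printed ranking carries no meaning; training would adjust both encoders so that matching query and code pairs move toward a cosine similarity of 1. Since code vectors do not depend on the query, they can be computed once and reused across searches, which is a general property of dual-encoder retrieval.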

Code & Models

Repositories

hil-se/CodeSearch · tf · Official

Taxonomy

Topics: Natural Language Processing Techniques · Computational Physics and Python Applications · Topic Modeling