Improving Code Search with Hard Negative Sampling Based on Fine-tuning

Hande Dong; Jiayi Lin; Yanlin Wang; Yichong Leng; Jiawei Chen; Yutao Xie

arXiv:2305.04508 · cs.SE · November 25, 2024 · 2 cites

TL;DR

This paper proposes a novel code search framework combining dual-encoder and cross-encoder architectures with hard negative sampling to improve accuracy and efficiency in retrieving relevant code snippets.

Contribution

It introduces a Retriever-Ranker framework with a ranking-based hard negative sampling method, enhancing code search performance over existing models.
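The idea of ranking-based hard negative sampling can be illustrated with a minimal sketch. This summary does not give the paper's exact procedure, so the following is an assumption: negatives for the ranker are drawn from the non-matching code snippets that the retriever scores highest for a query. The function name `sample_hard_negatives` and the parameters `top_n`, `num_neg`, and `seed` are hypothetical.

```python
import random

def sample_hard_negatives(scores, positive_idx, top_n=5, num_neg=2, seed=0):
    """Pick hard negatives for one query (illustrative assumption, not the paper's code).

    scores: retriever similarity of the query to every candidate snippet.
    positive_idx: index of the snippet that truly matches the query.
    """
    # Rank all candidates by retriever score, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Hard-negative pool: the top-ranked candidates that are NOT the true match.
    pool = [i for i in ranked if i != positive_idx][:top_n]
    # Sample from the pool instead of the whole corpus.
    rng = random.Random(seed)
    return rng.sample(pool, min(num_neg, len(pool)))
```

Sampling from this pool rather than uniformly from the corpus gives the ranker training pairs that look similar to the query but are wrong, which is the kind of case it has to learn to separate.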

Findings

01 Outperforms baseline models on four datasets.

02 Hard negative sampling improves cross-encoder discrimination.

03 Cascaded framework balances efficiency and accuracy.

Abstract

Pre-trained code models have emerged as the state-of-the-art paradigm for code search tasks. The paradigm involves pre-training the model on search-irrelevant tasks such as masked language modeling, followed by a fine-tuning stage that focuses on the search-relevant task. The typical fine-tuning method employs a dual-encoder architecture to encode semantic embeddings of the query and the code separately, and then calculates their similarity based on those embeddings. However, the typical dual-encoder architecture falls short in modeling token-level interactions between query and code, which limits the capabilities of the model. To address this limitation, we introduce a cross-encoder architecture for code search that jointly encodes the concatenation of query and code. We further introduce a Retriever-Ranker (RR) framework that cascades the dual-encoder and cross-encoder to promote the…
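The cascade described in the abstract can be sketched as a two-stage pipeline. This is a toy illustration under stated assumptions, not the authors' implementation: `embed` stands in for a dual-encoder tower using bag-of-words counts, `cross_score` stands in for a cross-encoder by scoring the query and code jointly with token-order overlap, and `retrieve_then_rank` and `k` are hypothetical names.

```python
import math
from collections import Counter

def embed(text):
    """Toy dual-encoder tower: each text is encoded on its own."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cross_score(query, code):
    """Toy cross-encoder: looks at query and code together.

    Shared tokens count once; shared adjacent token pairs count double,
    a crude stand-in for token-level interaction.
    """
    q, c = query.lower().split(), code.lower().split()
    c_bigrams = set(zip(c, c[1:]))
    unigram_hits = sum(1 for t in q if t in c)
    bigram_hits = sum(1 for bg in zip(q, q[1:]) if bg in c_bigrams)
    return unigram_hits + 2 * bigram_hits

def retrieve_then_rank(query, corpus, k=2):
    """Stage 1 recalls k candidates cheaply; stage 2 re-ranks only those k."""
    q_vec = embed(query)
    sims = [cosine(q_vec, embed(code)) for code in corpus]  # code vectors could be cached offline
    candidates = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:k]
    return sorted(candidates, key=lambda i: cross_score(query, corpus[i]), reverse=True)

corpus = ["list sort def", "def read file lines", "def sort list ascending order helper"]
print(retrieve_then_rank("sort list", corpus, k=2))  # prints [2, 0]
```

In this toy run the retriever alone prefers snippet 0, while the joint scorer promotes snippet 2 because it contains the query tokens in order. The expensive pairwise scorer runs on only `k` candidates rather than the whole corpus, which is the efficiency and accuracy trade-off the cascade is meant to balance.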

Code & Models

Repositories

donghande/r2ps · PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications