MGS3: A Multi-Granularity Self-Supervised Code Search Framework

Rui Li; Junfeng Kang; Qi Liu; Liyang He; Zheng Zhang; Yunhao Sha; Linbo Zhu; Zhenya Huang

arXiv:2505.24274 · cs.SE · June 2, 2025

TL;DR

This paper introduces MGS$^{3}$, a multi-granularity self-supervised framework for code search that leverages hierarchical representations and contrastive learning across different code granularities, improving retrieval accuracy.

Contribution

The paper presents a multi-granularity self-supervised contrastive learning framework and MGCodeSearchNet, a dataset of 536K+ natural language and code snippet pairs, to address the performance gap in fine-grained code search.
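As a rough illustration of the contrastive alignment idea mentioned above, the following is a minimal sketch of a generic in-batch InfoNCE loss over paired query and code embeddings. It is not the paper's objective: the function name `info_nce`, the temperature value, and the use of plain Python lists as embeddings are all assumptions made for illustration.

```python
import math

def info_nce(query_vecs, code_vecs, temperature=0.05):
    """Mean in-batch InfoNCE loss; query_vecs[i] is paired with code_vecs[i].

    Hypothetical helper for illustration, not the loss used by MGS3.
    """
    def normalize(v):
        # Scale to unit length so dot products are cosine similarities.
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    queries = [normalize(v) for v in query_vecs]
    codes = [normalize(v) for v in code_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        # Similarity of query i to every snippet in the batch.
        logits = [sum(a * b for a, b in zip(q, c)) / temperature for c in codes]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the paired snippet as the positive.
        total += log_denominator - logits[i]
    return total / len(queries)
```

The loss is small when each query is most similar to its own snippet and grows when another snippet in the batch scores higher. A multi-granularity setup could, in principle, apply such a loss to pairs drawn from each granularity, though how the paper combines them is not stated here.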

Findings

01 Outperforms existing methods on multiple code search benchmarks.

02 Demonstrates effectiveness across various granularities and model architectures.

03 Shows compatibility with pre-trained code representation models.

Abstract

In the pursuit of enhancing software reusability and developer productivity, code search has emerged as a key area, aimed at retrieving code snippets relevant to functionalities based on natural language queries. Despite significant progress in self-supervised code pre-training utilizing the vast amount of code data in repositories, existing methods have primarily focused on leveraging contrastive learning to align natural language with function-level code snippets. These studies have overlooked the abundance of fine-grained (such as block-level and statement-level) code snippets prevalent within the function-level code snippets, which results in suboptimal performance across all levels of granularity. To address this problem, we first construct a multi-granularity code search dataset called MGCodeSearchNet, which contains 536K+ pairs of natural language and code snippets. Subsequently,…
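To make the notion of granularity concrete, here is a minimal sketch that pulls function-level, block-level, and statement-level snippets out of Python source with the standard `ast` module. The mapping of AST node types to granularities (`BLOCK_NODES`) and the helper name `split_granularities` are assumptions for illustration; the abstract does not say how MGCodeSearchNet was actually constructed.

```python
import ast

# Assumed mapping: compound control-flow statements count as "blocks".
BLOCK_NODES = (ast.For, ast.While, ast.If, ast.With, ast.Try)

def split_granularities(source):
    """Return function-, block-, and statement-level snippets from Python source."""
    snippets = {"function": [], "block": [], "statement": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            level = "function"
        elif isinstance(node, BLOCK_NODES):
            level = "block"
        elif isinstance(node, ast.stmt):
            # Simplification: every remaining statement node lands here.
            level = "statement"
        else:
            continue  # Skip modules, expressions, and other non-statement nodes.
        snippets[level].append(ast.get_source_segment(source, node))
    return snippets
```

Note that the finer-grained snippets are contained inside the function-level one, which is the nesting the abstract points to: a single function yields several block-level and statement-level candidates.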

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Learning · ALIGN