SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing

Wenchao Gu; Ensheng Shi; Yanlin Wang; Lun Du; Shi Han; Hongyu Zhang; Dongmei Zhang; Michael R. Lyu

arXiv:2412.11728·cs.SE·December 17, 2024

TL;DR

SECRET is a segmented deep hashing method that accelerates large-scale code retrieval by converting long hash codes into shorter segments, reducing retrieval time by at least 95% while achieving comparable or higher retrieval performance.

Contribution

The paper proposes SECRET, a novel segmented deep hashing approach that enhances scalability and efficiency in code retrieval by using multiple hash code segments for faster lookup.
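To make the idea of looking up candidates by hash code segment concrete, here is a minimal sketch of a generic segment-indexed lookup. It is not the paper's implementation: the helper names (`split_segments`, `build_index`, `recall_candidates`), the toy 8-bit codes, and the exact-match rule per segment are all assumptions chosen for illustration, and the way SECRET learns or assigns its segments is not described in this summary.

```python
from collections import defaultdict

def split_segments(code: str, num_segments: int) -> list[str]:
    """Split a bit string into equal-length segments (hypothetical helper)."""
    seg_len = len(code) // num_segments
    return [code[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

def build_index(codes: dict[str, str], num_segments: int) -> list[dict]:
    """Build one hash table per segment position: segment value -> snippet ids."""
    tables = [defaultdict(set) for _ in range(num_segments)]
    for snippet_id, code in codes.items():
        for pos, seg in enumerate(split_segments(code, num_segments)):
            tables[pos][seg].add(snippet_id)
    return tables

def recall_candidates(query_code: str, tables: list[dict]) -> set[str]:
    """Return every snippet that shares at least one exact segment with the query."""
    candidates = set()
    for pos, seg in enumerate(split_segments(query_code, len(tables))):
        # A table lookup replaces a scan over the whole database.
        candidates |= tables[pos].get(seg, set())
    return candidates

# Toy database of 8-bit hash codes (made-up values, for illustration only).
CODES = {"a": "10110010", "b": "10111111", "c": "00000010", "d": "01001101"}

tables = build_index(CODES, num_segments=2)
candidates = recall_candidates("10110110", tables)
print(sorted(candidates))
```

In this toy run only snippets `a` and `b` share a segment with the query, so only those two would need to be re-ranked. The cost of the recall step depends on the number of segments and the size of the matching buckets rather than on the size of the database.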

Findings

01

Reduces retrieval time by at least 95%.

02

Achieves comparable or higher retrieval performance.

03

Outperforms classical LSH in efficiency and accuracy.

Abstract

Code retrieval, which retrieves code snippets based on users' natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval paradigm from lexical-based matching towards leveraging deep learning models to encode source code and queries into vector representations, facilitating code retrieval according to vector similarity. Despite the effectiveness of these models, managing large-scale code databases presents significant challenges. Previous research proposes deep hashing-based methods, which generate hash codes for queries and code snippets and use Hamming distance for rapid recall of code candidates. However, this approach's reliance on linear scanning of the entire code base limits its scalability. To further improve the efficiency of large-scale code retrieval, we propose a…
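The linear-scan recall step that the abstract identifies as the scalability bottleneck can be sketched as follows. This is an illustrative baseline under stated assumptions, not code from the paper: hash codes are represented as Python integers, and the function names and the toy database are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")

def linear_scan(query: int, database: dict[str, int], top_k: int) -> list[str]:
    """Rank every snippet by Hamming distance to the query and keep the top_k.

    Every entry in the database is compared against the query, so the cost
    grows linearly with the number of stored snippets.
    """
    ranked = sorted(database, key=lambda sid: hamming_distance(query, database[sid]))
    return ranked[:top_k]

# Toy database of 8-bit hash codes (made-up values, for illustration only).
DATABASE = {"a": 0b10110010, "b": 0b10111111, "c": 0b00000010, "d": 0b01001101}

top = linear_scan(0b10110110, DATABASE, top_k=2)
print(top)
```

Each Hamming comparison is cheap, but the scan still touches every code in the database, which is why the per-query cost scales with database size.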
