Revisiting Code Search in a Two-Stage Paradigm
Fan Hu, Yanlin Wang, Lun Du, Xirong Li, Hongyu Zhang, Shi Han, Dongmei Zhang

TL;DR
This paper introduces TOSS, a two-stage code search framework that combines IR, bi-encoder, and cross-encoder models to improve accuracy and efficiency in code retrieval tasks across multiple languages.
Contribution
TOSS effectively fuses different code search methods in a two-stage process, achieving state-of-the-art accuracy with improved efficiency over existing approaches.
Findings
TOSS outperforms baseline models with an MRR of 0.763.
TOSS is efficient and effective across multiple programming languages.
Compared to six data fusion methods, TOSS achieves superior results.
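The MRR figure above is a mean reciprocal rank. As a minimal sketch of how that metric is commonly computed (the function name and the input layout are illustrative assumptions, not taken from the paper), each query contributes the reciprocal of the rank at which its correct snippet appears, or zero if it is missing:

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Compute MRR over a set of queries.

    ranked_lists: one ranked list of candidate ids per query (best first).
    relevant:     the single correct candidate id for each query.
    Both argument names are hypothetical, chosen for this sketch.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranking, target in zip(ranked_lists, relevant):
        if target in ranking:
            # Ranks are 1-based, so the top hit contributes 1.0.
            total += 1.0 / (ranking.index(target) + 1)
        # A query whose target was not retrieved contributes 0.
    return total / len(ranked_lists)
```

Under this definition, a score near 1 means the correct snippet is usually ranked first, while a score near 0.5 means it tends to sit around the second position.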
Abstract
With a good code search engine, developers can reuse existing code snippets and accelerate the software development process. Current code search methods can be divided into two categories: traditional information retrieval (IR) based and deep learning (DL) based approaches. DL-based approaches include the cross-encoder paradigm and the bi-encoder paradigm. However, both approaches have certain limitations. Inference with IR-based and bi-encoder models is fast but not accurate enough, while cross-encoder models achieve higher search accuracy but take more time. In this work, we propose TOSS, a two-stage fusion code search framework that can combine the advantages of different code search methods. TOSS first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates, and then uses fine-grained cross-encoders for finer ranking.…
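The recall-then-rerank flow described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the TOSS implementation: the three scorers are toy, dependency-free stand-ins (token overlap for an IR model, bag-of-words cosine for a bi-encoder, a bigram-aware joint score for a cross-encoder), and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumerics, so read_file -> read, file.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


def overlap_score(query, code):
    # Toy stand-in for an IR-based scorer: number of shared distinct tokens.
    return len(set(tokenize(query)) & set(tokenize(code)))


def cosine_score(query, code):
    # Toy stand-in for a bi-encoder: cosine similarity of bag-of-words vectors.
    q, c = Counter(tokenize(query)), Counter(tokenize(code))
    dot = sum(q[t] * c[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in c.values())
    )
    return dot / norm if norm else 0.0


def toy_joint_score(query, code):
    # Toy stand-in for a cross-encoder: looks at query and code together,
    # rewarding query bigrams that appear adjacent in the code.
    q, c = tokenize(query), tokenize(code)
    code_bigrams = set(zip(c, c[1:]))
    bigram_hits = sum(1 for bg in zip(q, q[1:]) if bg in code_bigrams)
    coverage = len(set(q) & set(c)) / len(set(q)) if q else 0.0
    return bigram_hits + coverage


def recall_top_k(query, corpus, scorer, k):
    # Stage 1 helper: indices of the k best snippets under a cheap scorer.
    order = sorted(
        range(len(corpus)), key=lambda i: scorer(query, corpus[i]), reverse=True
    )
    return order[:k]


def two_stage_search(query, corpus, recall_scorers, rerank_scorer, k=2):
    # Stage 1: union of the top-k candidates from every cheap recall scorer.
    candidates = []
    for scorer in recall_scorers:
        for idx in recall_top_k(query, corpus, scorer, k):
            if idx not in candidates:
                candidates.append(idx)
    # Stage 2: apply the expensive scorer only to the small candidate set.
    return sorted(
        candidates, key=lambda i: rerank_scorer(query, corpus[i]), reverse=True
    )


corpus = [
    "def read_file(path): return open(path).read()",
    "def sort_list(items): return sorted(items)",
    "def write_file(path, data): open(path, 'w').write(data)",
]
ranking = two_stage_search(
    "read file path", corpus, [overlap_score, cosine_score], toy_joint_score, k=2
)
```

The point of the design is cost control: the expensive scorer runs on at most k candidates per recall model instead of the whole corpus, which is where the efficiency gain over a pure cross-encoder search would come from.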
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Topic Modeling
