Tackling Long Code Search with Splitting, Encoding, and Aggregating

Fan Hu; Yanlin Wang; Lun Du; Hongyu Zhang; Shi Han; Dongmei Zhang; Xirong Li

arXiv:2208.11271 · cs.SE · March 27, 2024

TL;DR

This paper introduces SEA (Split, Encode and Aggregate), a method that splits long code into blocks, encodes each block, and aggregates the block embeddings into a single representation. It lets Transformer-based code search models handle long code without re-pretraining and improves retrieval of long code.

Contribution

SEA provides a simple yet effective baseline for long code search by splitting, encoding, and aggregating code, compatible with existing pretrained models without re-pretraining.

Findings

01 SEA outperforms existing models on the CodeSearchNet benchmark.

02 SEA achieves a 10.1% higher mean reciprocal rank than GraphCodeBERT.

03 SEA enables effective long code modeling without changing the internal structure of the underlying models.
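
For readers unfamiliar with the metric in the second finding, mean reciprocal rank (MRR) averages the reciprocal of the position at which the correct code snippet appears in each query's ranked results. The snippet below is a minimal sketch of that standard definition. The function name and the example ranks are illustrative assumptions, not taken from the paper's evaluation code.

```python
def mean_reciprocal_rank(ranks):
    """Return the MRR given the 1-based rank of the correct snippet per query.

    `ranks` is a hypothetical input format chosen for this sketch.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three made-up queries whose correct snippet is ranked 1st, 2nd, and 4th.
example_mrr = mean_reciprocal_rank([1, 2, 4])
```

A higher MRR means the correct snippet tends to appear nearer the top of the result list, and a perfect retriever scores 1.0.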

Abstract

Code search with natural language helps us reuse existing code snippets. Thanks to Transformer-based pretrained models, the performance of code search has improved significantly. However, due to the quadratic complexity of multi-head self-attention, there is a limit on the input token length. For efficient training on standard GPUs such as the V100, existing pretrained code models, including GraphCodeBERT, CodeBERT, and RoBERTa (code), take the first 256 tokens by default, which makes them unable to represent the complete information of code longer than 256 tokens. To tackle the long code problem, we propose a new baseline, SEA (Split, Encode and Aggregate), which splits long code into code blocks, encodes these blocks into embeddings, and aggregates them to obtain a comprehensive long code representation. With SEA, we could directly use Transformer-based pretraining models…
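
The split, encode, and aggregate pipeline described in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. The fixed-size token window, the `toy_encode` stand-in for a pretrained encoder, and the mean-pooling aggregation are all placeholders chosen for this sketch, and every function name here is hypothetical.

```python
import zlib
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_blocks(tokens: Sequence[str], window: int = 256,
                      stride: int = 256) -> List[List[str]]:
    """Split a token sequence into blocks of at most `window` tokens.

    Assumption: a plain sliding window over tokens. A stride smaller
    than the window would make consecutive blocks overlap.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not tokens:
        return [[]]
    blocks = []
    start = 0
    while start < len(tokens):
        blocks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last block already reaches the end of the code
        start += stride
    return blocks


def toy_encode(block: Sequence[str], dim: int = 8) -> Vector:
    """Hypothetical stand-in for a pretrained code encoder.

    It counts tokens into hashed buckets so the sketch runs without a
    model. A real setup would call an encoder such as GraphCodeBERT.
    """
    vec = [0.0] * dim
    for tok in block:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def aggregate_mean(embeddings: Sequence[Vector]) -> Vector:
    """Average block embeddings (placeholder aggregation for this sketch)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def sea_embed(tokens: Sequence[str],
              encode: Callable[[Sequence[str]], Vector] = toy_encode,
              window: int = 256, stride: int = 256) -> Vector:
    """Split the code, encode each block, and aggregate the embeddings."""
    blocks = split_into_blocks(tokens, window, stride)
    return aggregate_mean([encode(block) for block in blocks])


# A made-up 600-token "function": longer than a single 256-token window.
long_code_tokens = [f"tok{i}" for i in range(600)]
long_code_blocks = split_into_blocks(long_code_tokens)
long_code_vector = sea_embed(long_code_tokens)
```

With a 256-token window, the 600-token input yields three blocks, so tokens beyond the first 256 still contribute to the final vector instead of being truncated. Because the encoder is passed in as a parameter, the underlying model is used as-is and only the surrounding splitting and aggregation code changes.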

Code & Models

Repositories

fly-dragon211/SEA
PyTorch · Official

Taxonomy

Topics: Algorithms and Data Compression · Web Data Mining and Analysis · Natural Language Processing Techniques