Tackling Long Code Search with Splitting, Encoding, and Aggregating
Fan Hu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Xirong Li

TL;DR
This paper introduces SEA (Split, Encode and Aggregate), a method that splits long code into blocks, encodes each block, and aggregates the block embeddings into a single representation. It lets existing Transformer-based code search models handle long code without re-pretraining and improves retrieval performance on long code.
Contribution
SEA is a simple yet effective baseline for long code search: it splits code into blocks, encodes them, and aggregates the results. It works with existing pretrained models and requires no re-pretraining.
Findings
SEA outperforms existing models on the CodeSearchNet benchmark
Achieves 10.1% higher mean reciprocal rank than GraphCodeBERT
Enables effective long code modeling without changing internal model structures
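For readers unfamiliar with the metric named above, mean reciprocal rank (MRR) averages the reciprocal of the rank at which the correct code snippet appears for each query. The following is a minimal sketch of the standard definition; the function name and input format are illustrative and are not taken from the paper.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Compute MRR from 1-based ranks of the correct result, one per query.

    `ranks` is an illustrative input format: ranks[i] is the position of
    the ground-truth code snippet in the result list for query i.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three queries whose correct snippets were ranked 1st, 2nd, and 4th.
example_mrr = mean_reciprocal_rank([1, 2, 4])
```

A higher MRR means the correct snippet tends to appear nearer the top of the ranked results.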
Abstract
Code search with natural language helps us reuse existing code snippets. Thanks to the Transformer-based pretraining models, the performance of code search has been improved significantly. However, due to the quadratic complexity of multi-head self-attention, there is a limit on the input token length. For efficient training on standard GPUs like V100, existing pretrained code models, including GraphCodeBERT, CodeBERT, and RoBERTa (code), take the first 256 tokens by default, which makes them unable to represent the complete information of long code that is greater than 256 tokens. To tackle the long code problem, we propose a new baseline SEA (Split, Encode and Aggregate), which splits long code into code blocks, encodes these blocks into embeddings, and aggregates them to obtain a comprehensive long code representation. With SEA, we could directly use Transformer-based pretraining models…
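The split, encode, and aggregate pipeline described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the excerpt does not specify how blocks are formed or how embeddings are combined, so the sketch assumes fixed-size token windows and mean pooling, and `toy_encoder` is a hypothetical stand-in for a pretrained model such as GraphCodeBERT.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_blocks(tokens: Sequence[str], block_size: int = 256) -> List[List[str]]:
    """Split tokens into consecutive blocks of at most `block_size` tokens.

    Assumption: fixed-size windows; the paper may split differently.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]


def toy_encoder(block: Sequence[str], dim: int = 4) -> Vector:
    """Hypothetical stand-in for a pretrained encoder.

    It counts tokens by (token length mod dim), only so the sketch runs
    without any model weights.
    """
    vec = [0.0] * dim
    for tok in block:
        vec[len(tok) % dim] += 1.0
    return vec


def mean_aggregate(embeddings: Sequence[Vector]) -> Vector:
    """Average block embeddings element-wise (assumed aggregation method)."""
    if not embeddings:
        raise ValueError("need at least one block embedding")
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def sea_embed(
    tokens: Sequence[str],
    encoder: Callable[[Sequence[str]], Vector] = toy_encoder,
    block_size: int = 256,
) -> Vector:
    """Split, encode each block independently, then aggregate."""
    blocks = split_into_blocks(tokens, block_size)
    return mean_aggregate([encoder(block) for block in blocks])


# A 600-token input yields three blocks of 256, 256, and 88 tokens.
long_code_tokens = ["a"] * 600
code_embedding = sea_embed(long_code_tokens)
```

Because each block fits within the encoder's input limit, the encoder itself is left unchanged; only the splitting and aggregation steps wrap around it.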
Taxonomy
Topics: Algorithms and Data Compression · Web Data Mining and Analysis · Natural Language Processing Techniques
