How to Select Pre-Trained Code Models for Reuse? A Learning Perspective

Zhangqian Bi; Yao Wan; Zhaoyang Chu; Yufei Hu; Junyi Zhang; Hongyu Zhang; Guandong Xu; Hai Jin

arXiv:2501.03783·cs.SE·January 8, 2025


TL;DR

This paper proposes learning-based strategies for efficiently selecting pre-trained code models, substantially reducing selection time while maintaining high performance on code intelligence tasks.

Contribution

The paper introduces a learning-based model selection approach for pre-trained code models that outperforms traditional selection methods in both efficiency and effectiveness.

Findings

01. Learning-based selection reduces selection time from 2700 hours to 100 seconds.

02. Traditional selection methods either perform poorly or are costly.

03. The proposed methods incur less than 6% performance loss.
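
To put the first finding in perspective, the reduction can be worked out directly. The following is a minimal sketch, assuming both figures are wall-clock times for the same selection task; the function name `speedup` and the constant names are illustrative and do not come from the paper.

```python
# Figures quoted in the first finding above.
BASELINE_HOURS = 2700     # selection time before the learning-based approach
LEARNED_SECONDS = 100     # selection time with the learning-based approach

SECONDS_PER_HOUR = 3600


def speedup(baseline_hours: float, new_seconds: float) -> float:
    """Return how many times faster the new method is than the baseline."""
    return baseline_hours * SECONDS_PER_HOUR / new_seconds


factor = speedup(BASELINE_HOURS, LEARNED_SECONDS)
print(f"{factor:,.0f}x faster")  # prints "97,200x faster"
```

In other words, 2700 hours is 9,720,000 seconds, so the quoted figures correspond to a reduction of roughly five orders of magnitude.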

Abstract

Pre-training a language model and then fine-tuning it has been shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pre-training language models on a large-scale code corpus is computationally expensive. Fortunately, many off-the-shelf Pre-trained Code Models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capability during pre-training, which enhances their performance on downstream code intelligence tasks. With an increasing number of these public pre-trained models, selecting the most suitable one to reuse for a specific task is essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods…

Code & Models

Repositories

CGCL-codes/naturalcc (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Model-Driven Software Engineering Techniques · Mathematics, Computing, and Information Processing

Methods: Attention Is All You Need · Byte Pair Encoding · Gated Linear Unit · Residual Connection · Dropout · SentencePiece · Softmax · Linear Layer · Inverse Square Root Schedule