How to get better embeddings with code pre-trained models? An empirical study

Yu Zhao; Lina Gong; Haoxiang Zhang; Yaoshen Yu; Zhiqiu Huang

arXiv:2311.08066·cs.SE·November 15, 2023·1 cites

Open Access

TL;DR

This empirical study evaluates various code pre-trained models for software engineering tasks, revealing that token aggregation methods and model architecture significantly impact embedding quality and semantic richness.

Contribution

The paper systematically compares five code PTMs across different architectures and tasks, providing insights into effective embedding strategies for SE applications.

Findings

01

Token-based embeddings do not fully capture code semantics.

02

Combining code and text data as in pre-training yields poor embeddings.

03

Decoder-only PTMs can produce high-quality, semantically rich code embeddings.
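
The contrast behind these findings, between taking a single special token as the snippet embedding and aggregating over all token vectors, can be illustrated with a minimal sketch. It uses plain Python lists in place of real model outputs. The function names, the toy vectors, and the choice of mean pooling as the aggregation method are assumptions made for illustration; the aggregation strategies the paper actually evaluates are not listed on this page.

```python
from typing import List

Vector = List[float]


def special_token_embedding(token_vectors: List[Vector]) -> Vector:
    """Use the vector at position 0 (a [CLS]-style special token) as the embedding."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    return list(token_vectors[0])


def mean_pooled_embedding(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the vectors of all non-padding tokens (mask value 1)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical per-token vectors for one code snippet (4 positions, 2 dimensions).
# The last position is padding and is excluded by the mask.
toy_vectors = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 1, 0]

cls_style = special_token_embedding(toy_vectors)
pooled = mean_pooled_embedding(toy_vectors, toy_mask)
print(cls_style, pooled)
```

With a real PTM, `toy_vectors` would be the final-layer hidden states for one input sequence and `toy_mask` the tokenizer's attention mask. The two functions give different vectors for the same input, which is why the choice between them can affect downstream classification.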

Abstract

Pre-trained language models have demonstrated powerful capabilities in the field of natural language processing (NLP). Recently, code pre-trained models (PTMs), which draw on experience from the NLP field, have also achieved state-of-the-art results in many software engineering (SE) downstream tasks. These code PTMs take into account the differences between programming languages and natural languages during pre-training and adjust their pre-training tasks and input data accordingly. However, researchers in the SE community still inherit habits from the NLP field when using these code PTMs to generate embeddings for SE downstream classification tasks, such as generating semantic embeddings for code snippets through special tokens and feeding in code and text information in the same way as during pre-training. In this paper, we empirically study five different PTMs (i.e. CodeBERT, CodeT5,…

Taxonomy

Topics: Software Engineering Research · Software Reliability and Analysis Research · Software System Performance and Reliability