Which Features are Learned by CodeBert: An Empirical Study of the BERT-based Source Code Representation Learning

Lan Zhang; Chen Cao; Zhilong Wang; Peng Liu

arXiv:2301.08427·cs.CL·August 14, 2023

Open Access

TL;DR

This paper empirically investigates what features CodeBERT learns for source code representation, revealing its reliance on variable and function names rather than understanding code logic.

Contribution

It provides an empirical analysis showing that CodeBERT primarily captures identifier information, highlighting limitations in understanding source code logic.

Findings

01

CodeBERT heavily relies on variable and function names.

02

Current models do not effectively understand source code logic.

03

Insights suggest directions for improving code representation learning.

Abstract

Bidirectional Encoder Representations from Transformers (BERT) was proposed for natural language processing (NLP) and has shown promising results. Researchers have recently applied BERT to source-code representation learning and reported encouraging results on several downstream tasks. In this paper, however, we show that current methods cannot effectively understand the logic of source code: the learned representation relies heavily on programmer-defined variable and function names. We design and implement a set of experiments to test this conjecture and provide insights for future work.
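One way to probe the reliance on identifier names is to strip those names from a program while keeping its logic, then compare how a model represents the two versions. The sketch below only illustrates that idea and is not the authors' experimental setup, which this page does not describe. The function `anonymize_identifiers` and the `func_`/`var_` naming scheme are hypothetical, and Python with its standard `ast` module (3.9 or later, for `ast.unparse`) is assumed purely for convenience.

```python
import ast


def anonymize_identifiers(source: str) -> str:
    """Replace programmer-defined function, argument, and variable names
    with neutral placeholders, leaving the program logic unchanged.

    Simplification (assumption): attribute names, keyword-argument names
    at call sites, imports, and class names are left untouched.
    """
    tree = ast.parse(source)
    mapping = {}

    def alias(name, prefix):
        # Assign each distinct name one stable placeholder.
        if name not in mapping:
            mapping[name] = f"{prefix}{len(mapping)}"
        return mapping[name]

    # Pass 1: collect the names that the program itself defines.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            alias(node.name, "func_")
        elif isinstance(node, ast.arg):
            alias(node.arg, "var_")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            alias(node.id, "var_")

    # Pass 2: rewrite every definition and use of those names.
    # Names never defined in the program (e.g. builtins) are not in
    # the mapping and are therefore kept as they are.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = mapping[node.name]
        elif isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree)


original = """
def compute_sum(values):
    total = 0
    for item in values:
        total += item
    return total
"""

anonymized = anonymize_identifiers(original)
print(anonymized)
```

The anonymized program computes the same result as the original, so any large change in a model's output between the two versions would point to the names rather than the logic as the source of the learned signal.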

Taxonomy

Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research

Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Linear Layer · Dropout · Weight Decay · WordPiece · Attention Dropout · Residual Connection