Which Features are Learned by CodeBert: An Empirical Study of the BERT-based Source Code Representation Learning
Lan Zhang, Chen Cao, Zhilong Wang, Peng Liu

TL;DR
This paper empirically investigates what features CodeBERT learns for source code representation, revealing its reliance on variable and function names rather than understanding code logic.
Contribution
It provides an empirical analysis showing that CodeBERT primarily captures identifier information, highlighting limitations in understanding source code logic.
Findings
CodeBERT heavily relies on variable and function names.
Current models do not effectively understand source code logic.
Insights suggest directions for improving code representation learning.
Abstract
Bidirectional Encoder Representations from Transformers (BERT) was proposed for natural language processing (NLP) and has shown promising results. Recently, researchers have applied BERT to source-code representation learning and reported good results on several downstream tasks. However, in this paper, we show that current methods cannot effectively understand the logic of source code. The representation of source code relies heavily on programmer-defined variable and function names. We design and implement a set of experiments to test this conjecture and provide some insights for future work.
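To make the central claim concrete, here is a minimal sketch of one way to strip programmer-defined names from a snippet while leaving its logic unchanged. It is not the authors' experimental code, since the abstract does not describe their setup. The class `IdentifierAnonymizer`, the function `anonymize`, and the `funcN`/`varN` placeholder scheme are all hypothetical names chosen for illustration. The sketch works on Python source with the standard `ast` module (Python 3.9+ for `ast.unparse`) and handles only simple, self-contained functions: it does not rename attributes, keyword arguments, or imported names.

```python
import ast
import builtins


class IdentifierAnonymizer(ast.NodeTransformer):
    """Rename programmer-defined names to placeholders; logic is untouched."""

    def __init__(self):
        self.mapping = {}  # original name -> placeholder

    def _rename(self, name, prefix):
        # Leave builtins such as `len` or `range` as they are.
        if hasattr(builtins, name):
            return name
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name, "func")
        self.generic_visit(node)  # also visits arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg, "var")
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id, "var")
        return node


def anonymize(source: str) -> str:
    """Return `source` with identifiers replaced by placeholders (hypothetical helper)."""
    tree = IdentifierAnonymizer().visit(ast.parse(source))
    return ast.unparse(tree)


original = """
def sum_positive(numbers):
    total = 0
    for value in numbers:
        if value > 0:
            total += value
    return total
"""

anonymized = anonymize(original)
print(anonymized)
```

The two versions compute the same result, so a model that understood the logic should treat them similarly. Comparing a model's representations or downstream predictions on `original` and `anonymized` is one possible way to probe how much it depends on names.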
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Linear Layer · Dropout · Weight Decay · WordPiece · Attention Dropout · Residual Connection
