Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities

Wei Ma; Shangqing Liu; Mengjie Zhao; Xiaofei Xie; Wenhan Wang; Qiang Hu; Jie Zhang; Yang Liu

arXiv:2212.10017 · cs.SE · April 18, 2024 · 5 cites

Open Access

TL;DR

This study evaluates seven code pre-trained and large language models to understand their capabilities in capturing code syntax and semantics, revealing strengths in syntax comprehension and variability in semantic understanding.

Contribution

The paper introduces four probing tasks to systematically analyze how different models represent code syntax and semantics, providing new insights into their strengths and weaknesses.

Findings

01 Models are proficient in understanding code syntax.

02 Semantic encoding abilities vary across models.

03 Insights can guide future model improvements.

Abstract

Past research has examined how well code models grasp code syntax, yet their understanding of code semantics remains largely unexplored. We extensively analyze seven code models to investigate how they represent code syntax and semantics. These include four prominent code pre-trained models (CodeBERT, GraphCodeBERT, CodeT5, and UniXcoder) and three large language models (StarCoder, CodeLlama, and CodeT5+). We have developed four probing tasks to evaluate the models' abilities to learn code syntax and semantics. These tasks focus on reconstructing code syntax and semantic structures, such as the abstract syntax tree (AST), control flow graph (CFG), control dependency graph (CDG), and data dependency graph (DDG), within the models' representation spaces. These structures are fundamental to understanding code. Additionally, we explore the role of syntax tokens in each token representation and the extended dependencies among code tokens. Furthermore, we examine the distribution…
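
To make the idea of "reconstructing" a structure such as the AST more concrete, the sketch below shows one way ground-truth labels for a syntax probe could be derived. It is only an illustration under stated assumptions, not the authors' pipeline: it uses Python's standard `ast` module on Python source, the function name `ast_pair_labels` is hypothetical, and it labels pairs of AST nodes rather than model tokens. A probe would then be trained to predict these labels from a model's representations.

```python
import ast
from itertools import combinations

# Operator and context helper nodes may be shared between parents in the
# parsed tree, so this sketch leaves them out of the node list.
_SKIP = (ast.expr_context, ast.operator, ast.unaryop, ast.boolop, ast.cmpop)


def ast_pair_labels(source):
    """Return (node type names, labels) for a Python snippet.

    labels maps each unordered node pair (i, j), with i < j, to 1 if the
    two nodes are joined by a parent-child edge in the AST, else 0.
    Hypothetical helper for illustration only.
    """
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if not isinstance(n, _SKIP)]
    index = {id(n): i for i, n in enumerate(nodes)}

    # Collect parent-child edges as sorted index pairs.
    edges = set()
    for parent in nodes:
        for child in ast.iter_child_nodes(parent):
            if id(child) in index:
                i, j = index[id(parent)], index[id(child)]
                edges.add((min(i, j), max(i, j)))

    labels = {pair: int(pair in edges)
              for pair in combinations(range(len(nodes)), 2)}
    return [type(n).__name__ for n in nodes], labels


if __name__ == "__main__":
    names, labels = ast_pair_labels("x = 1")
    print(names)
    print(labels)
```

Because the retained nodes form a tree, the number of positive labels equals the number of nodes minus one; every other pair is a negative example for the probe.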

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: CodeBERT