Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities
Wei Ma, Shangqing Liu, Mengjie Zhao, Xiaofei Xie, Wenhan Wang, Qiang Hu, Jie Zhang, Yang Liu

TL;DR
This study evaluates seven code models (four code pre-trained models and three large language models) to assess how well they capture code syntax and semantics, finding strong syntax comprehension and variable semantic understanding.
Contribution
The paper introduces four probing tasks to systematically analyze how different models represent code syntax and semantics, providing new insights into their strengths and weaknesses.
Findings
Models are proficient in understanding code syntax.
Semantic encoding abilities vary across models.
Insights can guide future model improvements.
Abstract
Past research has examined how well code models grasp code syntax, yet their understanding of code semantics remains largely unexplored. We extensively analyze seven code models to investigate how they represent code syntax and semantics. These include four prominent code pre-trained models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) and three large language models (StarCoder, CodeLlama, and CodeT5+). We have developed four probing tasks to evaluate the models' abilities to learn code syntax and semantics. These tasks focus on reconstructing code syntax and semantic structures (such as AST, CFG, CDG, and DDG) within the models' representation spaces. These structures are fundamental to understanding code. Additionally, we explore the role of syntax tokens in each token representation and the extended dependencies among code tokens. Furthermore, we examine the distribution…
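To make the idea of a syntax structure as a probing target concrete, here is a minimal sketch of extracting an AST and its pairwise node distances from a snippet. It is not the paper's probing method, which this summary does not describe in detail. It only illustrates the kind of tree structure a probe might try to recover from a model's representation space. It assumes Python source and uses the standard-library ast module; the function names ast_edges and tree_distances are hypothetical.

```python
import ast
from collections import deque


def ast_edges(source):
    """Parse Python source and return (labels, edges).

    labels[i] is the node-type name of the i-th node in pre-order;
    edges is a list of (parent_index, child_index) pairs.
    Hypothetical helper for illustration only.
    """
    labels, edges = [], []

    def visit(node, parent):
        idx = len(labels)
        labels.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(ast.parse(source), None)
    return labels, edges


def tree_distances(n_nodes, edges):
    """All-pairs path lengths in the tree, via BFS from every node."""
    adj = [[] for _ in range(n_nodes)]
    for parent, child in edges:
        adj[parent].append(child)
        adj[child].append(parent)
    dist = [[-1] * n_nodes for _ in range(n_nodes)]
    for start in range(n_nodes):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[start][v] < 0:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


if __name__ == "__main__":
    labels, edges = ast_edges("x = 1")
    dist = tree_distances(len(labels), edges)
    print(labels)
    print(edges)
    print(dist[0])
```

Under this framing, a probe would be trained to predict such edges or distances from a model's token representations. How well it succeeds is one possible signal of how much syntax those representations encode.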
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: CodeBERT
