GraphCodeBERT: Pre-training Code Representations with Data Flow
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, Ming Zhou

TL;DR
GraphCodeBERT is a pre-trained model for programming languages that incorporates the data flow of code, a semantic-level structure, instead of a syntactic-level structure such as the abstract syntax tree (AST), and improves performance on several code-related tasks.
Contribution
Introduces a pre-training approach that uses data flow graphs together with structure-aware pre-training tasks, so the model learns from code structure rather than from token sequences alone.
Findings
Achieves state-of-the-art results on code search, clone detection, code translation, and code refinement.
Incorporating data flow structures improves model performance.
In code search, the model prefers structure-level attention over token-level attention.
Abstract
Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which…
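As a rough illustration of the "where-the-value-comes-from" relation described in the abstract, the sketch below extracts such edges from straight-line Python assignments using the standard `ast` module. It is not the authors' extraction pipeline: the function name `value_comes_from_edges` is hypothetical, edges are keyed by variable name rather than by individual variable occurrence, and control flow, attribute access, and tuple unpacking are ignored.

```python
import ast


def value_comes_from_edges(source):
    """Return (target, origin) pairs: the value of `target` comes from `origin`.

    Simplifying assumption: only plain `name = expression` assignments are
    handled, and every name read on the right-hand side counts as an origin
    (this includes names of called functions).
    """
    edges = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        # Names read on the right-hand side, de-duplicated and sorted
        # so the output order is deterministic.
        origins = sorted(
            {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
        )
        for target in node.targets:
            if isinstance(target, ast.Name):
                edges.extend((target.id, origin) for origin in origins)
    return edges


snippet = "x = a + b\ny = x * 2\n"
edges = value_comes_from_edges(snippet)
for target, origin in edges:
    print(f"{target} <- {origin}")
```

For the two-line snippet, the edges say that `x` takes its value from `a` and `b`, and `y` takes its value from `x`. The resulting graph is flat, with no nesting of the kind an AST for the same code would have.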
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software Testing and Debugging Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dropout · Dense Connections · Byte Pair Encoding · Label Smoothing · Multi-Head Attention · Attention Is All You Need
