Code Execution with Pre-trained Language Models

Chenxiao Liu; Shuai Lu; Weizhu Chen; Daxin Jiang; Alexey Svyatkovskiy; Shengyu Fu; Neel Sundaresan; Nan Duan

arXiv:2305.05383 · cs.PL · May 10, 2023 · 2 cites

TL;DR

This paper explores how pre-trained language models can understand and perform code execution, introducing a new dataset and a specialized Transformer model called CodeExecutor to improve semantic comprehension of code.

Contribution

The paper introduces a mutation-based data augmentation technique, a new dataset, and the CodeExecutor model that leverages code execution pre-training and curriculum learning.
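
As a rough illustration of what mutation-based data augmentation can look like, the sketch below rewrites a small Python program by swapping operators in its syntax tree, producing a variant that still runs but behaves differently. The `SwapOperators` class, the `mutate` helper, and the two mutation rules are hypothetical choices made for this example; this page does not describe the paper's actual mutation operators. The sketch assumes Python 3.9 or later for `ast.unparse`.

```python
import ast


class SwapOperators(ast.NodeTransformer):
    """Hypothetical mutation rules: turn `+` into `-` and `<` into `<=`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # mutate nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node


def mutate(source: str) -> str:
    """Parse `source`, apply the mutation rules, and return the new source text."""
    tree = SwapOperators().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = (
    "total = 0\n"
    "for i in range(3):\n"
    "    if i < 2:\n"
    "        total = total + i\n"
)
mutant = mutate(original)
print(mutant)
```

Each mutant is a new program whose execution can be recorded, so one seed program can yield many distinct training examples.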

Findings

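01  CodeExecutor shows promising performance on code execution tasks.

02  Existing pre-trained models, such as Codex, are challenged by the code execution task, which points to limits in their understanding of code semantics.

03  Code execution pre-training offers potential improvements for code intelligence tasks such as code search and generation.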

Abstract

Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code…
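
To make the notion of an execution trace concrete, here is a minimal sketch that runs a snippet and records the line number and variable state at each step, using Python's standard `sys.settrace` hook. The `collect_trace` function and the `(line number, variables)` record layout are assumptions made for illustration; the trace format used to train CodeExecutor is not given on this page.

```python
import sys


def collect_trace(source: str):
    """Run `source` and return a list of (line number, variables) pairs.

    Each pair reflects the state just BEFORE that line executes, because the
    interpreter fires a "line" event before running the line.
    """
    code = compile(source, "<snippet>", "exec")
    trace = []

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the snippet being traced.
        if frame.f_code.co_filename != "<snippet>":
            return None
        if event == "line":
            state = {k: v for k, v in frame.f_locals.items() if not k.startswith("__")}
            trace.append((frame.f_lineno, state))
        return tracer

    previous = sys.gettrace()  # restore any tracer that was already installed
    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(previous)
    return trace


snippet = "x = 1\ny = x + 2\nz = y * 2\n"
for lineno, state in collect_trace(snippet):
    print(lineno, state)
```

A model trained for code execution would be asked to predict a sequence of this kind from the source text alone, without running the program.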

Code & Models

Repositories

microsoft/CodeBERT
PyTorch · Official

Datasets

notoriousdto/synthetic-elisp-alpha-0.1
dataset · 9 dl

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Dense Connections · Residual Connection · Adam