Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models
Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen

TL;DR
This paper shows that serializing programs as parse trees (concrete syntax trees) and then continuing pre-training and fine-tuning on them improves pre-trained code models across code tasks without architectural changes, with the largest gains when training data is limited.
Contribution
The work introduces a method to adapt pre-trained code models using program structures like parse trees, enhancing data efficiency without changing the model architecture.
Findings
Improved performance on code tasks with structural pre-training.
Significant gains in low-data scenarios.
Structural adaptation benefits models pre-trained only on plain text.
Abstract
Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures inherent in programming languages. In this work, we explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures. Specifically, we represent programs as parse trees -- also known as concrete syntax trees (CSTs) -- and adapt pre-trained models on serialized CSTs. Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs without changing the model architecture yields improvements over the baseline approach across various code tasks. The improvements are found to be particularly…
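To make "serialized CSTs" concrete, here is a minimal sketch of turning a parse tree into a flat token sequence that an unchanged sequence model could consume. The `Node` class, the hand-built tree, and the `<label> ... </label>` tag format are all illustrative assumptions; the abstract does not specify the serialization scheme, and a real pipeline would obtain the tree from a parser rather than construct it by hand.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """Hypothetical parse-tree node: a non-terminal label plus ordered children."""
    label: str
    # A child is either another Node or a str holding a source token (leaf).
    children: List[Union["Node", str]] = field(default_factory=list)


def serialize(tree: Union[Node, str]) -> List[str]:
    """Depth-first serialization: each non-terminal is wrapped in paired tags
    (an assumed format), and leaves are emitted as the original source tokens."""
    if isinstance(tree, str):
        return [tree]
    out = [f"<{tree.label}>"]
    for child in tree.children:
        out.extend(serialize(child))
    out.append(f"</{tree.label}>")
    return out


def surface_tokens(tree: Union[Node, str]) -> List[str]:
    """Recover the plain-text token sequence by keeping only the leaves."""
    if isinstance(tree, str):
        return [tree]
    return [tok for child in tree.children for tok in surface_tokens(child)]


# Hand-built tree for the statement: x = a + 1
tree = Node("assignment", [
    Node("identifier", ["x"]),
    "=",
    Node("binary_operator", [
        Node("identifier", ["a"]),
        "+",
        Node("integer", ["1"]),
    ]),
])

serialized = " ".join(serialize(tree))
print(serialized)
print(" ".join(surface_tokens(tree)))
```

Because a concrete syntax tree keeps every source token as a leaf, dropping the tags recovers the surface form, which is what lets a model pre-trained on plain text be adapted to the structured form without changing its architecture.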
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software Testing and Debugging Techniques
