AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

Linyuan Gong; Mostafa Elhoushi; Alvin Cheung

arXiv:2401.03003 · cs.SE · June 25, 2024 · 6 citations

TL;DR

AST-T5 is a structure-aware pretraining method that uses Abstract Syntax Trees (ASTs) to improve performance on code generation and understanding tasks, without requiring complex program analyses.

Contribution

The paper presents an AST-based pretraining paradigm that retains code structure during pretraining and outperforms previous models such as CodeT5 on code-to-code tasks.

Findings

01. AST-T5 consistently outperforms similar-sized LMs across code-related tasks.

02. Exceeds CodeT5 by 2 points in exact match score on Bugs2Fix.

03. Exceeds CodeT5 by 3 points in exact match score on Java-C# transpilation.

Abstract

Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score…
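The idea behind AST-Aware Span Corruption can be illustrated with a small sketch. This is not the authors' implementation: it assumes Python source, uses the standard-library `ast` module as the parser, works on character offsets rather than subword tokens, and borrows the T5-style `<extra_id_N>` sentinel naming. The helper names `subtree_spans`, `pick_spans`, and `corrupt` are hypothetical. The point it shows is only that masked spans are chosen to coincide with whole AST subtrees instead of arbitrary token runs.

```python
import ast
import random


def subtree_spans(source):
    """Return sorted (start, end) character offsets of AST nodes with position info.

    Assumption: the source is ASCII, so the byte-based col_offset values
    reported by `ast` equal character offsets.
    """
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = set()
    for node in ast.walk(ast.parse(source)):
        if getattr(node, "end_lineno", None) is None:
            continue  # e.g. Module, operators, and contexts carry no position
        start = line_starts[node.lineno - 1] + node.col_offset
        end = line_starts[node.end_lineno - 1] + node.end_col_offset
        spans.add((start, end))
    return sorted(spans)


def pick_spans(candidates, max_chars, rng):
    """Randomly pick non-overlapping subtree spans of at most max_chars characters."""
    pool = [s for s in candidates if 0 < s[1] - s[0] <= max_chars]
    rng.shuffle(pool)
    chosen = []
    for start, end in pool:
        if all(end <= cs or start >= ce for cs, ce in chosen):
            chosen.append((start, end))
    return sorted(chosen)


def corrupt(source, spans):
    """Replace each span with a sentinel; return (model_input, target).

    `spans` must be sorted and non-overlapping.
    """
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.append(source[cursor:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.append(source[start:end])
        cursor = end
    inp.append(source[cursor:])
    return "".join(inp), "".join(tgt)


if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b\n"
    masked = pick_spans(subtree_spans(code), max_chars=12, rng=random.Random(0))
    model_input, target = corrupt(code, masked)
    print(model_input)
    print(target)
```

Because every masked span is a complete subtree (an expression, a statement, an argument), the target the decoder must reconstruct is always a syntactically meaningful unit. A real pipeline would operate on subword tokens and control the masked ratio and subtree size, which this sketch leaves out.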

Code & Models

Repositories

gonglinyuan/ast_t5 (PyTorch, official)

Models

gonglinyuan/ast_t5_base
model · 98 downloads · 6 likes

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Advanced Malware Detection Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Gated Linear Unit · Inverse Square Root Schedule · Linear Layer · Adafactor · SentencePiece · Attention Dropout · Dropout