AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
Linyuan Gong, Mostafa Elhoushi, Alvin Cheung

TL;DR
AST-T5 introduces a structure-aware pretraining method that leverages Abstract Syntax Trees to significantly improve performance on code generation and understanding tasks, without requiring complex program analyses.
Contribution
It presents a novel AST-based pretraining paradigm that enhances code modeling by retaining structure, outperforming previous models like CodeT5 in code-to-code tasks.
Findings
AST-T5 consistently outperforms similar-sized LMs across various code-related tasks
Scores 2 points higher than CodeT5 in exact match on Bugs2Fix
Scores 3 points higher than CodeT5 in exact match on Java-C# transpilation
Abstract
Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score…
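The two ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses Python's standard `ast` module, character offsets instead of model tokens, and T5-style sentinel names such as `<extra_id_0>`. The function names (`ast_span_corrupt`, `segment`), the `should_mask` predicate, and the break-cost definition are all assumptions made for illustration; the summary above does not specify the paper's exact cost function or masking policy.

```python
import ast


def ast_span_corrupt(source, should_mask):
    """Replace selected AST subtrees with sentinels (illustrative only).

    Assumes ASCII source, because ast column offsets are byte offsets.
    `should_mask` is a hypothetical predicate choosing which nodes to mask.
    """
    # Character offset at which each line starts.
    offsets = [0]
    for line in source.splitlines(keepends=True):
        offsets.append(offsets[-1] + len(line))

    spans = []
    for node in ast.walk(ast.parse(source)):
        if hasattr(node, "end_lineno") and should_mask(node):
            start = offsets[node.lineno - 1] + node.col_offset
            end = offsets[node.end_lineno - 1] + node.end_col_offset
            spans.append((start, end))

    # Keep only outermost, non-overlapping subtrees, left to right.
    kept, last_end = [], 0
    for start, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        if start >= last_end:
            kept.append((start, end))
            last_end = end

    corrupted, target, cursor = [], [], 0
    for i, (start, end) in enumerate(kept):
        sentinel = f"<extra_id_{i}>"
        corrupted.append(source[cursor:start] + sentinel)
        target.append(sentinel + source[start:end])
        cursor = end
    corrupted.append(source[cursor:])
    return "".join(corrupted), "".join(target)


def segment(n_tokens, subtree_spans, max_len):
    """Pick cut points so every segment has at most max_len tokens while
    splitting as few subtrees as possible (assumed cost; simple DP).

    `subtree_spans` holds half-open (start, end) token ranges of subtrees.
    """
    # cost[i]: number of subtrees a cut placed before token i would split.
    cost = [0] * (n_tokens + 1)
    for start, end in subtree_spans:
        for i in range(start + 1, end):
            cost[i] += 1

    best = [float("inf")] * (n_tokens + 1)
    prev = [0] * (n_tokens + 1)
    best[0] = 0
    for j in range(1, n_tokens + 1):
        for i in range(max(0, j - max_len), j):
            candidate = best[i] + cost[j]
            if candidate < best[j]:
                best[j], prev[j] = candidate, i

    # Walk back from the end to recover the chosen cut points.
    cuts, j = [], n_tokens
    while j > 0:
        cuts.append(j)
        j = prev[j]
    return sorted(cuts)[:-1]  # interior cuts only


if __name__ == "__main__":
    src = "def add(a, b):\n    return a + b\n"
    print(ast_span_corrupt(src, lambda n: isinstance(n, ast.BinOp)))
    print(segment(8, [(0, 4), (4, 8)], max_len=5))
```

In this sketch, masking whole subtrees means the model is asked to regenerate a complete expression or statement rather than an arbitrary token run, and the segmentation step prefers cut points that fall between subtrees rather than inside them. A real pipeline would operate on tokenizer output and would sample which subtrees to mask rather than use a fixed predicate.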
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Advanced Malware Detection Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Gated Linear Unit · Inverse Square Root Schedule · Linear Layer · Adafactor · SentencePiece · Attention Dropout · Dropout
