cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree

Yilin Zhang; Xinran Zhao; Zora Zhiruo Wang; Chenyang Yang; Jiayi Wei; Tongshuang Wu

arXiv:2506.15655·cs.SE·October 6, 2025

TL;DR

This paper introduces a structure-aware chunking method using Abstract Syntax Trees to improve code retrieval and generation quality in Retrieval-Augmented Generation systems.

Contribution

It presents a novel AST-based chunking approach that maintains semantic coherence, outperforming line-based heuristics across multiple code tasks.

Findings

01. Boosted Recall@5 by 4.3 points on RepoEval
02. Increased Pass@1 by 2.67 points on SWE-bench
03. Improved semantic coherence of code chunks

Abstract

Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking, the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (cAST), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67…
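The split-then-merge idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the official repository is listed under Code & Models); it is an assumption-laden toy that uses Python's built-in `ast` module, measures chunk size in non-whitespace characters (an assumed metric), and only splits oversized class and function definitions. The names `size`, `chunk_siblings`, `chunk_source`, and `SAMPLE` are hypothetical.

```python
import ast

def size(text):
    """Assumed size metric: number of non-whitespace characters."""
    return sum(not ch.isspace() for ch in text)

def start_line(node):
    """First source line of a node, including any decorators."""
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno] + [d.lineno for d in decorators])

def text_of(lines, first, last):
    """Source text for the 1-indexed, inclusive line range [first, last]."""
    return "\n".join(lines[first - 1:last])

def chunk_siblings(nodes, lines, max_size):
    """Greedily merge adjacent sibling statements; recurse into oversized ones."""
    chunks = []
    span = None  # (first_line, last_line) of the chunk being built

    for node in nodes:
        first, last = start_line(node), node.end_lineno  # end_lineno: Python 3.8+
        if size(text_of(lines, first, last)) > max_size:
            # The node does not fit on its own: close the open chunk first.
            if span:
                chunks.append(text_of(lines, *span))
                span = None
            splittable = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
            body = node.body if isinstance(node, splittable) else []
            if body:
                # Simplification: the header (e.g. "class Big:") becomes its own chunk.
                header_end = start_line(body[0]) - 1
                if header_end >= first:
                    chunks.append(text_of(lines, first, header_end))
                chunks.extend(chunk_siblings(body, lines, max_size))
            else:
                # Simplification: other oversized statements are emitted whole.
                chunks.append(text_of(lines, first, last))
        elif span and size(text_of(lines, span[0], last)) <= max_size:
            span = (span[0], last)  # merge with the preceding siblings
        else:
            if span:
                chunks.append(text_of(lines, *span))
            span = (first, last)  # start a new chunk

    if span:
        chunks.append(text_of(lines, *span))
    return chunks

def chunk_source(source, max_size):
    """Chunk a Python module, assuming one statement per line."""
    tree = ast.parse(source)
    return chunk_siblings(tree.body, source.splitlines(), max_size)

SAMPLE = '''import os
import sys

def small():
    return 1

class Big:
    def a(self):
        return "aaaaaaaaaaaaaaaaaaaa"

    def b(self):
        return "bbbbbbbbbbbbbbbbbbbb"
'''

if __name__ == "__main__":
    for i, chunk in enumerate(chunk_source(SAMPLE, max_size=40)):
        print(f"--- chunk {i} ({size(chunk)} chars) ---")
        print(chunk)
```

With a limit of 40 non-whitespace characters, the two imports and `small` merge into one chunk, while the oversized `Big` class is split into its header and one chunk per method. A line-based splitter with a fixed window would have no such guarantee and could cut a method in half.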

Code & Models

Repositories

yilinjz/astchunk (Official)

Videos

cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling

Methods: Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART