Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs
Guangsheng Ou, Qiming Zhang, Sirong Chen, Anji Li, Dong Xu, Tiancheng Luo, Dekun Dai, Cuiyun Gao, Long Wang, Jun Zhou, Mingwei Liu, Zibin Zheng

TL;DR
This paper introduces UCD-Training, a novel two-stage training framework that synthesizes reasoning-aware data from unseen codebases using code graphs, improving large language models' understanding of new software environments.
Contribution
The paper presents UCD-Training, a new method combining dependency-preserving pretraining and graph-grounded fine-tuning with synthesized data for better reasoning on unseen codebases.
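As a rough illustration of what a code graph can capture, the sketch below extracts a function-level call graph from Python source and lists functions so that each one appears after the functions it depends on. It is a minimal sketch under stated assumptions, not the paper's pipeline: the summary does not specify how UCD-Training builds its graphs or orders its data, and `SOURCE`, `build_call_graph`, and `dependency_order` are hypothetical names chosen for this example.

```python
import ast
from graphlib import TopologicalSorter

# Hypothetical stand-in for an "unseen codebase": three small functions.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def run(path):
    return parse(load(path))
'''


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the top-level functions it calls."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph: dict[str, set[str]] = {name: set() for name in defined}
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        for sub in ast.walk(node):
            # Only direct calls by bare name are tracked (an assumption of
            # this sketch); method calls and builtins are ignored.
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                callee = sub.func.id
                if callee in defined and callee != node.name:
                    graph[node.name].add(callee)
    return graph


def dependency_order(graph: dict[str, set[str]]) -> list[str]:
    """Return functions with callees before callers (raises on cycles)."""
    return list(TopologicalSorter(graph).static_order())


graph = build_call_graph(SOURCE)
order = dependency_order(graph)
```

Here `run` depends on both `load` and `parse`, so it is emitted last. A real codebase would also need classes, imports, and cross-file references, which this sketch leaves out.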
Findings
Enhanced code generation performance on unseen codebases.
Effective reasoning trace incorporation improves model understanding.
New benchmark, UnseenCodeBench, for evaluating unseen codebase reasoning.
Abstract
In the context of newly released software frameworks, large language models (LLMs) often exhibit poor performance and a high rate of hallucination, as they were not exposed to such environments during training. Although inference-time augmentation techniques such as retrieval-augmented generation (RAG) can partially mitigate hallucinations, knowledge injection through prompting alone is insufficient to enable models to fully understand the intrinsic relationships among different components of a codebase, or to reason about how to compose and apply them correctly. Explicit knowledge injection can be achieved through post-training; however, compared with public code domains, unseen codebases typically provide only source code and lack large volumes of high-quality, usage-oriented code that can be directly leveraged as training data. Consequently, existing data synthesis approaches are…
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
