Infinite-Instruct: Synthesizing Scaling Code Instruction Data with Bidirectional Synthesis and Static Verification

Wenjing Xing; Wenke Lu; Yeheng Duan; Bing Zhao; Zhenghui Kang; Yaolong Wang; Kai Gao; Lei Qiao

arXiv:2505.23177·cs.CL·May 30, 2025

TL;DR

Infinite-Instruct is a scalable framework that synthesizes high-quality, logically consistent code instruction data using bidirectional methods and static verification, significantly improving code generation performance of large language models.

Contribution

The paper introduces a novel automated framework combining reverse and backfeeding construction with static code analysis to generate diverse, high-quality code instruction datasets for LLM training.

Findings

01. Achieved up to 36.95% performance improvement on code benchmarks.

02. Generated datasets enable comparable performance with less fine-tuning data.

03. Open-sourced datasets facilitate further research in code instruction synthesis.

Abstract

Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average…
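The final filtering stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract describes cross-lingual static analysis, whereas the example below checks Python answers only, using the standard-library `ast` module. The function names `passes_static_check` and `filter_samples`, and the `question`/`answer` sample layout, are assumptions made for illustration.

```python
import ast


def passes_static_check(code: str) -> bool:
    """Return True if `code` parses as Python and contains at least one statement.

    Hypothetical helper: a stand-in for a much richer static analysis step.
    """
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        # SyntaxError: malformed code; ValueError: e.g. null bytes on some versions.
        return False
    # Reject empty or comment-only answers, which parse but contain no statements.
    return len(tree.body) > 0


def filter_samples(samples):
    """Keep only question-answer pairs whose answer passes the static check."""
    return [s for s in samples if passes_static_check(s["answer"])]


# Assumed sample layout: one dict per synthesized question-answer pair.
samples = [
    {"question": "Add two numbers.", "answer": "def add(a, b):\n    return a + b\n"},
    {"question": "Broken sample.", "answer": "def add(a, b:\n    return a + b\n"},
]
kept = filter_samples(samples)
```

A parse-only check catches syntactically invalid samples without executing any code, which is the main appeal of static verification for large synthetic datasets. It does not detect logic errors or undefined names; a fuller pipeline would presumably add per-language linters or type checkers.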

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Natural Language Processing Techniques