API-guided Dataset Synthesis to Finetune Large Code Models
Zongjie Li, Daoyuan Wu, Shuai Wang, Zhendong Su

TL;DR
This paper introduces DataScope, an API-guided framework for synthesizing high-quality datasets to improve fine-tuning of large code models, significantly boosting their performance in general and domain-specific tasks.
Contribution
The paper presents DataScope, a novel API-guided dataset synthesis framework with components Dsel and Dgen, improving dataset quality and model fine-tuning efficiency for large code models.
Findings
Models fine-tuned on DataScope datasets outperform models fine-tuned on larger, unoptimized datasets.
API coverage-based selection enhances dataset quality in general scenarios.
Synthesized datasets lead to significant performance improvements in code models.
Abstract
Large code models (LCMs), pre-trained on vast code corpora, have demonstrated remarkable performance across a wide array of code-related tasks. Supervised fine-tuning (SFT) plays a vital role in aligning these models with specific requirements and enhancing their performance in particular domains. However, synthesizing high-quality SFT datasets poses a significant challenge due to the uneven quality of datasets and the scarcity of domain-specific datasets. Inspired by APIs as high-level abstractions of code that encapsulate rich semantic information in a concise structure, we propose DataScope, an API-guided dataset synthesis framework designed to enhance the SFT process for LCMs in both general and domain-specific scenarios. DataScope comprises two main components: Dsel and Dgen. On one hand, Dsel employs API coverage as a core metric, enabling efficient dataset synthesis in general…
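The abstract names API coverage as the core metric of Dsel but the text here is cut off before any procedure is given. The following is only an illustrative sketch of what coverage-driven selection could look like, not the authors' algorithm. It assumes that "API coverage" means the number of distinct called names across the selected samples, that samples are Python snippets, and that a greedy pick under a fixed budget is acceptable. The function names extract_apis and select_by_api_coverage are hypothetical.

```python
import ast


def _dotted(node):
    """Return a dotted name such as 'os.getcwd' for simple call targets, else None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = _dotted(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def extract_apis(code):
    """Collect the names of called functions in a snippet (assumed proxy for 'APIs')."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return set()  # unparsable samples contribute no coverage
    apis = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = _dotted(node.func)
            if name:
                apis.add(name)
    return apis


def select_by_api_coverage(samples, budget):
    """Greedily pick up to `budget` samples, each adding the most uncovered APIs."""
    api_sets = [extract_apis(s) for s in samples]
    covered, chosen = set(), []
    remaining = list(range(len(samples)))
    while remaining and len(chosen) < budget:
        # Ties go to the lowest index because max() returns the first maximum.
        best = max(remaining, key=lambda i: len(api_sets[i] - covered))
        if not api_sets[best] - covered:
            break  # no remaining sample adds new coverage
        chosen.append(best)
        covered |= api_sets[best]
        remaining.remove(best)
    return chosen, covered


samples = [
    "import os\nos.getcwd()\nprint(1)",
    "print(2)",
    "import json\njson.dumps({})\nos.getcwd()",
]
chosen, covered = select_by_api_coverage(samples, budget=2)
```

In this toy run the second sample is skipped because its only call is already covered by the first. A real pipeline would need a language-appropriate API extractor and possibly a different selection strategy.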
Taxonomy
Topics: Robotics and Automated Systems · Model-Driven Software Engineering Techniques · Software Testing and Debugging Techniques
Methods: Shrink and Fine-Tune
