AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Zifan Song; Yudong Wang; Wenwei Zhang; Kuikun Liu; Chengqi Lyu; Demin Song; Qipeng Guo; Hang Yan; Dahua Lin; Kai Chen; Cairong Zhao

arXiv:2405.19265·cs.CL·February 4, 2025

TL;DR

AlchemistCoder enhances code generation and generalization in LLMs by fine-tuning on multi-source data with a novel harmonization method using hindsight relabeling and data construction processes, outperforming comparable models.

Contribution

The paper introduces AlchemistCoder, a new fine-tuning approach that harmonizes multi-source code data and incorporates data construction tasks to improve code LLM performance.

Findings

01. Outperforms models of similar size (6.7B/7B)

02. Rivals or surpasses larger models (15B/33B/70B)

03. Demonstrates improved instruction-following and code understanding

Abstract

Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may not fully elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities, fine-tuned on multi-source data. To achieve this, we are the first to unveil inherent conflicts among the various styles and qualities in multi-source code corpora, and we introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks,…
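To make the idea of data-specific prompts with hindsight relabeling more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that relabeling means amending an instruction after the fact so that it states requirements the existing response already satisfies, which lets samples from differently styled sources coexist in one training set. The `Sample` class, the `describe_response` heuristic, and the `[Requirements: ...]` format are all hypothetical; the abstract does not say how AlchemistPrompts are actually produced.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    source: str       # name of the originating dataset (hypothetical field)
    instruction: str
    response: str


def describe_response(response: str) -> str:
    """Hypothetical stand-in for a data-specific prompt generator.

    Summarizes observable traits of the response so they can be
    stated as requirements in the instruction.
    """
    traits = []
    if "```" in response:
        traits.append("wrap code in a fenced block")
    if "#" in response or "//" in response:
        traits.append("include inline comments")
    if len(response.split()) > 40:
        traits.append("explain the solution in detail")
    else:
        traits.append("answer concisely")
    return "; ".join(traits)


def relabel(sample: Sample) -> Sample:
    """Hindsight relabeling (assumed form): keep the response unchanged,
    and extend the instruction with requirements the response already meets."""
    hint = describe_response(sample.response)
    new_instruction = f"{sample.instruction}\n\n[Requirements: {hint}]"
    return Sample(sample.source, new_instruction, sample.response)


if __name__ == "__main__":
    s = Sample(
        source="source_a",
        instruction="Reverse a string.",
        response="def rev(s):\n    return s[::-1]  # slice with step -1",
    )
    print(relabel(s).instruction)
```

In this sketch only the instruction changes; the response is left untouched, so two sources that answer the same task in different styles end up with distinguishable instructions rather than conflicting targets.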

Taxonomy

Topics: Scientific Computing and Data Management · Web Data Mining and Analysis · Semantic Web and Ontologies