AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data
Zifan Song, Yudong Wang, Wenwei Zhang, Kuikun Liu, Chengqi Lyu, Demin Song, Qipeng Guo, Hang Yan, Dahua Lin, Kai Chen, Cairong Zhao

TL;DR
AlchemistCoder enhances code generation and generalization in LLMs by fine-tuning on multi-source data with a novel harmonization method using hindsight relabeling and data construction processes, outperforming comparable models.
Contribution
The paper introduces AlchemistCoder, a new fine-tuning approach that harmonizes multi-source code data and incorporates data construction tasks to improve code LLM performance.
Findings
Outperforms models of similar size (6.7B/7B)
Rivals or surpasses larger models (15B/33B/70B)
Demonstrates improved instruction-following and code understanding
Abstract
Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we pioneer to unveil inherent conflicts among the various styles and qualities in multi-source code corpora and introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks,…
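The harmonization idea in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the `Sample` type, the `describe_response` heuristic, and the source names are hypothetical, introduced only to show what relabeling in hindsight means. The prompt is written after looking at the response that already exists, so instruction-response pairs from differently styled sources stop contradicting each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    source: str       # which corpus the pair came from (hypothetical names)
    instruction: str
    response: str


def describe_response(sample: Sample) -> str:
    """Hypothetical stand-in for the relabeling step.

    Inspects the existing response (hindsight) and writes a short prompt
    describing its style. This keyword heuristic is an assumption made
    for illustration only.
    """
    traits = []
    if "def " in sample.response:
        traits.append("Python")
    if "#" in sample.response or '"""' in sample.response:
        traits.append("commented")
    if len(sample.response.splitlines()) <= 3:
        traits.append("concise")
    style = ", ".join(traits) if traits else "plain"
    return f"Provide a {style} solution."


def harmonize(samples: list[Sample]) -> list[Sample]:
    """Append a data-specific prompt to each instruction; responses stay untouched."""
    return [
        Sample(s.source, f"{s.instruction}\n{describe_response(s)}", s.response)
        for s in samples
    ]


# Two sources answer the same instruction in different styles.
mixed = [
    Sample("source_a", "Add two numbers.", "def add(a, b):\n    return a + b"),
    Sample("source_b", "Add two numbers.", "Use the + operator: a + b"),
]
harmonized = harmonize(mixed)
for sample in harmonized:
    print(repr(sample.instruction))
```

Before harmonization, the two pairs share one instruction but disagree on the answer style. Afterwards, each instruction carries a prompt that matches its own response, so the pairs no longer conflict.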
Code & Models
- 🤗 internlm/AlchemistCoder-DS-6.7B (model) · 22 dl · ♡ 11
- 🤗 internlm/AlchemistCoder-CL-7B (model) · 20 dl · ♡ 2
- 🤗 internlm/AlchemistCoder-L-7B (model) · 11 dl · ♡ 3
- 🤗 QuantFactory/AlchemistCoder-L-7B-GGUF (model) · 22 dl · ♡ 1
- 🤗 lmstudio-community/AlchemistCoder-DS-6.7B-GGUF (model) · 34 dl · ♡ 3
- 🤗 lmstudio-community/AlchemistCoder-L-7B-GGUF (model) · 93 dl · ♡ 3
- 🤗 cgus/AlchemistCoder-DS-6.7B-exl2 (model) · 7 dl
- 🤗 RichardErkhov/internlm_-_AlchemistCoder-L-7B-gguf (model) · 6 dl
Taxonomy
Topics: Scientific Computing and Data Management · Web Data Mining and Analysis · Semantic Web and Ontologies
