OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Siming Huang, Tianhao Cheng, J.K. Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, Jiaheng Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu

TL;DR
OpenCoder is a high-performance open-source code LLM accompanied by comprehensive data, training protocols, and experimental results to foster transparent and reproducible research in code AI.
Contribution
We introduce OpenCoder, a top-tier open-source code LLM with complete data, training pipeline, and experimental details for scientific transparency and reproducibility.
Findings
Achieves performance comparable to leading proprietary models.
Provides open access to model weights, data, and training protocols.
Establishes key ingredients for building high-quality code LLMs.
Abstract
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs suitable for rigorous scientific investigation, particularly those with reproducible data processing pipelines and transparent training protocols, remain limited. The scarcity is due to various challenges, including resource constraints, ethical considerations, and the competitive advantages of keeping models advanced. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible…
Code & Models
- 🤗 QuantFactory/OpenCoder-8B-Base-GGUF · model · 427 dl · ♡ 4
- 🤗 infly/OpenCoder-1.5B-Base · model · 129 dl · ♡ 23
- 🤗 infly/OpenCoder-8B-Base · model · 1.4k dl · ♡ 31
- 🤗 infly/OpenCoder-1.5B-Instruct · model · 705 dl · ♡ 47
- 🤗 infly/OpenCoder-8B-Instruct · model · 13k dl · ♡ 202
- 🤗 QuantFactory/OpenCoder-8B-Instruct-GGUF · model · 618 dl · ♡ 6
- 🤗 QuantFactory/OpenCoder-1.5B-Instruct-GGUF · model · 504 dl · ♡ 4
- 🤗 QuantFactory/OpenCoder-1.5B-Base-GGUF · model · 284 dl · ♡ 1
- 🤗 cortexso/opencoder · model · 76 dl
- 🤗 OpenCoder-LLM/OpenCoder-1.5B-Base-Checkpoints · model · ♡ 1
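
The `infly/` repositories listed above can likely be loaded with the Hugging Face `transformers` library. The following is a minimal sketch, not the authors' official usage: the helper names `repo_id` and `complete` are hypothetical, the default generation settings are arbitrary, and passing `trust_remote_code=True` is an assumption that should be checked against each model card before running.

```python
# Repository IDs are copied from the model list above.
# Helper names (repo_id, complete) are hypothetical, for illustration only.
OPENCODER_REPOS = {
    ("1.5B", "Base"): "infly/OpenCoder-1.5B-Base",
    ("1.5B", "Instruct"): "infly/OpenCoder-1.5B-Instruct",
    ("8B", "Base"): "infly/OpenCoder-8B-Base",
    ("8B", "Instruct"): "infly/OpenCoder-8B-Instruct",
}


def repo_id(size: str, variant: str) -> str:
    """Map a (size, variant) pair to one of the listed repository IDs."""
    try:
        return OPENCODER_REPOS[(size, variant)]
    except KeyError:
        raise ValueError(f"no listed OpenCoder model for {size!r}/{variant!r}") from None


def complete(prompt: str, size: str = "1.5B", variant: str = "Base",
             max_new_tokens: int = 64) -> str:
    """Download the chosen model and return the prompt plus its continuation."""
    # Imported lazily so that repo_id() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = repo_id(size, variant)
    # Assumption: the repository may ship custom code; verify on the model card.
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling `complete("def quicksort(arr):")` would download the 1.5B base weights on first use, so it needs network access and enough memory for the model. The GGUF repositories in the list are quantized conversions intended for other runtimes and are not covered by this sketch.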
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
