Seed-Coder: Let the Code Model Curate Data for Itself
ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song

TL;DR
Seed-Coder introduces an open-source series of models that autonomously curate high-quality code data using LLM-based pipelines, reducing human effort and biases, and achieving state-of-the-art performance in code tasks.
Contribution
The paper presents a novel, fully model-driven data pipeline for code pretraining, minimizing human involvement and bias, and introduces models with advanced reasoning capabilities.
Findings
Achieves state-of-the-art results among open-source models of similar size.
Surpasses some larger models in code generation and reasoning tasks.
Demonstrates effective multi-step code reasoning with LongCoT reinforcement learning.
Abstract
Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often rely heavily on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering…
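The score-then-filter idea in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `CodeFile`, `toy_quality_score`, `filter_by_score`, and the 0.5 threshold are illustrative assumptions, and the toy heuristic only stands in for the LLM-based scorer the abstract mentions. It is not the paper's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CodeFile:
    """A candidate pretraining document (hypothetical container)."""
    path: str
    text: str


def toy_quality_score(f: CodeFile) -> float:
    """Stand-in for an LLM quality scorer; returns a score in [0.0, 1.0].

    ASSUMPTION: a real pipeline would query a model here. This heuristic
    only rewards the presence of a definition and of '#' comments.
    """
    lines = [ln for ln in f.text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    commented = sum(1 for ln in lines if ln.lstrip().startswith("#"))
    has_def = any(ln.lstrip().startswith(("def ", "class ")) for ln in lines)
    structure = 0.5 if has_def else 0.0
    comments = 0.5 * min(1.0, 4 * commented / len(lines))
    return structure + comments


def filter_by_score(
    files: Iterable[CodeFile],
    scorer: Callable[[CodeFile], float],
    threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> List[CodeFile]:
    """Keep only the files whose score meets the threshold."""
    return [f for f in files if scorer(f) >= threshold]


corpus = [
    CodeFile("a.py", "# add two numbers\ndef add(a, b):\n    return a + b\n"),
    CodeFile("b.py", "x=1\ny=2\n"),
    CodeFile("c.py", ""),
]
kept = filter_by_score(corpus, toy_quality_score)
```

Passing the scorer in as a callable keeps the filter independent of how scores are produced, so a model-backed scorer could replace the heuristic without changing `filter_by_score`.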
Code & Models
- 🤗 ByteDance-Seed/Seed-Coder-8B-Base (model) · 1.9k downloads · 65 likes
- 🤗 ByteDance-Seed/Seed-Coder-8B-Instruct (model) · 7.7k downloads · 110 likes
- 🤗 ByteDance-Seed/Seed-Coder-8B-Reasoning (model) · 145 downloads · 150 likes
- 🤗 ByteDance-Seed/Seed-Coder-8B-Reasoning-bf16 (model) · 34 downloads · 20 likes
- 🤗 Mungert/Seed-Coder-8B-Reasoning-GGUF (model) · 182 downloads · 1 like
- 🤗 cgus/Seed-Coder-8B-Base-exl2 (model) · 14 downloads
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
