StarCoder: may the source be with you!
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier

TL;DR
StarCoder is an open-source 15.5B-parameter code generation model. Its base model was trained on 1 trillion tokens from The Stack; it outperforms many existing models, supports multiple programming languages, and includes safety and attribution features.
Contribution
Introduces StarCoder, a new open-source 15.5B-parameter code LLM with 8K context length, infilling, and multi-query attention, together with a comprehensive evaluation and safety measures, advancing open scientific collaboration in code AI.
Findings
StarCoderBase outperforms every open Code LLM that supports multiple programming languages.
StarCoder achieves 40% pass@1 on HumanEval.
StarCoder maintains performance across multiple programming languages.
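For readers unfamiliar with the metric, pass@k is commonly estimated by drawing n samples per problem, counting the c samples that pass the unit tests, and computing 1 - C(n-c, k) / C(n, k). The sketch below shows that standard estimator; the function name and the sample counts are illustrative assumptions, and this page does not state how the 40% figure itself was computed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated (assumed sampling budget)
    c: number of those samples that pass all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k draw contains a passing sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one problem, 4 of them correct.
score = pass_at_k(n=10, c=4, k=1)
```

For k = 1 the estimator reduces to c / n, the fraction of correct samples; a benchmark score is the mean of this value over all problems.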
Abstract
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python,…
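One way to see why multi-query attention helps large-batch inference is to compare the size of the key/value cache. With multi-query attention all query heads share a single key/value head, so the cache shrinks by a factor equal to the number of heads. The sketch below is back-of-the-envelope arithmetic only: the layer count, head count, head dimension, batch size, and 2-byte values are assumed for illustration and are not taken from this page.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every token in a batch."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed, illustrative configuration (hypothetical values).
config = dict(batch=32, seq_len=8192, n_layers=40, head_dim=128)

# Multi-head attention: one key/value head per query head (48 assumed).
mha_bytes = kv_cache_bytes(n_kv_heads=48, **config)

# Multi-query attention: a single key/value head shared by all query heads.
mqa_bytes = kv_cache_bytes(n_kv_heads=1, **config)

reduction = mha_bytes // mqa_bytes
print(f"MHA cache: {mha_bytes / 1024**3:.0f} GiB, "
      f"MQA cache: {mqa_bytes / 1024**3:.0f} GiB, "
      f"reduction: {reduction}x")
```

Under these assumptions the cache reduction equals the number of query heads, which is what leaves room for larger batches at the full 8K context.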
Code & Models
- 🤗 bigcode/starcoder · model · 10k dl · ♡ 2932
- 🤗 codesage/codesage-large-v2 · model · 1.3k dl · ♡ 13
- 🤗 bigcode/starcoderbase-megatron · model · ♡ 2
- 🤗 bigcode/starcoderbase · model · 56 dl · ♡ 416
- 🤗 GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct · model · 825 dl · ♡ 81
- 🤗 bigcode/starcoderplus · model · 67 dl · ♡ 219
- 🤗 bigcode/starcoder-megatron · model · ♡ 6
- 🤗 NeoDim/starcoderbase-GGML · model · ♡ 4
- 🤗 NeoDim/starcoder-GGML · model · ♡ 27
- 🤗 michaelfeil/ct2fast-starcoder · model · 10 dl · ♡ 13
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Machine Learning and Data Classification
