CodeShell Technical Report
Rui Xie, Zhengran Zeng, Zhuohao Yu, Chang Gao, Shikun Zhang, Wei Ye

TL;DR
This paper introduces CodeShell-Base, a 7-billion-parameter code language model that adds Grouped-Query Attention and Rotary Positional Embedding to a GPT-2 backbone, is pre-trained on filtered GitHub code, and is reported to outperform CodeLlama on HumanEval.
Contribution
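The paper contributes a 7-billion-parameter foundation model with an 8K context length, built by incorporating Grouped-Query Attention and Rotary Positional Embedding into GPT-2, together with a data pre-processing pipeline that combines similar-data deduplication, perplexity-based filtering, and model-based filtering. The authors report strong results on code understanding and generation.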
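To make one of the architectural enhancements concrete, here is a minimal pure-Python sketch of Grouped-Query Attention, which the abstract names as part of the design. The idea is that several query heads share a single key/value head. The function names, tensor layout, and head counts below are illustrative assumptions and are not taken from the CodeShell implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Return the index of the key/value head shared by this query head's group."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def grouped_query_attention(q, k, v):
    """Causal attention where groups of query heads share one K/V head.

    q: [num_query_heads][seq_len][dim]
    k, v: [num_kv_heads][seq_len][dim]
    Returns a nested list shaped like q.
    """
    num_q, num_kv = len(q), len(k)
    out = []
    for h in range(num_q):
        g = kv_head_for(h, num_q, num_kv)  # shared K/V head for this query head
        dim = len(q[h][0])
        head_out = []
        for t, q_t in enumerate(q[h]):
            # Causal mask: position t only attends to positions 0..t.
            scores = [
                sum(a * b for a, b in zip(q_t, k[g][s])) / math.sqrt(dim)
                for s in range(t + 1)
            ]
            weights = softmax(scores)
            head_out.append(
                [sum(w * v[g][s][d] for s, w in enumerate(weights)) for d in range(dim)]
            )
        out.append(head_out)
    return out
```

With 8 query heads and 2 key/value heads (hypothetical numbers), heads 0 to 3 read from K/V head 0 and heads 4 to 7 from K/V head 1, so the key/value cache is a quarter of the size it would be with one K/V head per query head.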
Findings
Outperforms CodeLlama on HumanEval after training on 500 billion tokens
Curated 100 billion tokens of high-quality pre-training data from GitHub
Demonstrates strong performance across Python, Java, and C++ datasets
Abstract
Code large language models mark a pivotal breakthrough in artificial intelligence. They are specifically crafted to understand and generate programming languages, significantly boosting the efficiency of coding development workflows. In this technical report, we present CodeShell-Base, a seven-billion-parameter foundation model with 8K context length, showcasing exceptional proficiency in code comprehension. By incorporating Grouped-Query Attention and Rotary Positional Embedding into GPT-2, CodeShell-Base integrates the structural merits of StarCoder and CodeLlama and forms its unique architectural design. We then carefully built a comprehensive data pre-processing pipeline, including similar data deduplication, perplexity-based data filtering, and model-based data filtering. Through this process, we have curated 100 billion tokens of high-quality pre-training data from GitHub. Benefiting from…
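The first two pre-processing stages named in the abstract can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it swaps in brute-force Jaccard similarity over token 3-grams for whatever deduplication method the paper uses, and it takes the perplexity scorer as a caller-supplied function. The names `shingles`, `drop_near_duplicates`, `filter_by_perplexity`, and both thresholds are assumptions made for the example.

```python
import re


def shingles(text, n=3):
    """Split text into word/punctuation tokens and return the set of token n-grams."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any document already kept.

    Pairwise comparison is quadratic in the number of documents; a real
    pipeline over GitHub-scale data would need an approximate method.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def filter_by_perplexity(docs, perplexity_fn, max_perplexity):
    """Keep documents whose perplexity, as scored by perplexity_fn, is at most the cutoff.

    perplexity_fn is a hypothetical callable (document -> float), for example
    a wrapper around a small language model.
    """
    return [d for d in docs if perplexity_fn(d) <= max_perplexity]
```

A model-based quality filter would slot in as a third stage of the same shape: score each surviving document with a classifier and keep those above a chosen cutoff.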
Taxonomy
Topics: Software System Performance and Reliability
Methods: Attention Is All You Need · Softmax · Discriminative Fine-Tuning · Feedforward Network · Linear Layer · Dropout · Dense Connections · Adam · Grouped-query attention
