CodeShell Technical Report

Rui Xie; Zhengran Zeng; Zhuohao Yu; Chang Gao; Shikun Zhang; Wei Ye

arXiv:2403.15747 · cs.SE · March 26, 2024 · 2 citations

TL;DR

This paper introduces CodeShell-Base, a 7-billion-parameter code language model with an 8K context length. It extends a GPT-2 architecture with Grouped-Query Attention and Rotary Positional Embedding, is trained on filtered GitHub code, and is reported to outperform CodeLlama on HumanEval.

Contribution

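The paper presents a 7-billion-parameter model that adds Grouped-Query Attention and Rotary Positional Embedding to GPT-2, together with a data pre-processing pipeline that combines similar-data deduplication, perplexity-based filtering, and model-based filtering. The authors report strong results in code understanding and generation.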

Findings

01 Outperforms CodeLlama on HumanEval after training on 500 billion tokens

02 Curates 100 billion tokens of high-quality code data from GitHub

03 Demonstrates strong performance on Python, Java, and C++ datasets

Abstract

Code large language models mark a pivotal breakthrough in artificial intelligence. They are specifically crafted to understand and generate programming languages, significantly boosting the efficiency of coding development workflows. In this technical report, we present CodeShell-Base, a seven-billion-parameter foundation model with 8K context length, showcasing exceptional proficiency in code comprehension. By incorporating Grouped-Query Attention and Rotary Positional Embedding into GPT-2, CodeShell-Base integrates the structural merits of StarCoder and CodeLlama and forms its unique architectural design. We then carefully built a comprehensive data pre-processing pipeline, including similar-data deduplication, perplexity-based data filtering, and model-based data filtering. Through this process, we have curated 100 billion tokens of high-quality pre-training data from GitHub. Benefiting from…
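The two architectural ingredients named in the abstract can be illustrated with a minimal sketch. It is not taken from the CodeShell code base: the function names, the head counts, the rotary base of 10000, and the choice to rotate adjacent (even, odd) dimension pairs are all assumptions made for illustration. The first function shows the head-sharing rule behind Grouped-Query Attention, and the second applies a rotary position rotation to a single vector.

```python
import math


def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Grouped-query attention: consecutive query heads share one key/value head.

    Hypothetical helper; the head counts passed in are illustrative only.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def apply_rope(vec, position, base=10000.0):
    """Rotate each (even, odd) pair of dimensions by an angle that grows with position.

    The base value and the adjacent-pair layout are assumptions; some
    implementations pair the first half of the vector with the second half.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Lower dimension pairs rotate faster than higher ones.
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, used here as the attention score before softmax."""
    return sum(x * y for x, y in zip(a, b))
```

With 8 query heads and 2 key/value heads, `[kv_head_for_query_head(h, 8, 2) for h in range(8)]` gives `[0, 0, 0, 0, 1, 1, 1, 1]`, so the key/value cache holds 2 heads instead of 8. For the rotary part, `dot(apply_rope(q, m), apply_rope(k, n))` depends only on the offset `m - n`, which is the property that lets attention scores encode relative position.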

Taxonomy

Topics: Software System Performance and Reliability

Methods: Attention Is All You Need · Softmax · Discriminative Fine-Tuning · Feedforward Network · Linear Layer · Dropout · Dense Connections · Adam · Grouped-query attention