aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing
Siyuan Jiang, Jia Li, He Zong, Huanyu Liu, Hao Zhu, Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning, Gen Wang, Yihong Dong, Kechi Zhang, Ge Li

TL;DR
aiXcoder-7B is a compact yet highly accurate code completion language model that leverages multi-objective training, diverse data sampling, and extensive high-quality data to outperform larger models.
Contribution
The paper introduces aiXcoder-7B, a lightweight LLM for code with novel training objectives and data strategies, achieving superior performance with fewer parameters.
Findings
aiXcoder-7B outperforms six similar-sized LLMs in code completion benchmarks.
It surpasses larger models like StarCoder2-15B and CodeLlama-34B in accuracy.
The model has been open-sourced and widely adopted, with over 2,200 GitHub stars.
Abstract
Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs have lower inference efficiency, which hurts developers' experience and productivity. In this paper, we propose a lightweight and effective LLM for code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B achieves higher code completion accuracy at a smaller scale (7 billion parameters). We attribute the superiority of aiXcoder-7B to three key factors: (1) Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. (2) Diverse data sampling strategies. They consider inter-file relationships and enhance the capability of…
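To make the idea of a syntax-aware fill-in-the-middle objective more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the sentinel strings, the function name `structured_fim_example`, and the prefix-suffix-middle prompt layout are all assumptions chosen for illustration. The only point it shows is that the masked span is a complete syntax node (here, a `return` statement found with the standard `ast` module) rather than an arbitrary character range.

```python
import ast

# Hypothetical sentinel strings; the tokens aiXcoder-7B actually uses
# are not stated in the abstract.
PRE, SUF, MID = "<PREFIX>", "<SUFFIX>", "<MIDDLE>"


def structured_fim_example(source, node_type=ast.Return):
    """Mask the first node of `node_type` and return (prompt, target).

    The masked span always covers one whole syntax node, which is the
    syntax-aware part of this sketch.
    """
    # ast reports column offsets in UTF-8 bytes, so slice the encoded source.
    data = source.encode("utf-8")
    starts, total = [], 0
    for line in data.splitlines(keepends=True):
        starts.append(total)  # byte offset where each line begins
        total += len(line)

    for node in ast.walk(ast.parse(source)):
        if isinstance(node, node_type):
            begin = starts[node.lineno - 1] + node.col_offset
            end = starts[node.end_lineno - 1] + node.end_col_offset
            prefix = data[:begin].decode("utf-8")
            middle = data[begin:end].decode("utf-8")
            suffix = data[end:].decode("utf-8")
            # Assumed layout: the model sees prefix and suffix, then
            # generates the middle after the final sentinel.
            return f"{PRE}{prefix}{SUF}{suffix}{MID}", middle
    return None  # no node of the requested type in this source


SAMPLE = "def add(a, b):\n    total = a + b\n    return total\n"
prompt, target = structured_fim_example(SAMPLE)
```

In this sketch `target` is the text the model would be trained to produce, and `prompt` is the context it conditions on. Passing a different `node_type` (for example `ast.If` or `ast.For`) masks a different kind of structure.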
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning in Bioinformatics · Topic Modeling
