Crystal: Illuminating LLM Abilities on Language and Code
Tianhua Tao, Junbo Li, Bowen Tan, Hongyi Wang, William Marshall, Bhargav M Kanakiya, Joel Hestness, Natalia Vassilieva, Zhiqiang Shen, Eric P. Xing, Zhengzhong Liu

TL;DR
This paper introduces Crystal, a pretraining strategy that effectively integrates natural language and coding skills in large language models, achieving competitive performance with improved data efficiency and providing comprehensive open resources.
Contribution
It proposes a two-phase pretraining approach with adjusted code/language ratios, demonstrating enhanced capabilities and data efficiency in a unified LLM for language and code tasks.
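The two-phase idea can be illustrated with a minimal sketch of a data-mixing schedule, in which each phase draws training batches from code or natural-language data at its own ratio. This is not the authors' implementation: the `Phase` class, the helper functions, the batch counts, and the code fractions (0.2 and 0.6) are hypothetical placeholders chosen only to show the mechanism, since this summary does not state the actual ratios.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One pretraining phase (hypothetical structure, for illustration only)."""
    name: str
    num_batches: int
    code_fraction: float  # assumed share of batches drawn from code data


def build_schedule(phases, seed=0):
    """Return a list of (phase_name, source) pairs, one per training batch.

    Each batch is assigned to "code" with probability `code_fraction`
    for its phase, otherwise to "language".
    """
    rng = random.Random(seed)
    schedule = []
    for phase in phases:
        for _ in range(phase.num_batches):
            source = "code" if rng.random() < phase.code_fraction else "language"
            schedule.append((phase.name, source))
    return schedule


def code_share(schedule, phase_name):
    """Observed fraction of code batches within one phase."""
    sources = [src for name, src in schedule if name == phase_name]
    return sum(src == "code" for src in sources) / len(sources)


# Placeholder values: the real ratios are not given in this summary.
PHASES = [
    Phase("phase_1", num_batches=1000, code_fraction=0.2),
    Phase("phase_2", num_batches=1000, code_fraction=0.6),
]

schedule = build_schedule(PHASES)
for phase in PHASES:
    print(phase.name, round(code_share(schedule, phase.name), 2))
```

In a real pipeline the ratio would typically be applied over tokens rather than batches, and the data sources would be sharded datasets rather than string labels; the sketch only shows how a phase-dependent mixing ratio changes what the model sees over the course of training.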
Findings
Crystal's natural language and coding performance is comparable to that of Llama 2 and Code Llama, respectively.
Crystal is pretrained on 1.4 trillion tokens, fewer than the 2 trillion used by comparable models.
Careful data recipe design is crucial for training efficiency and performance.
Abstract
Large Language Models (LLMs) specializing in code generation (which are also often referred to as code LLMs), e.g., StarCoder and Code Llama, play increasingly critical roles in various software development scenarios. It is also crucial for code LLMs to possess both code generation and natural language abilities for many specific applications, such as code snippet retrieval using natural language or code explanations. The intricate interaction between acquiring language and coding skills complicates the development of strong code LLMs. Furthermore, there is a lack of thorough prior studies on the LLM pretraining strategy that mixes code and natural language. In this work, we propose a pretraining strategy to enhance the integration of natural language and coding capabilities within a single LLM. Specifically, it includes two phases of training with appropriately adjusted code/language…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: LLaMA
