Ecosystem of Large Language Models for Code
Zhou Yang, Jieke Shi, Premkumar Devanbu, David Lo

TL;DR
This paper analyzes the ecosystem of large language models for code, focusing on datasets, models, reuse practices, and publication norms, revealing influential entities and unique licensing and documentation patterns.
Contribution
It provides a comprehensive analysis of the code model ecosystem using Hugging Face data, categorizes reuse practices, and examines documentation and licensing trends.
Findings
The ecosystem follows a power-law distribution, with a few dominant models and datasets.
Top reuse practices include fine-tuning, architecture sharing, and quantization.
Documentation and licensing practices differ from general AI repositories.
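The power-law finding above can be illustrated with a small sketch. This is not the paper's method, which this summary does not describe. It assumes a continuous power law with a known lower cutoff `x_min` and applies the standard maximum-likelihood estimate of the exponent to synthetic "download counts". The function names, the synthetic data, and the chosen exponent of 2.5 are all hypothetical.

```python
import math
import random


def sample_power_law(n, alpha, x_min, seed=0):
    """Draw n synthetic values from a continuous power law (illustrative data only)."""
    rng = random.Random(seed)
    # Inverse-transform sampling: x = x_min * (1 - u)^(-1 / (alpha - 1)).
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def estimate_power_law_exponent(values, x_min):
    """Maximum-likelihood estimate of alpha for values >= x_min."""
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    return 1.0 + len(tail) / log_sum


def top_share(values, fraction):
    """Share of the total held by the top `fraction` of items."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


# Synthetic stand-in for per-model download counts (assumed, not real data).
samples = sample_power_law(n=20000, alpha=2.5, x_min=1.0)
alpha_hat = estimate_power_law_exponent(samples, x_min=1.0)
print(f"estimated exponent: {alpha_hat:.2f}")
print(f"share held by top 1% of items: {top_share(samples, 0.01):.1%}")
```

With real data, the lower cutoff would itself need to be estimated, and a goodness-of-fit check would be required before claiming that a power law holds.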
Abstract
The availability of vast amounts of publicly accessible data of source code and the advances in modern language models, coupled with increasing computational resources, have led to a remarkable surge in the development of large language models for code (LLM4Code, for short). The interaction between code datasets and models gives rise to a complex ecosystem characterized by intricate dependencies that are worth studying. This paper introduces a pioneering analysis of the code model ecosystem. Utilizing Hugging Face -- the premier hub for transformer-based models -- as our primary source, we curate a list of datasets and models that are manually confirmed to be relevant to software engineering. By analyzing the ecosystem, we first identify the popular and influential datasets, models, and contributors. The popularity is quantified by various metrics, including the number of downloads, the…
Taxonomy
Topics: Computational Physics and Python Applications · Robotics and Automated Systems · Natural Language Processing Techniques
