The Distribution of Program Sizes and Its Implications: An Eclipse Case Study
Hongyu Zhang, Hee Beng Kuan Tan, Michele Marchesi

TL;DR
This study analyzes the distribution of program sizes in large software systems using Eclipse data, confirming a lognormal distribution and exploring its implications for size estimation and defect prediction.
Contribution
It replicates previous findings on size distribution and investigates how size distribution impacts size estimation and defect prediction in large Java systems.
Findings
Program sizes follow a lognormal distribution.
Large programs account for most defects.
Defects across programs follow a Weibull distribution when the programs are ranked by size.
Abstract
A large software system is often composed of many inter-related programs of different sizes. Using the public Eclipse dataset, we replicate our previous study on the distribution of program sizes. Our results confirm that program sizes follow the lognormal distribution. We also investigate the implications of the program size distribution for size estimation and quality prediction. We find that the nature of the size distribution can be used to estimate the size of a large Java system. We also find that a small percentage of the largest programs accounts for a large percentage of defects, and that the number of defects across programs follows the Weibull distribution when the programs are ranked by their sizes. Our results show that the distribution of program sizes is an important property for understanding large and complex software systems.
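To illustrate how a lognormal size distribution could support size estimation, here is a minimal sketch. It is not the authors' procedure, which the abstract does not describe. It assumes that program sizes (for example, lines of code per class) are lognormally distributed, fits the parameters mu and sigma from the logarithms of a sample, and uses the standard lognormal mean, exp(mu + sigma^2 / 2), multiplied by the number of programs. The function names `fit_lognormal` and `estimate_total_size` are hypothetical, and the data in the demo is synthetic, not Eclipse data.

```python
import math
import random
import statistics


def fit_lognormal(sizes):
    """Fit lognormal parameters from positive sizes.

    Returns (mu, sigma): the mean and population standard deviation
    of the natural logarithms of the sizes.
    """
    logs = [math.log(s) for s in sizes]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)
    return mu, sigma


def estimate_total_size(sample_sizes, n_programs):
    """Estimate total system size from a sample of program sizes.

    Assumes sizes are lognormal, so the expected size of one program
    is exp(mu + sigma^2 / 2); the total is that mean times n_programs.
    """
    mu, sigma = fit_lognormal(sample_sizes)
    mean_size = math.exp(mu + sigma ** 2 / 2)
    return n_programs * mean_size


if __name__ == "__main__":
    # Synthetic demo: parameters 4.0 and 1.0 are arbitrary assumptions.
    rng = random.Random(0)
    population = [rng.lognormvariate(4.0, 1.0) for _ in range(10_000)]
    sample = rng.sample(population, 500)

    estimate = estimate_total_size(sample, len(population))
    actual = sum(population)
    print(f"estimated total: {estimate:.0f}, actual total: {actual:.0f}")
```

In this sketch only a sample of program sizes and the total program count are needed; how close the estimate comes to the true total depends on how well the lognormal assumption holds for the system at hand.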
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Software Testing and Debugging Techniques
