TL;DR
This paper introduces LinuxData, a large-scale dataset of Linux kernel configurations across multiple versions, enabling advanced research in kernel configuration analysis, prediction, and evolution modeling.
Contribution
The paper provides the first comprehensive, publicly accessible dataset of Linux kernel configurations with detailed measurements, supporting machine learning and transfer learning research.
Findings
The dataset includes over 240,000 configurations from kernel versions 4.13 to 5.8, each labeled with its compilation outcome and binary size.
Enables research in feature selection and prediction models.
Facilitates reproducibility and new insights into kernel configuration evolution.
Abstract
Configuring the Linux kernel to meet specific requirements, such as binary size, is highly challenging due to its immense complexity, with over 15,000 interdependent options evolving rapidly across different versions. Although several studies have explored sampling strategies and machine learning methods to understand and predict the impact of configuration options, the literature still lacks a comprehensive and large-scale dataset encompassing multiple kernel versions along with detailed quantitative measurements. To bridge this gap, we introduce LinuxData, an accessible collection of kernel configurations spanning several kernel releases, specifically from versions 4.13 to 5.8. This dataset, gathered through automated tools and build processes, comprises over 240,000 kernel configurations systematically labeled with compilation outcomes and binary sizes. By providing detailed records…
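To make the idea of a labeled configuration concrete, here is a minimal sketch of how a single kernel `.config` file could be turned into a numeric feature vector for a prediction model. It relies only on the standard `.config` syntax (`CONFIG_X=y` and `# CONFIG_X is not set`). The function names `parse_kconfig` and `encode_tristate`, the 0/1/2 encoding, and the sample text are assumptions made for illustration; the summary above does not say how LinuxData stores or encodes its configurations.

```python
import re

# "CONFIG_FOO=value" lines in a kernel .config file.
_SET = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")
# "# CONFIG_FOO is not set" lines, which mark a disabled option.
_UNSET = re.compile(r"^# (CONFIG_[A-Za-z0-9_]+) is not set$")


def parse_kconfig(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            # Drop the surrounding quotes of string-valued options.
            options[m.group(1)] = m.group(2).strip('"')
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def encode_tristate(options, feature_names):
    """Hypothetical encoding: n or missing -> 0, m -> 1, y -> 2.

    Non-tristate values (strings, integers) are encoded as 0 in this sketch.
    """
    codes = {"n": 0, "m": 1, "y": 2}
    return [codes.get(options.get(name, "n"), 0) for name in feature_names]


# Illustrative sample, not taken from the dataset.
SAMPLE = """\
# Automatically generated file; DO NOT EDIT.
CONFIG_64BIT=y
CONFIG_MODULES=y
# CONFIG_DEBUG_INFO is not set
CONFIG_EXT4_FS=m
CONFIG_DEFAULT_HOSTNAME="(none)"
"""

FEATURES = ["CONFIG_64BIT", "CONFIG_MODULES", "CONFIG_DEBUG_INFO",
            "CONFIG_EXT4_FS", "CONFIG_KASAN"]

if __name__ == "__main__":
    parsed = parse_kconfig(SAMPLE)
    print(encode_tristate(parsed, FEATURES))
```

A vector like this, paired with a measured label such as binary size, is the kind of input a feature selection or regression method would consume. With more than 15,000 options per configuration, a sparse representation would likely be preferable in practice.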
