TL;DR
This paper challenges the assumption that extensive datasets and complex optimizers are necessary in Software Engineering, demonstrating that simple, lightweight methods often achieve near-optimal results with minimal data.
Contribution
It introduces the data-light challenge, formalizes labeling, proposes lightweight baselines, and provides empirical results showing when simple methods suffice in SE tasks.
Findings
Simple methods achieve over 90% of the best reported results with just a few dozen labels
Lightweight approaches perform as well as complex optimizers in many cases
Few samples are sufficient for rapid, cost-effective SE guidance
Abstract
Much of Software Engineering (SE) research assumes that progress depends on massive datasets and CPU-intensive optimizers. Yet has this assumption been rigorously tested? The counter-evidence presented in this paper suggests otherwise. For over 100 optimization tasks from recent SE papers (including software configuration, performance tuning, product line engineering, project health forecasting, defect prediction, software testing, software process and cost estimation, and cross-domain generalization datasets), even with just a few dozen labels, very simple methods (e.g., diversity sampling, a minimal Bayesian learner, its distance-based non-parametric variant, or random probes) achieve over 90% of the best reported results. Furthermore, these simple methods perform just as well as more complex state-of-the-art optimizers such as SMAC, TPE, and DEHB. While some tasks would require…
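To make the "random probes" baseline named in the abstract concrete, here is a minimal sketch under stated assumptions. It is not the paper's implementation: the function `random_probe`, the synthetic objective `toy_cost`, the candidate `pool`, and the budget of 30 labels are all hypothetical and chosen only for illustration. The idea is to label a few dozen randomly chosen candidates, keep the best one, and compare it with the optimum found by labeling everything.

```python
import random


def random_probe(pool, label, budget, seed=0):
    """Label `budget` randomly chosen candidates and return (score, candidate)
    for the best one found. Lower scores are treated as better."""
    rng = random.Random(seed)
    probes = rng.sample(pool, min(budget, len(pool)))
    # Each call to label() stands for one expensive labeling step.
    scored = [(label(x), x) for x in probes]
    return min(scored, key=lambda pair: pair[0])


def toy_cost(config):
    """Hypothetical objective: squared distance from an arbitrary target."""
    a, b = config
    return (a - 0.3) ** 2 + (b - 0.7) ** 2


# Hypothetical pool of 10,000 candidate configurations.
rng = random.Random(1)
pool = [(rng.random(), rng.random()) for _ in range(10_000)]

# Spend only 30 labels, then compare against labeling the whole pool.
best_cost, best_config = random_probe(pool, toy_cost, budget=30)
optimum = min(toy_cost(c) for c in pool)

print(f"best of 30 probes: {best_cost:.4f}, optimum over full pool: {optimum:.4f}")
```

How close the probed result gets to the optimum depends on the task, so this toy says nothing about the 90% figure reported in the abstract. It only shows the shape of a label-budgeted baseline against which a heavier optimizer could be compared.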
