PerfCurator: Curating a large-scale dataset of performance bug-related commits from public repositories
Md Abul Kalam Azad, Manoj Alexender, Matthew Alexender, Syed Salauddin, Mohammad Tariq, Foyzul Hassan, Probir Roy

TL;DR
PerfCurator is a scalable tool that uses a specialized BERT model to mine large-scale datasets of performance bug-related commits from open-source repositories, aiding research and mitigation strategies.
Contribution
It introduces PerfCurator and PcBERT-KD, enabling efficient large-scale collection of performance bug commits across multiple programming languages.
Findings
Achieved accuracy with PcBERT-KD comparable to 7-billion-parameter LLMs, at significantly lower computational overhead
Constructed a dataset with over 400K performance bug commits in multiple languages
Demonstrated dataset's effectiveness in improving bug detection systems
Abstract
Performance bugs challenge software development, degrading performance and wasting computational resources. Software developers invest substantial effort in addressing these issues. Curating these performance bugs can offer valuable insights to the software engineering research community and aid in developing new mitigation strategies. However, no large-scale open-source dataset of performance bugs is available. To bridge this gap, we propose PerfCurator, a repository miner that collects performance bug-related commits at scale. PerfCurator employs PcBERT-KD, a 125M-parameter BERT model trained to classify performance bug-related commits. Our evaluation shows that PcBERT-KD achieves accuracy comparable to 7-billion-parameter LLMs with significantly lower computational overhead, enabling cost-effective deployment on CPU clusters. Utilizing PcBERT-KD as the core component, we deployed…
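The mining step described in the abstract can be pictured as a filter over a stream of commits with a pluggable classifier. The sketch below is illustrative only and is not PerfCurator's implementation: the `Commit` type, the `curate` function, and the keyword-based `keyword_classifier` are hypothetical stand-ins, and it assumes the classifier sees only the commit message, which the abstract does not specify. In a real deployment the callable would wrap a trained model such as PcBERT-KD.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Commit:
    """Hypothetical minimal commit record (not PerfCurator's schema)."""
    repo: str
    sha: str
    message: str


# Placeholder heuristic standing in for a trained classifier.
# These keywords are an assumption made for illustration only.
PERF_HINTS = ("slow", "speed up", "performance", "latency", "memory leak", "optimiz")


def keyword_classifier(message: str) -> bool:
    """Return True if the message contains any performance-related hint."""
    text = message.lower()
    return any(hint in text for hint in PERF_HINTS)


def curate(
    commits: Iterable[Commit],
    is_perf_related: Callable[[str], bool],
) -> Iterator[Commit]:
    """Yield only the commits the classifier flags as performance-related."""
    for commit in commits:
        if is_perf_related(commit.message):
            yield commit


# Made-up sample data; repository names and hashes are not real.
sample = [
    Commit("example/repo", "a1b2c3", "Fix typo in README"),
    Commit("example/repo", "d4e5f6", "Speed up JSON parsing by caching the schema"),
    Commit("example/other", "0718ab", "Optimize query loop to avoid quadratic scan"),
]

selected = list(curate(sample, keyword_classifier))
for commit in selected:
    print(commit.repo, commit.sha, commit.message)
```

Passing the classifier in as a callable keeps the mining loop independent of the model, so a small CPU-friendly model can be swapped in without touching the loop.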
Taxonomy
Topics: Scientific Computing and Data Management · Software Engineering Research · Software System Performance and Reliability
