AutoBaxBuilder: Bootstrapping Code Security Benchmarking
Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev

TL;DR
AutoBaxBuilder is an automated pipeline that rapidly generates security benchmarking tasks for code, reducing manual effort and enabling continuous evaluation of large language models in software security.
Contribution
It introduces an automated, efficient method for creating security benchmarks from scratch, addressing data contamination and scalability issues in prior work.
Findings
AutoBaxBuilder generates new security tasks in under 2 hours.
The pipeline aligns well with expert benchmarks and passes manual soundness checks.
It reduces the human effort of benchmark creation by a factor of 12.
Abstract
As large language models (LLMs) see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work showed that LLMs are prone to generating code with security vulnerabilities, highlighting that security is often overlooked. These insights were enabled by specialized benchmarks crafted by security experts through significant manual effort. However, benchmarks (i) inevitably end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, an automated pipeline that generates code security benchmarking tasks from scratch. It leverages the code-understanding capabilities of LLMs combined with robust reliability checks to construct…
Taxonomy
Topics: Software Engineering Research · Information and Cyber Security · Web Application Security Vulnerabilities
