OSS-Bench: Benchmark Generator for Coding LLMs
Yuancheng Jiang, Roland Yap, Zhenkai Liang

TL;DR
OSS-Bench is a scalable benchmark generator that evaluates LLMs on real-world open-source software tasks, focusing on code correctness, compilability, and memory safety, revealing insights into model behavior and security understanding.
Contribution
The paper introduces OSS-Bench, an automated framework for large-scale, live evaluation of coding LLMs that uses real OSS code and three metrics: compilability, functional correctness, and memory safety.
Findings
Profiles 17 diverse LLMs, revealing behavioral patterns
Identifies inconsistencies between model size and performance
Highlights LLMs' limited understanding of low-level security
Abstract
In light of the rapid adoption of AI coding assistants, LLM-assisted development has become increasingly prevalent, creating an urgent need for robust evaluation of generated code quality. Existing benchmarks often require extensive manual effort to create static datasets, rely on indirect or insufficiently challenging tasks, depend on non-scalable ground truth, or neglect critical low-level security evaluations, particularly memory-safety issues. In this work, we introduce OSS-Bench, a benchmark generator that automatically constructs large-scale, live evaluation tasks from real-world open-source software. OSS-Bench replaces functions with LLM-generated code and evaluates them using three natural metrics: compilability, functional correctness, and memory safety, leveraging robust signals like compilation failures, test-suite violations, and sanitizer alerts as ground truth. In our…
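The three metrics named in the abstract can be illustrated with a small scoring sketch. This is not the authors' implementation: the `TrialResult` record, the `score` function, and the choice to compute each rate over all attempted function replacements are assumptions made for illustration, since this summary does not give the paper's scoring formulas.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    """Outcome of swapping one LLM-written function into a project.

    Hypothetical record; the field names are illustrative assumptions.
    """
    function_name: str
    compiled: bool         # did the project still build with the replacement?
    tests_failed: int      # test-suite failures observed after the build
    sanitizer_alerts: int  # sanitizer reports raised while running the tests


def score(trials):
    """Return the fraction of trials that pass each of the three checks.

    Assumption: a replacement that does not compile counts as failing the
    other two checks as well, because it can be neither tested nor sanitized.
    """
    total = len(trials)
    if total == 0:
        raise ValueError("no trials to score")
    built = [t for t in trials if t.compiled]
    return {
        "compilability": len(built) / total,
        "functional_correctness": sum(t.tests_failed == 0 for t in built) / total,
        "memory_safety": sum(t.sanitizer_alerts == 0 for t in built) / total,
    }


# Made-up example data, not results from the paper.
example_trials = [
    TrialResult("parse_header", compiled=False, tests_failed=0, sanitizer_alerts=0),
    TrialResult("copy_buffer", compiled=True, tests_failed=2, sanitizer_alerts=0),
    TrialResult("hash_insert", compiled=True, tests_failed=0, sanitizer_alerts=1),
    TrialResult("str_trim", compiled=True, tests_failed=0, sanitizer_alerts=0),
]
example_scores = score(example_trials)
```

In this sketch the ground truth comes entirely from tool signals (build status, test failures, sanitizer reports), which mirrors the abstract's point that no hand-labelled reference answers are needed.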
Taxonomy
Topics: Advanced Data Storage Technologies · Digital Rights Management and Security · Algorithms and Data Compression
