OSS-Bench: Benchmark Generator for Coding LLMs

Yuancheng Jiang; Roland Yap; Zhenkai Liang

arXiv:2505.12331 · cs.SE · May 21, 2025

Open Access · 1 Repo

TL;DR

OSS-Bench is a scalable benchmark generator that evaluates LLMs on tasks drawn from real-world open-source software. It scores generated code on compilability, functional correctness, and memory safety, and the results expose differences in model behavior and in how well models handle low-level security.

Contribution

The paper introduces OSS-Bench, an automated framework for large-scale, live evaluation of coding LLMs that uses real OSS code and three metrics: compilability, functional correctness, and memory safety.

Findings

01. Profiles 17 diverse LLMs, revealing behavioral patterns

02. Identifies inconsistencies between model size and performance

03. Highlights LLMs' limited understanding of low-level security

Abstract

In light of the rapid adoption of AI coding assistants, LLM-assisted development has become increasingly prevalent, creating an urgent need for robust evaluation of generated code quality. Existing benchmarks often require extensive manual effort to create static datasets, rely on indirect or insufficiently challenging tasks, depend on non-scalable ground truth, or neglect critical low-level security evaluations, particularly memory-safety issues. In this work, we introduce OSS-Bench, a benchmark generator that automatically constructs large-scale, live evaluation tasks from real-world open-source software. OSS-Bench replaces functions with LLM-generated code and evaluates them using three natural metrics: compilability, functional correctness, and memory safety, leveraging robust signals like compilation failures, test-suite violations, and sanitizer alerts as ground truth. In our…
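The workflow described in the abstract (swap a function for LLM-generated code, then read compilation, test-suite, and sanitizer signals) can be illustrated with a minimal sketch. Everything named here is hypothetical: `splice_function`, `TrialResult`, and `summarize` are illustrative helpers, and the simple per-metric ratios are an assumption, not the scoring formula used by OSS-Bench.

```python
from dataclasses import dataclass
from typing import Dict, List


def splice_function(source: str, start_line: int, end_line: int, replacement: str) -> str:
    """Replace lines start_line..end_line (1-indexed, inclusive) with replacement.

    Hypothetical helper: the real tool may locate functions differently.
    """
    lines = source.splitlines()
    new_lines = lines[: start_line - 1] + replacement.splitlines() + lines[end_line:]
    return "\n".join(new_lines) + "\n"


@dataclass
class TrialResult:
    """Signals recorded for one replaced function (assumed shape)."""
    compiled: bool          # did the project build with the new function?
    tests_failed: int       # number of test-suite failures after the swap
    sanitizer_alerts: int   # number of sanitizer reports during the tests


def summarize(results: List[TrialResult]) -> Dict[str, float]:
    """Turn raw signals into three ratios over all trials (assumed scoring)."""
    total = len(results)
    if total == 0:
        raise ValueError("no trial results to summarize")
    compiled = [r for r in results if r.compiled]
    # Tests and sanitizers can only run on code that built successfully.
    passing = [r for r in compiled if r.tests_failed == 0]
    clean = [r for r in compiled if r.sanitizer_alerts == 0]
    return {
        "compilability": len(compiled) / total,
        "functional_correctness": len(passing) / total,
        "memory_safety": len(clean) / total,
    }


original = "int a;\nint f() {\n return 1;\n}\nint b;\n"
patched = splice_function(original, 2, 4, "int f() {\n return 2;\n}")

trials = [
    TrialResult(compiled=True, tests_failed=0, sanitizer_alerts=0),
    TrialResult(compiled=True, tests_failed=2, sanitizer_alerts=0),
    TrialResult(compiled=True, tests_failed=0, sanitizer_alerts=1),
    TrialResult(compiled=False, tests_failed=0, sanitizer_alerts=0),
]
scores = summarize(trials)
```

In this sketch a function that fails to compile counts against all three ratios, since neither the test suite nor a sanitizer can run on it. That is one possible design choice among several.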

Code & Models

Repositories

oss-bench/oss-bench (Official)

Taxonomy

Topics: Advanced Data Storage Technologies · Digital Rights Management and Security · Algorithms and Data Compression